Strings and encodings#
A Perl string holds a sequence of integers. What those integers
mean is up to you. In a text string they are Unicode code points;
in a binary string they are bytes. Perl does not stamp “this is
text” on a scalar — you carry that knowledge with the data, and the
boundary between the two worlds is drawn by encode and decode.
Text in source code: use utf8#
By default, Perl reads your source file as a sequence of bytes. A
string literal containing é is a two-byte string "\xc3\xa9", not
the one-character string "\x{e9}". That is almost never what you
want when you typed é in an editor set to UTF-8.
use utf8 tells the parser that the source file is UTF-8 and that
literals should be decoded as they are read:
use utf8;
my $greeting = "café";
length $greeting; # 4 — characters
Without use utf8, the same line produces:
my $greeting = "café";
length $greeting; # 5 — bytes
Rule of thumb: if your source file contains any non-ASCII character,
write use utf8 at the top and save the file as UTF-8. The pragma
only governs the source file; it has nothing to do with I/O.
The internal representation#
Perl keeps text strings in a representation of its own choosing. You do not need to know what it is, and you must not depend on it. The only guarantee is:
- A text string is a sequence of code points. length counts them, substr indexes them, chr and ord map between characters and their code points.
- A binary string is a sequence of bytes. The same functions now count and index bytes.
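To make the difference concrete, here is a small sketch that applies the same functions to one word held both ways. The variable names are illustrative, and the accented letter is written as an escape so the result does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Text string: four code points (c, a, f, U+00E9).
my $text = "caf\x{e9}";

# Binary string: the same word as UTF-8 bytes (U+00E9 becomes 0xC3 0xA9).
my $bytes = encode("UTF-8", $text);

my $text_len  = length $text;                # counts code points
my $byte_len  = length $bytes;               # counts bytes
my $last_char = ord substr($text, 3, 1);     # code point of the 4th character
my $last_byte = ord substr($bytes, 4, 1);    # value of the 5th byte

printf "text:  length %d, last unit 0x%X\n", $text_len, $last_char;
printf "bytes: length %d, last unit 0x%X\n", $byte_len, $last_byte;
```

The text string reports length 4 and a last unit of 0xE9; the byte string reports length 5 and a last unit of 0xA9.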
The utf8 pragma controls the source. There is also a utf8::
namespace of functions for probing and forcing the internal flag —
they exist, they are occasionally useful, and in normal application
code you should not need them. Stick to encode and decode.
The Encode module#
The standard way to convert between text and bytes is the Encode
module:
use Encode qw(encode decode);
my $bytes = encode("UTF-8", $text); # text → bytes
my $text = decode("UTF-8", $bytes); # bytes → text
The first argument names the encoding. UTF-8 is the obvious default
for new code, but Encode supports the ISO-8859-* family, every
common Windows code page, Shift-JIS, EUC-JP, EUC-KR, GB2312, GBK,
Big5, and dozens more.
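As an illustration of the first argument at work, the sketch below transcodes a word from ISO-8859-1 to UTF-8 by going through text. The input bytes are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The word as it might arrive from a legacy Latin-1 source:
# the accented letter is the single byte 0xE9.
my $latin1_bytes = "caf\xe9";

# bytes -> text, naming the encoding the bytes are actually in
my $text = decode("ISO-8859-1", $latin1_bytes);

# text -> bytes, in the encoding the consumer expects
my $utf8_bytes = encode("UTF-8", $text);

printf "Latin-1: %d bytes, UTF-8: %d bytes, text: %d characters\n",
    length $latin1_bytes, length $utf8_bytes, length $text;
```

The text in the middle is the same four characters whichever encoding it came from; only the byte forms differ in length (4 bytes in Latin-1, 5 in UTF-8).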
Decoding is only reliable if you know which encoding the bytes are
in. The bytes themselves rarely tell you: that information
travels in a Content-Type header, a file’s BOM, a protocol’s
metadata field, or a convention you have agreed with the producer.
If you truly do not know, try UTF-8 first with a check that fails on
malformed input (by default, decode substitutes a replacement
character instead of failing), then fall back on ISO-8859-1, which
at least never fails (every byte is a valid code point in that
encoding).
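One way to sketch that fallback is shown below. The helper name decode_guess is hypothetical, not part of Encode; the sketch relies on the Encode::FB_CROAK check value, which makes decode die on malformed input, and on Encode::LEAVE_SRC, which keeps the input intact for the second attempt.

```perl
use strict;
use warnings;
use Encode qw(decode);

# decode_guess is a hypothetical helper for this example.
sub decode_guess {
    my ($bytes) = @_;

    # Strict attempt: die on malformed UTF-8 rather than substituting.
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    # Fallback: every byte maps to a code point, so this cannot fail.
    return decode("ISO-8859-1", $bytes);
}

my $from_utf8   = decode_guess("caf\xc3\xa9");   # valid UTF-8
my $from_latin1 = decode_guess("caf\xe9");       # not valid UTF-8

print "same text\n" if $from_utf8 eq $from_latin1;
```

Both calls yield the same four-character text. Keep in mind that this is still a guess: bytes in some other single-byte encoding will decode through the fallback without error, but to the wrong characters.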
Lossy encodings#
Not every Unicode character survives every encoding. Converting
"caf\x{e9}" to ISO-8859-1 works — é has code point 0xe9
which that encoding can hold. Converting an em-dash \x{2014} to
ISO-8859-1 does not — the character has no place in the target,
and encode replaces it with a substitution marker by default.
use Encode qw(encode);
my $s = "em\x{2014}dash";
my $b = encode("ISO-8859-1", $s); # "em?dash"
Pass a third argument to encode to change that behaviour — see the
Encode module documentation for the full list of Encode::FB_*
constants. For most application code, the default substitution is
the right compromise.
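For comparison, here is a brief sketch of two of those check values next to the default: Encode::FB_HTMLCREF, which writes a numeric character reference, and Encode::FB_CROAK, which refuses to lose data. Encode::LEAVE_SRC is added so that encode does not modify $s in place.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "em\x{2014}dash";

# Default: the unmappable character becomes a substitution marker.
my $default = encode("ISO-8859-1", $s);

# HTML character reference instead of "?".
my $html = encode("ISO-8859-1", $s,
                  Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());

# Strict: die rather than lose the character.
my $strict = eval {
    encode("ISO-8859-1", $s, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $refused = !defined $strict;

print "default: $default\n";    # em?dash
print "html:    $html\n";       # em&#8212;dash
print "strict:  ", ($refused ? "refused" : $strict), "\n";
```

Which one fits depends on the destination: a reference is only useful if the consumer interprets HTML, and a strict failure is only useful if the caller is prepared to handle it.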
A complete round-trip#
Putting decoding, processing, and encoding together:
use utf8;
use Encode qw(encode decode);
# bytes arrive from somewhere — HTTP, a file, a socket
my $input_bytes = "caf\xc3\xa9\n";
# step 1: decode to text
my $text = decode("UTF-8", $input_bytes); # "café\n"
# step 2: process as text
chomp $text;
my $shouty = uc $text; # "CAFÉ"
# step 3: encode for output
my $output_bytes = encode("UTF-8", $shouty . "\n");
The text layer is short (one uc call), but that is the layer
where length, regex matching, case conversion, and chomp all work in
the units the program means.
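To see why the decode step matters, the following sketch runs uc on the raw bytes and on the decoded text. It assumes the same input as above, without the trailing newline.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $input_bytes = "caf\xc3\xa9";

# Skipping the decode step: uc sees five bytes, and the two bytes
# that make up the accented letter are left as they were.
my $wrong = uc($input_bytes);

# Going through text: uc sees four characters and upper-cases U+00E9
# to U+00C9, which encodes to 0xC3 0x89.
my $right = encode("UTF-8", uc(decode("UTF-8", $input_bytes)));

# Print both results as hex so the difference is visible on any terminal.
printf "bytes only:   %s\n", join " ", map { sprintf "%02X", ord } split //, $wrong;
printf "through text: %s\n", join " ", map { sprintf "%02X", ord } split //, $right;
```

The first line ends in C3 A9 (still a lower-case é), the second in C3 89 (É).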
When you want I/O to do this automatically#
Doing decode on every read and encode on every write quickly
becomes tedious. The next chapter, I/O, shows how to push
the conversion into the filehandle itself with the :encoding
layer, so ordinary print and readline already speak text.
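As a brief preview, and only a sketch, the example below uses an in-memory filehandle so that it needs no files on disk. The in-memory target is a choice made for this illustration; a real program would open a path instead.

```perl
use strict;
use warnings;

# Write text through a handle that encodes to UTF-8 on the way out.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} "caf\x{e9}\n";
close($out) or die "close: $!";

# $buffer now holds bytes: the accented letter takes two of them.
my $byte_count = length $buffer;

# Read the bytes back through a handle that decodes from UTF-8.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $line = readline($in);
close($in);
chomp $line;

printf "%d bytes stored, %d characters read back\n",
    $byte_count, length $line;
```

Six bytes are stored (five for the word plus the newline) and four characters come back after chomp, with no explicit encode or decode call in the program.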