Strings and encodings
A Perl string holds a sequence of integers. What those integers mean is up to you. In a text string they are Unicode code points; in a binary string they are bytes. Perl does not stamp "this is text" on a scalar; you carry that knowledge with the data, and the boundary between the two worlds is drawn by encode and decode.
Text in source code: use utf8
By default, Perl reads your source file as a sequence of bytes. A string literal containing é is a two-byte string "\xc3\xa9", not the one-character string "\x{e9}". That is almost never what you want when you typed é in an editor set to UTF-8.
use utf8 tells the parser that the source file is UTF-8 and that literals should be decoded as they are read:
use utf8;
my $greeting = "café";
length $greeting; # 4 — characters
Without use utf8, the same line produces:
my $greeting = "café";
length $greeting; # 5 — bytes
Rule of thumb: if your source file contains any non-ASCII character, write use utf8 at the top and save the file as UTF-8. The pragma only governs the source file; it has nothing to do with I/O.
The internal representation
Perl keeps text strings in a representation of its own choosing. You do not need to know what it is, and you must not depend on it. The only guarantee is:
A text string is a sequence of code points. length counts them, substr indexes them, chr and ord map between characters and their code points.
A binary string is a sequence of bytes. The same functions now count and index bytes.
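As a small illustration of that guarantee, the sketch below holds the same word once as text and once as UTF-8 bytes and applies the same functions to both. The variable names are made up for the example.

```perl
use strict;
use warnings;

# The same word, once as text and once as UTF-8 bytes.
my $text  = "caf\x{e9}";        # 4 code points: c a f U+00E9
my $bytes = "caf\xc3\xa9";      # 5 bytes: c a f 0xC3 0xA9

my $text_len  = length $text;             # counts code points
my $byte_len  = length $bytes;            # counts bytes
my $last_char = ord substr($text, -1);    # code point of the last character
my $last_byte = ord substr($bytes, -1);   # value of the last byte

printf "text:  length %d, last unit 0x%02X\n", $text_len, $last_char;
printf "bytes: length %d, last unit 0x%02X\n", $byte_len, $last_byte;
```

Nothing in the two scalars says which one is text. Only the program's own bookkeeping tells them apart.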
The utf8 pragma controls the source. There is also a utf8:: namespace of functions for probing and forcing the internal flag — they exist, they are occasionally useful, and in normal application code you should not need them. Stick to encode and decode.
The Encode module
The standard way to convert between text and bytes is the Encode module:
use Encode qw(encode decode);
my $text = "caf\x{e9}"; # a text string: 4 code points
my $bytes = encode("UTF-8", $text); # text → bytes
my $decoded = decode("UTF-8", $bytes); # bytes → text, equal to $text
The first argument names the encoding. UTF-8 is the obvious default for new code, but Encode supports the ISO-8859-* family, every common Windows code page, Shift-JIS, EUC-JP, EUC-KR, GB2312, GBK, Big5, and dozens more.
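To show that the encoding name is the only thing that changes between encodings, here is a brief sketch that decodes a Windows-1252 byte string and re-encodes the resulting text two ways. The sample data and variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Byte 0x80 is the euro sign in Windows-1252.
my $price = decode("cp1252", "\x80 5");        # text: U+20AC, space, "5"

# The same text written out in two other encodings.
my $utf8   = encode("UTF-8", $price);          # euro becomes 3 bytes
my $latin9 = encode("ISO-8859-15", $price);    # euro becomes the byte 0xA4

printf "code point U+%04X, %d UTF-8 bytes, %d Latin-9 bytes\n",
    ord($price), length($utf8), length($latin9);
```

The text string in the middle is the same regardless of where the bytes came from or where they are going.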
Decoding is only reliable if you know which encoding the bytes are in. Nothing in the bytes themselves tells you: that information travels in a Content-Type header, a file’s BOM, a protocol’s metadata field, or a convention you have agreed with the producer. If you truly do not know, try a strict UTF-8 decode first (one that fails on malformed input instead of substituting), then fall back on ISO-8859-1, which at least never fails (every byte is a valid code point in that encoding).
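One way to sketch the "UTF-8 first, ISO-8859-1 as a fallback" guess is shown below. The helper name decode_guess is hypothetical. It relies on the Encode::FB_CROAK check value so that malformed UTF-8 raises an exception, and on Encode::LEAVE_SRC so that the input is not modified during the attempt.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8, fall back to ISO-8859-1.
sub decode_guess {
    my ($bytes) = @_;

    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
    # decode from modifying $bytes while it checks.
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;

    # Every byte maps to a code point here, so this cannot fail.
    return decode("ISO-8859-1", $bytes);
}

my $from_utf8   = decode_guess("caf\xc3\xa9");   # valid UTF-8
my $from_latin1 = decode_guess("caf\xe9");       # lone 0xE9 is not valid UTF-8
```

Both calls return the same four-character text string. A guess like this can still be wrong: bytes in some other single-byte encoding will decode "successfully" as ISO-8859-1 and yield the wrong characters.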
Lossy encodings
Not every Unicode character survives every encoding. Converting "caf\x{e9}" to ISO-8859-1 works — é has code point 0xe9 which that encoding can hold. Converting an em-dash \x{2014} to ISO-8859-1 does not — the character has no place in the target, and encode replaces it with a substitution marker by default.
use Encode qw(encode);
my $s = "em\x{2014}dash";
my $b = encode("ISO-8859-1", $s); # "em?dash"
Pass a third argument to encode to change that behaviour — see the Encode module documentation for the full list of Encode::FB_* constants. For most application code, the default substitution is the right compromise.
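As a rough sketch of what the third argument changes, the example below encodes the same string with the default behaviour, with Encode::FB_PERLQQ, and with Encode::FB_CROAK. Encode::LEAVE_SRC is added so the source string is left intact between calls; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "em\x{2014}dash";

# Default: substitute a "?" for the unmappable character.
my $default = encode("ISO-8859-1", $s);

# FB_PERLQQ: keep the information as a Perl-style \x{...} escape.
my $escaped = encode("ISO-8859-1", $s,
                     Encode::FB_PERLQQ | Encode::LEAVE_SRC);

# FB_CROAK: refuse to lose data and die instead.
my $strict = eval {
    encode("ISO-8859-1", $s, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $error = $@;   # the exception text explains what did not map

print "default: $default\n";
print "escaped: $escaped\n";
print "strict failed: $error" unless defined $strict;
```

FB_CROAK is the option to reach for when silent data loss would be worse than a failed write.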
A complete round-trip
Putting decoding, processing, and encoding together:
use utf8;
use Encode qw(encode decode);
# bytes arrive from somewhere — HTTP, a file, a socket
my $input_bytes = "caf\xc3\xa9\n";
# step 1: decode to text
my $text = decode("UTF-8", $input_bytes); # "café\n"
# step 2: process as text
chomp $text;
my $shouty = uc $text; # "CAFÉ"
# step 3: encode for output
my $output_bytes = encode("UTF-8", $shouty . "\n");
The text layer is short — one uc call — but that is the layer where length, regex, case conversion, and chomp all work in the units the program means.
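To see why the decode step matters, the following sketch runs the same operations on the undecoded bytes and on the decoded text. It assumes the default string semantics (no locale and no unicode_strings feature enabled); the variable names are invented for the comparison.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $input_bytes = "caf\xc3\xa9";

# Processing the undecoded bytes: the units are bytes, not characters.
my $byte_len   = length $input_bytes;   # counts the two bytes of é separately
my $byte_upper = uc $input_bytes;       # the bytes 0xC3 0xA9 pass through unchanged

# Processing decoded text: the units are characters.
my $text       = decode("UTF-8", $input_bytes);
my $text_len   = length $text;          # é counts once
my $text_upper = uc $text;              # é becomes É (U+00C9)

printf "bytes: length %d\n", $byte_len;
printf "text:  length %d\n", $text_len;
```

On the byte string, uc upper-cases only the ASCII letters, so the result is not the UTF-8 encoding of "CAFÉ".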
When you want I/O to do this automatically
Doing decode on every read and encode on every write quickly becomes tedious. The next chapter — I/O — shows how to push the conversion into the filehandle itself with the :encoding layer, so ordinary print and readline already speak text.
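As a brief preview, and only a sketch, the round-trip from earlier could look like this with the :encoding layer. In-memory filehandles stand in for a real file or socket so the example is self-contained; that choice is an assumption of this example, not something the chapter prescribes.

```perl
use strict;
use warnings;

my $input_bytes = "caf\xc3\xa9\n";

# Reading through the layer: readline returns decoded text.
open my $in, '<:encoding(UTF-8)', \$input_bytes or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# Writing through the layer: print accepts text and emits bytes.
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open: $!";
print {$out} uc($line), "\n";
close $out;
```

There are no explicit encode or decode calls left; the filehandles mark the boundary between bytes and text.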