Strings and encodings

A Perl string holds a sequence of integers. What those integers mean is up to you. In a text string they are Unicode code points; in a binary string they are bytes. Perl does not stamp «this is text» on a scalar - you carry that knowledge with the data, and the boundary between the two worlds is drawn by encode and decode.

Text in source code: `use utf8`

By default, Perl reads your source file as a sequence of bytes. A string literal containing é is a two-byte string "\xc3\xa9", not the one-character string "\x{e9}". That is almost never what you want when you typed é in an editor set to UTF-8.

use utf8 tells the parser that the source file is UTF-8 and that literals should be decoded as they are read:

use utf8;
my $greeting = "café";
print length $greeting;           # 4  - characters

Without use utf8, the same line produces:

my $greeting = "café";
print length $greeting;           # 5  - bytes

Rule of thumb: if your source file contains any non-ASCII character, write use utf8 at the top and save the file as UTF-8. The pragma only governs the source file; it has nothing to do with I/O.

The internal representation

Perl keeps text strings in a representation of its own choosing. You do not need to know what it is, and you must not depend on it. The only guarantees are:

A text string is a sequence of code points. length counts them, substr indexes them, chr and ord map between characters and their code points.
A binary string is a sequence of bytes. The same functions now count and index bytes.
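A minimal sketch of those two guarantees, applying the same functions to a text string and to its UTF-8 encoding. The variable names are illustrative only, and the string is written with escapes so the result does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Text string: one element per code point.
my $text  = "caf\x{e9}";
# Binary string: the UTF-8 encoding of the same text.
my $bytes = encode("UTF-8", $text);

my $text_len  = length $text;              # counts code points
my $bytes_len = length $bytes;             # counts bytes
my $last_char = ord substr($text,  -1);    # code point of the final character
my $last_byte = ord substr($bytes, -1);    # value of the final byte

printf "text:  length %d, last element 0x%02x\n", $text_len,  $last_char;
printf "bytes: length %d, last element 0x%02x\n", $bytes_len, $last_byte;
```

The text string reports length 4 and a last element of 0xe9; the byte string reports length 5 and a last element of 0xa9, the second byte of the encoded é.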

The utf8 pragma controls the source. There is also a utf8:: namespace of functions for probing and forcing the internal flag - they exist, they are occasionally useful, and in normal application code you should not need them. Stick to encode and decode.

The `Encode` module

The standard way to convert between text and bytes is the Encode module:

use Encode qw(encode decode);

my $text  = "caf\x{e9}";                  # a text string
my $bytes = encode("UTF-8", $text);       # text  → bytes
my $again = decode("UTF-8", $bytes);      # bytes → text

The first argument names the encoding. UTF-8 is the obvious default for new code, but Encode supports the ISO-8859-* family, every common Windows code page, Shift-JIS, EUC-JP, EUC-KR, GB2312, GBK, Big5, and dozens more.
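To illustrate how the encoding name changes the bytes, the sketch below encodes the same text as UTF-8 and as ISO-8859-1 and prints each result in hex. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "caf\x{e9}";

my $utf8   = encode("UTF-8",      $text);   # é becomes two bytes
my $latin1 = encode("ISO-8859-1", $text);   # é becomes one byte

my $utf8_hex   = unpack "H*", $utf8;
my $latin1_hex = unpack "H*", $latin1;

print "UTF-8:      $utf8_hex\n";
print "ISO-8859-1: $latin1_hex\n";

# Decoding each byte string with its own encoding name recovers the same text.
my $same = decode("UTF-8", $utf8) eq decode("ISO-8859-1", $latin1);
print "same text after decoding: ", ($same ? "yes" : "no"), "\n";
```

The byte strings differ (636166c3a9 against 636166e9), yet both decode to the same four-character text, provided each is decoded with the encoding it was written in.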

Decoding is only reliable if you know which encoding the bytes are in. Nothing in the bytes themselves tells you - that information travels in a Content-Type header, a file’s BOM, a protocol’s metadata field, or a convention you have agreed with the producer. If you truly do not know, try UTF-8 first, then fall back to ISO-8859-1, which at least never fails (every byte value maps to a code point in that encoding). Be aware that decode does not reject malformed input by default - it substitutes U+FFFD - so the UTF-8 attempt needs a strict check such as Encode::FB_CROAK to notice a wrong guess.
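A minimal sketch of the guess-then-fall-back strategy. It passes Encode::FB_CROAK so that malformed UTF-8 raises an exception the fallback can catch. The helper name decode_guess is hypothetical, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# decode_guess is a hypothetical helper, not an Encode function.
sub decode_guess {
    my ($bytes) = @_;

    # FB_CROAK makes decode die on malformed input; LEAVE_SRC stops
    # decode from modifying $bytes while it checks.
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;

    # Every byte value maps to a code point in ISO-8859-1.
    return decode("ISO-8859-1", $bytes);
}

my $from_utf8   = decode_guess("na\xc3\xafve");   # valid UTF-8
my $from_latin1 = decode_guess("na\xefve");       # not valid UTF-8

printf "%d %d\n", length $from_utf8, length $from_latin1;
```

Both calls return the same five-character text. The fallback is still a guess: bytes written in some other single-byte encoding will decode without error but to the wrong characters.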

Lossy encodings

Not every Unicode character survives every encoding. Converting "caf\x{e9}" to ISO-8859-1 works - é has code point 0xe9 which that encoding can hold. Converting an em-dash \x{2014} to ISO-8859-1 does not - the character has no place in the target, and encode replaces it with a substitution marker by default.

use Encode qw(encode);
my $s = "em\x{2014}dash";
my $bytes = encode("ISO-8859-1", $s);      # "em?dash"  ($b is reserved for sort)

Pass a third argument to encode to change that behaviour - see the Encode module documentation for the full list of Encode::FB_* constants. For most application code, the default substitution is the right compromise.
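As an illustration of the third argument, the sketch below encodes the same string with the default behaviour, with Encode::FB_PERLQQ, and with Encode::FB_CROAK. Encode::LEAVE_SRC is added so the source string is not modified. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "em\x{2014}dash";

# Default: the unmappable character becomes a substitution character.
my $default = encode("ISO-8859-1", $s);

# FB_PERLQQ: insert a Perl-style escape in place of the character.
my $escaped = encode("ISO-8859-1", $s, Encode::FB_PERLQQ | Encode::LEAVE_SRC);

# FB_CROAK: refuse to lose data and die instead.
my $ok = eval {
    encode("ISO-8859-1", $s, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
my $error = $ok ? "" : $@;

print "default: $default\n";
print "perlqq:  $escaped\n";
print "croak:   ", ($ok ? "no error" : "died"), "\n";
```

FB_CROAK is the choice when silent data loss is unacceptable; FB_PERLQQ keeps the output readable while making the lost character visible.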

A complete round-trip

Putting decoding, processing, and encoding together:

use utf8;
use Encode qw(encode decode);

# bytes arrive from somewhere - HTTP, a file, a socket
my $input_bytes = "caf\xc3\xa9\n";

# step 1: decode to text
my $text = decode("UTF-8", $input_bytes);  # "café\n"

# step 2: process as text
chomp $text;
my $shouty = uc $text;                     # "CAFÉ"

# step 3: encode for output
my $output_bytes = encode("UTF-8", $shouty . "\n");

The text layer is short - one uc call - but that is the layer where length, regex, case conversion, and chomp all work in the units the program means.
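To see why the decode step matters, the sketch below applies uc directly to the undecoded bytes and compares the result with the decode, process, encode route. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $input_bytes = "caf\xc3\xa9";

# Wrong: uc sees five bytes and has no idea that \xc3\xa9 is one letter,
# so the é is left in lower case.
my $wrong = uc $input_bytes;

# Right: decode, process as text, encode.
my $text  = decode("UTF-8", $input_bytes);
my $right = encode("UTF-8", uc($text));

printf "bytes: %d, characters: %d\n", length $input_bytes, length $text;
printf "wrong: %s\n", unpack "H*", $wrong;
printf "right: %s\n", unpack "H*", $right;
```

The byte-level result ends in c3 a9, still a lower-case é, while the text-level result ends in c3 89, the UTF-8 encoding of É.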

When you want I/O to do this automatically

Doing decode on every read and encode on every write quickly becomes tedious. The next chapter - I/O - shows how to push the conversion into the filehandle itself with the :encoding layer, so ordinary print and readline already speak text.
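As a brief preview, and only a sketch, the example below attaches the :encoding layer to an in-memory filehandle so it runs without touching the filesystem. The in-memory handle is an assumption made for the demonstration; the next chapter covers real files.

```perl
use strict;
use warnings;

# Writing: print receives text, the layer turns it into UTF-8 bytes.
open my $out, ">:encoding(UTF-8)", \my $buffer or die "open for write: $!";
print {$out} "caf\x{e9}\n";
close $out or die "close: $!";

# Reading: the layer decodes the bytes, readline returns text.
open my $in, "<:encoding(UTF-8)", \$buffer or die "open for read: $!";
my $line = <$in>;
close $in;

printf "stored bytes: %d, characters read: %d\n", length $buffer, length $line;
```

The buffer holds six bytes, while the line read back is five characters long, newline included.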