---
name: Strings and encodings
---

# Strings and encodings

A Perl string holds a sequence of integers. What those integers *mean* is up to you. In a text string they are Unicode code points; in a binary string they are bytes. Perl does not stamp "this is text" on a scalar: you carry that knowledge with the data, and the boundary between the two worlds is drawn by `encode` and `decode`.

## Text in source code: `use utf8`

By default, Perl reads your source file as a sequence of bytes. In a file saved as UTF-8, a string literal containing `é` is therefore the two-byte string `"\xc3\xa9"`, not the one-character string `"\x{e9}"`. That is almost never what you want when you typed `é` in an editor set to UTF-8.

`use utf8` tells the parser that the source file is UTF-8 and that literals should be decoded as they are read:

```perl
use utf8;
my $greeting = "café";
length $greeting;    # 4 (characters)
```

Without `use utf8`, the same line produces:

```perl
my $greeting = "café";
length $greeting;    # 5 (bytes)
```

Rule of thumb: if your source file contains any non-ASCII character, write `use utf8` at the top and save the file as UTF-8. The pragma only governs the source file; it has nothing to do with I/O.

## The internal representation

Perl keeps text strings in a representation of its own choosing. You do not need to know what it is, and you must not depend on it. The only guarantees are:

- A text string is a sequence of code points. [`length`](../../p5/core/perlfunc/length) counts them, [`substr`](../../p5/core/perlfunc/substr) indexes them, [`chr`](../../p5/core/perlfunc/chr) and [`ord`](../../p5/core/perlfunc/ord) map between characters and their code points.
- A binary string is a sequence of bytes. The same functions now count and index bytes.

The `utf8` pragma controls the *source*. There is also a `utf8::` namespace of functions for probing and forcing the internal flag. They exist, they are occasionally useful, and in normal application code you should not need them. Stick to `encode` and `decode`.
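
The two guarantees above can be seen side by side in a minimal sketch. The variable names are illustrative, and both strings are written with escapes so the result does not depend on how the file is saved or on `use utf8`:

```perl
use strict;
use warnings;

# Text: the word "café" as four code points.
my $text  = "caf\x{e9}";
# Binary: the same word as five UTF-8 bytes.
my $bytes = "caf\xc3\xa9";

my $text_len  = length $text;     # counts code points
my $bytes_len = length $bytes;    # counts bytes

my $text_4th  = ord(substr($text,  3, 1));    # code point of the 4th character
my $bytes_4th = ord(substr($bytes, 3, 1));    # value of the 4th byte

# chr accepts any code point, not only 0..255.
my $dash = chr(0x2014);

printf "text:  length=%d, 4th=U+%04X\n", $text_len,  $text_4th;
printf "bytes: length=%d, 4th=0x%02X\n", $bytes_len, $bytes_4th;
printf "dash:  length=%d, ord=U+%04X\n", length($dash), ord($dash);
```

The same `length`, `substr`, and `ord` calls report 4 and `U+00E9` for the text string but 5 and `0xC3` for the byte string; nothing in the scalars themselves says which interpretation is the intended one.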

## The `Encode` module

The standard way to convert between text and bytes is the `Encode` module:

```perl
use Encode qw(encode decode);

my $text  = "caf\x{e9}";                # text: four code points
my $bytes = encode("UTF-8", $text);     # text → bytes
my $back  = decode("UTF-8", $bytes);    # bytes → text
```

The first argument names the encoding. UTF-8 is the obvious default for new code, but `Encode` supports the `ISO-8859-*` family, every common Windows code page, Shift-JIS, EUC-JP, EUC-KR, GB2312, GBK, Big5, and dozens more.

Decoding is only reliable if you know which encoding the bytes are in. Nothing in the bytes themselves tells you; that information travels in a `Content-Type` header, a file's BOM, a protocol's metadata field, or a convention you have agreed with the producer. If you truly do not know, guess UTF-8 first, then fall back on `ISO-8859-1`, which at least never fails (every byte is a valid code point in that encoding). Trying UTF-8 first only tells you something if you ask `decode` to fail on malformed input, for example by passing `Encode::FB_CROAK` as the third argument; by default it silently substitutes U+FFFD for bytes it cannot decode.

### Lossy encodings

Not every Unicode character survives every encoding. Converting `"caf\x{e9}"` to `ISO-8859-1` works: `é` has code point `0xe9`, which that encoding can hold. Converting an em-dash `\x{2014}` to `ISO-8859-1` does not: the character has no place in the target, and `encode` replaces it with a substitution marker (`?` for this encoding) by default.

```perl
use Encode qw(encode);

my $s      = "em\x{2014}dash";
my $latin1 = encode("ISO-8859-1", $s);    # "em?dash"
```

Pass a third argument to `encode` to change that behaviour; see the `Encode` module documentation for the full list of `Encode::FB_*` constants. For most application code, the default substitution is the right compromise.

## A complete round-trip

Putting decoding, processing, and encoding together:

```perl
use utf8;
use Encode qw(encode decode);

# bytes arrive from somewhere: HTTP, a file, a socket
my $input_bytes = "caf\xc3\xa9\n";

# step 1: decode to text
my $text = decode("UTF-8", $input_bytes);    # "café\n"

# step 2: process as text
chomp $text;
my $shouty = uc $text;                       # "CAFÉ"

# step 3: encode for output
my $output_bytes = encode("UTF-8", $shouty . "\n");
```

The text layer is short, just one `uc` call, but that is the layer where [`length`](../../p5/core/perlfunc/length), regexes, case conversion, and [`chomp`](../../p5/core/perlfunc/chomp) all work in the units the program means.

## When you want I/O to do this automatically

Doing `decode` on every read and `encode` on every write quickly becomes tedious. The next chapter, [I/O](io), shows how to push the conversion into the filehandle itself with the `:encoding` layer, so ordinary [`print`](../../p5/core/perlfunc/print) and [`readline`](../../p5/core/perlfunc/readline) already speak text.