---
name: Unicode in I/O
---

# I/O

The filehandle is the boundary between your program's text world and the outside world of bytes. Getting Unicode right means setting that boundary up correctly at [`open`](../../p5/core/perlfunc/open) time (or with [`binmode`](../../p5/core/perlfunc/binmode)), so that ordinary [`print`](../../p5/core/perlfunc/print) and [`readline`](../../p5/core/perlfunc/readline) already speak the right encoding.

## PerlIO layers

Every filehandle carries a stack of *layers*: small pieces of machinery that transform bytes as they cross the boundary. The relevant ones for Unicode are:

- `:raw`: no transformation. Bytes in, bytes out.
- `:utf8`: treat the bytes read as UTF-8 and mark the string as text, without checking that they are valid. On writes, it trusts you not to produce invalid UTF-8.
- `:encoding(ENCODING)`: decode on read, encode on write, for any named encoding. Malformed input is detected and reported rather than passed through silently.
- `:crlf`: translate `"\r\n"` to `"\n"` on read, `"\n"` to `"\r\n"` on write. It is not applied by default on Linux; on Windows it is the default for text handles.

`:encoding(UTF-8)` is what you almost always want. Prefer it over the bare `:utf8` layer: the `:encoding(...)` form validates, `:utf8` does not.

## Opening a file with an encoding

The three-argument form of [`open`](../../p5/core/perlfunc/open) takes a mode that can include layers:

```perl
open my $in,  "<:encoding(UTF-8)", "input.txt"  or die "open: $!";
open my $out, ">:encoding(UTF-8)", "output.txt" or die "open: $!";

while (my $line = <$in>) {
    chomp $line;                   # characters, not bytes
    print $out uc $line, "\n";     # encoded on the way out
}
```

Every [`readline`](../../p5/core/perlfunc/readline) from `$in` returns a text string. Every [`print`](../../p5/core/perlfunc/print) to `$out` is encoded to UTF-8 as it is written. The body of the loop sees only characters: no `decode`, no `encode`, no counting bytes by mistake.

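To make the character/byte distinction concrete, here is a minimal sketch that writes a short string through `:encoding(UTF-8)`, then reads the same file back once as raw bytes and once as decoded text. The temporary file (created with `File::Temp`) and the sample string are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Illustrative scratch file; any writable path would do.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Ten characters; two of them (U+00EF and U+00E9) need two bytes each in UTF-8.
my $text = "na\x{EF}ve caf\x{E9}";

# Write through an encoding layer: characters in, UTF-8 bytes out.
open my $out, ">:encoding(UTF-8)", $path or die "open: $!";
print $out $text;
close $out or die "close: $!";

# Read the file back as raw bytes.
open my $raw, "<:raw", $path or die "open: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read the file back as decoded text.
open my $in, "<:encoding(UTF-8)", $path or die "open: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d characters in memory\n",
    length($bytes), length($chars);
# prints: 12 bytes on disk, 10 characters in memory
```

The text handle hands back the original ten-character string, while the raw handle exposes the twelve bytes that actually sit in the file.
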
## Retrofitting with `binmode`

If the filehandle is already open (for example the standard streams, or a handle handed to you by another module), attach the layer with [`binmode`](../../p5/core/perlfunc/binmode):

```perl
binmode STDIN,  ":encoding(UTF-8)";
binmode STDOUT, ":encoding(UTF-8)";
binmode STDERR, ":encoding(UTF-8)";
```

Do this once, at the top of the program, before any I/O on those handles. After that the standard streams behave the same as a file opened with an encoding layer.

## The `use open` pragma

Writing the same `:encoding(UTF-8)` layer on every [`open`](../../p5/core/perlfunc/open) is noisy. The `open` pragma sets the default layers for every subsequent open in the enclosing lexical scope that does not name layers itself:

```perl
use open ":std", ":encoding(UTF-8)";
```

The `:std` tag also re-layers `STDIN`, `STDOUT`, and `STDERR` on the spot, so you get the effect of three [`binmode`](../../p5/core/perlfunc/binmode) calls and a default for every later [`open`](../../p5/core/perlfunc/open) in one line.

This is the usual two-line preamble at the top of a modern Unicode-aware script:

```perl
use utf8;
use open ":std", ":encoding(UTF-8)";
```

- `use utf8`: literals in this source file are UTF-8.
- `use open ":std", ":encoding(UTF-8)"`: stdin/stdout/stderr and every later open default to UTF-8 on the wire.

## Binary I/O alongside text I/O

Some handles carry bytes, not text: image files, compressed streams, network protocols that frame their own encoding. Open those `:raw` and never attach an encoding:

```perl
open my $img, "<:raw", "photo.jpg" or die "open: $!";
```

Within one program, some handles can be text (`:encoding(UTF-8)`) and others binary (`:raw`). Keep the labels straight: a binary handle gives you byte strings, a text handle gives you text strings, and error reporting through `$!` works the same either way.

## Interactive input

A terminal does not announce the encoding of what is typed into it. On modern Linux it is almost always UTF-8; on Windows the default console code page is still commonly a legacy one such as CP437 or CP850.
The usual assumption for cross-platform scripts is UTF-8, with [`binmode`](../../p5/core/perlfunc/binmode) applied at startup:

```perl
binmode STDIN, ":encoding(UTF-8)";
```

If your program must support legacy terminals, decode based on the `LANG` or `LC_CTYPE` environment variable, or accept a `--encoding` flag. There is no portable auto-detection.

## Checking layers

To see which layers are on a handle, use `PerlIO::get_layers`:

```perl
use PerlIO;

my @layers = PerlIO::get_layers(\*STDOUT);
print "@layers\n";   # e.g. unix perlio encoding(utf-8-strict) utf8
```

This is useful when a program is silently producing double-encoded output and you want to find out which handle has the surprise layer on it.
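
One way to keep a surprise layer from appearing in the first place is to inspect the handle before adding an encoding. The sketch below is only an illustration: `ensure_utf8_layer` is a hypothetical helper, not part of Perl or of any module, and it looks only for `encoding(...)` entries in the layer list, so a handle carrying just the bare `:utf8` flag would not be detected.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: attach :encoding(UTF-8) unless the handle already
# has some encoding(...) layer. Returns 1 if it added a layer, 0 if not.
sub ensure_utf8_layer {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers($fh, output => 1);
    return 0 if grep { /^encoding\(/ } @layers;
    binmode $fh, ":encoding(UTF-8)" or die "binmode: $!";
    return 1;
}

# Illustrative scratch file standing in for a handle from elsewhere.
my ($fh, $path) = tempfile(UNLINK => 1);

my $first  = ensure_utf8_layer($fh);   # no encoding layer yet, so one is added
my $second = ensure_utf8_layer($fh);   # layer already present, nothing happens

print {$fh} "caf\x{E9}";               # four characters, one of them non-ASCII
close $fh or die "close: $!";

my $size = -s $path;
print "added on first call: $first, on second call: $second\n";
print "file size: $size bytes\n";      # 5 bytes: U+00E9 takes two in UTF-8
```

Because the second call finds the existing layer and leaves the handle alone, the text is encoded exactly once on its way to the file.
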