```{index} single: PerlIO layers, single: encoding, single: layered I/O, pair: encoding; filehandle
```

# Encoding and layers

A filehandle in Perl is not a raw pipe to bytes; it is a stack of PerlIO layers. Each layer transforms data on the way in or out: translating between characters and octets, converting newlines, applying compression, buffering. This page is about the two or three layers you will touch in everyday code.

```{index} single: UTF-8, single: mojibake, single: characters vs bytes
```

## Why this matters

When you read a text file, the bytes on disk are not the characters your program manipulates. UTF-8 encodes the character `ä` as two bytes (`0xC3 0xA4`). If Perl reads those bytes as if they were already characters, you get two garbage characters: classic "mojibake". If you write characters without saying how they should be encoded, the reverse happens.

The layer stack is where you tell Perl how to translate. A correct stack for a UTF-8 text file is one line:

```perl
open my $fh, "<:encoding(UTF-8)", $path or die "open $path: $!";
```

```{index} single: binmode, single: PerlIO layers; at open, single: PerlIO layers; via binmode
```

## The two places layers can live

Layers attach to a handle in two ways:

1. As part of the mode string at `open` time, combined with the read/write direction:

   ```perl
   # Three alternatives; pick the direction you need.
   open my $fh, "<:encoding(UTF-8)", $path or die;    # read
   open my $fh, ">:encoding(UTF-8)", $path or die;    # write, truncating
   open my $fh, ">>:encoding(UTF-8)", $path or die;   # append
   ```

2. After opening, via [`binmode`](../../p5/core/perlfunc/binmode):

   ```perl
   open my $fh, "<", $path or die;
   binmode $fh, ":encoding(UTF-8)" or die "binmode: $!";
   ```

The two forms are equivalent for freshly opened handles. `binmode` is what you reach for on a handle you did not open: `STDIN`, `STDOUT`, `STDERR`, and handles passed in from elsewhere.
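The equivalence of the two forms, and the bytes-versus-characters distinction behind it, can be checked with a small experiment. The following is a minimal sketch, not a recommended utility: `slurp` is a hypothetical helper invented for this illustration, and the scratch file comes from the core `File::Temp` module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two UTF-8 bytes for "ä" (0xC3 0xA4) to a scratch file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xC3\xA4" or die "write $path: $!";
close $tmp or die "close $path: $!";

# Hypothetical helper: read the whole file, attaching the layer either
# in the mode string ('open') or afterwards ('binmode').
sub slurp {
    my ($file, $layer, $how) = @_;
    my $mode = $how eq 'open' ? "<$layer" : "<";
    open my $fh, $mode, $file or die "open $file: $!";
    if ($how eq 'binmode') {
        binmode $fh, $layer or die "binmode: $!";
    }
    local $/;    # slurp mode: read everything in one go
    my $data = <$fh>;
    close $fh or die "close $file: $!";
    return $data;
}

my $via_open    = slurp($path, ":encoding(UTF-8)", 'open');
my $via_binmode = slurp($path, ":encoding(UTF-8)", 'binmode');
my $raw         = slurp($path, ":raw",             'open');

printf "open: %d char, binmode: %d char, raw: %d bytes\n",
    length $via_open, length $via_binmode, length $raw;
```

Both decoded reads yield a single character, while the raw read yields the two bytes that are actually on disk.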

```{index} single: PerlIO layers; :encoding, single: PerlIO layers; :utf8, single: PerlIO layers; :raw, single: PerlIO layers; :crlf, single: Encode module, single: newline translation
```

## The layers you actually need

Four layers cover nearly all everyday cases:

- `:encoding(UTF-8)`: characters in Perl-land, UTF-8 bytes on disk. The name of the encoding is anything Perl's `Encode` module understands: `"UTF-8"`, `"ISO-8859-15"`, `"Windows-1252"`, `"Shift_JIS"`, and so on.
- `:utf8`: a looser sibling of `:encoding(UTF-8)` that skips validation on input. It is faster and more dangerous: malformed bytes become malformed strings that misbehave later. Prefer `:encoding(UTF-8)` unless you have measured and proven the validation cost matters.
- `:raw`: bytes through, no translation. Equivalent to plain [`binmode`](../../p5/core/perlfunc/binmode) with no second argument. Use it for binary formats, for anything you intend to hash or compare byte-exactly, and for data you are about to hand to a format-specific parser.
- `:crlf`: converts `\r\n` to `\n` on read and back on write. On Linux it is not part of the default stack; you add it explicitly when reading files written on Windows and you want `\n`-terminated lines in memory.

Layers are applied in the order you write them, left to right:

```perl
open my $fh, "<:raw:encoding(UTF-8)", $path or die;
```

Reads as: strip to raw bytes first, then decode them as UTF-8. The `:raw` prefix is common when you want a predictable baseline to layer on top of, independent of whatever the `open` pragma set as the default.

```{index} single: open pragma, single: use open, single: PerlIO layers; :std
```

## A sensible default for text programs

If a program deals with UTF-8 text throughout, the cleanest move is to declare that at the top:

```perl
use open qw< :std :encoding(UTF-8) >;
```

- `:std` applies the given layer to `STDIN`, `STDOUT`, `STDERR`.
- `:encoding(UTF-8)` becomes the default layer for every subsequent `open` call in the pragma's lexical scope, which is normally the rest of the file.

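One way to confirm what the pragma did is to ask a handle for its layer stack. This is a minimal sketch that assumes the built-in `PerlIO::get_layers` function and a scratch file from the core `File::Temp` module; the exact layer names printed can vary between platforms and Perl builds.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use open qw< :std :encoding(UTF-8) >;

# Scratch file so the example has something to open.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close $path: $!";

# No layer in the mode string: the pragma supplies the default.
open my $fh, "<", $path or die "open $path: $!";
my @layers = PerlIO::get_layers($fh);
close $fh or die "close $path: $!";

print "layers: @layers\n";
```

If the pragma took effect, the printed list should contain an `encoding(...)` entry even though the `open` call itself named no layer.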

After that pragma, bare three-argument opens decode UTF-8 automatically:

```perl
use open qw< :std :encoding(UTF-8) >;

open my $fh, "<", $path or die "open $path: $!";
while (my $line = <$fh>) {
    # $line is a character string, already decoded
}
```

Do not use a bare `"<"` without having either set a default encoding or explicitly planned for raw bytes. Reading without an announced encoding turns each byte into one character with the same code point, as if the file were Latin-1. That looks right for ASCII and breaks on the first non-ASCII byte of a UTF-8 file.

```{index} single: binary files, single: read; buffered, single: PerlIO layers; :raw
```

## Binary files

For binary data, declare raw mode up front:

```perl
open my $in,  "<:raw", $in_path  or die "open $in_path: $!";
open my $out, ">:raw", $out_path or die "open $out_path: $!";

# Copy in 64 KiB chunks; read returns 0 at end of file.
my $buf;
while (my $n = read $in, $buf, 65536) {
    print $out $buf or die "write $out_path: $!";
}

close $in  or die "close $in_path: $!";
close $out or die "close $out_path: $!";
```

`read` is the buffered fixed-size reader; `readline` is the line-oriented one. For binary data you almost always want `read`.

The explicit `:raw` on the output is the belt-and-braces move: if someone adds `use open ":encoding(UTF-8)"` to the module three years from now, a binary writer that relied on the default would start mangling bytes. The layer on the handle overrides the pragma.

```{index} single: binmode; strip layers, single: binmode; push layer
```

## Changing a handle's layers after open

`binmode` without a layer argument strips all encoding layers, reducing the handle to raw bytes. With a layer argument, it pushes the given layer onto the stack:

```perl
binmode STDOUT;                          # raw bytes
binmode STDOUT, ":encoding(UTF-8)";      # encode output characters as UTF-8
binmode STDIN,  ":encoding(MacRoman)";   # decode input bytes as MacRoman
```

Call `binmode` before any data moves through the handle. Pushing a layer mid-stream is technically possible, but data already sitting in the handle's buffer may have been processed under the old layers, so the transition is brittle.

## Next

The file-on-disk case is covered. Next: the same `open` call, but with a running program on the other end. See [Pipes](pipes).