Encoding and layers#

A filehandle in Perl is not a raw pipe to bytes; it is a stack of PerlIO layers. Each layer transforms data on the way in or out: translating between characters and octets, converting newlines, applying compression, buffering. This page is about the handful of layers you will touch in everyday code.

Why this matters#

When you read a text file, the bytes on disk are not the characters your program manipulates. UTF-8 encodes the character ä as two bytes (0xC3 0xA4). If Perl reads those bytes as if they were already characters, you get two garbage characters, the classic “mojibake”. If you write characters without saying how they should be encoded, Perl has to guess: characters above 0xFF trigger a “Wide character” warning and go out as UTF-8, while characters in the range 0x80 to 0xFF go out as single Latin-1 bytes.
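A minimal sketch of the byte/character distinction, using the core Encode module directly instead of a layer. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA4";                # the two UTF-8 octets for U+00E4 (ä)
my $chars = decode('UTF-8', $bytes);   # the same data as one decoded character

printf "before decoding: %d characters\n", length $bytes;
printf "after decoding:  %d character, U+%04X\n", length $chars, ord $chars;
```

An :encoding(UTF-8) layer performs the same decoding step for you on every read.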

The layer stack is where you tell Perl how to translate. A correct stack for a UTF-8 text file is one line:

open my $fh, "<:encoding(UTF-8)", $path   or die "open $path: $!";

The two places layers can live#

Layers attach to a handle in two ways:

  1. As part of the mode string at open time, combined with the read/write direction:

    open my $fh, "<:encoding(UTF-8)",  $path or die;
    open my $fh, ">:encoding(UTF-8)",  $path or die;
    open my $fh, ">>:encoding(UTF-8)", $path or die;
    
  2. After opening, via binmode:

    open my $fh, "<", $path            or die;
    binmode $fh, ":encoding(UTF-8)"    or die "binmode: $!";
    

The two forms are equivalent for freshly opened handles. binmode is what you reach for on a handle you did not open — STDIN, STDOUT, STDERR, and handles passed in from elsewhere.
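As a rough illustration of the two forms working together, the sketch below writes one character through a mode-string layer and reads it back through binmode. The scratch file comes from File::Temp and is an assumption of the example, not part of the original text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; removed when the program exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close $path: $!";

# Form 1: layer given in the mode string.
open my $out, ">:encoding(UTF-8)", $path or die "open $path: $!";
print {$out} "\x{E4}" or die "write $path: $!";    # one character, ä
close $out or die "close $path: $!";

my $size = -s $path;                                # size on disk, in bytes

# Form 2: plain open, layer added afterwards with binmode.
open my $in, "<", $path or die "open $path: $!";
binmode $in, ":encoding(UTF-8)" or die "binmode: $!";
my $text = <$in>;
close $in or die "close $path: $!";

printf "%d bytes on disk, %d character in memory\n", $size, length $text;
```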

The layers you actually need#

Four cover ninety-nine percent of cases:

  • :encoding(UTF-8) — characters in Perl-land, UTF-8 bytes on disk. The name of the encoding is anything Perl’s Encode module understands: "UTF-8", "ISO-8859-15", "Windows-1252", "Shift_JIS", and so on.

  • :utf8 — a looser sibling of :encoding(UTF-8) that skips validation on input. It is faster and more dangerous: malformed bytes become malformed strings that misbehave later. Prefer :encoding(UTF-8) unless you have measured and proven the validation cost matters.

  • :raw — bytes through, no translation. Equivalent to plain binmode with no second argument. Use it for binary formats, for anything you intend to hash or compare byte-exactly, and for data you are about to hand to a format-specific parser.

  • :crlf converts \r\n to \n on read and back on write. It is part of the default stack on Windows but not on Linux, where nothing translates newlines unless you ask; you add it explicitly when reading files written on Windows and you want \n-terminated lines in memory.
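To make the :crlf bullet concrete, here is a small sketch that writes Windows-style line endings byte for byte and reads them back through the layer. The file is a File::Temp scratch file chosen for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close $path: $!";

# Write \r\n line endings exactly as given, with no translation.
open my $out, ">:raw", $path or die "open $path: $!";
print {$out} "one\r\ntwo\r\n" or die "write $path: $!";
close $out or die "close $path: $!";

# Read the file back through :crlf.
open my $in, "<:crlf", $path or die "open $path: $!";
my @lines = <$in>;
close $in or die "close $path: $!";

# Count lines that still contain a carriage return.
my $with_cr = grep { /\r/ } @lines;
printf "%d lines read, %d still contain a carriage return\n",
    scalar @lines, $with_cr;
```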

Layers compose in writing order:

open my $fh, "<:raw:encoding(UTF-8)", $path   or die;

Read it left to right: start from raw bytes, then decode them as UTF-8. The :raw prefix is common when you want a predictable baseline to layer on top of, independent of platform defaults such as the :crlf layer Perl applies on Windows.
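One way to check what a mode string actually produced is to ask Perl for the handle's layer stack with the built-in PerlIO::get_layers. This is a sketch; the exact names printed vary by platform and Perl version, so no output is shown.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close $path: $!";

open my $fh, "<:raw:encoding(UTF-8)", $path or die "open $path: $!";
my @layers = PerlIO::get_layers($fh);   # bottom of the stack first
close $fh or die "close $path: $!";

print "layers: @layers\n";
```

The list should include an entry beginning with "encoding", which is the decoding layer sitting on top of the byte-level ones.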

A sensible default for text programs#

If a program deals with UTF-8 text throughout, the cleanest move is to declare that at the top:

use open qw< :std :encoding(UTF-8) >;

  • :std applies the given layer to STDIN, STDOUT, STDERR.

  • :encoding(UTF-8) becomes the default layer for every subsequent open call in the enclosing lexical scope (usually the rest of the file) that does not name layers of its own.

After that pragma, bare three-argument opens decode UTF-8 automatically:

use open qw< :std :encoding(UTF-8) >;

open my $fh, "<", $path   or die "open $path: $!";
while (my $line = <$fh>) {
    # $line is a character string, already decoded
}

Do not use a bare "<" without having either set a default encoding or explicitly planned for raw bytes. Reading without a declared encoding gives you one character per byte, with code points 0 to 255, as if the file were Latin-1. That looks right for ASCII and breaks on the first non-ASCII byte of a UTF-8 file.
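The open pragma is lexically scoped, which the following sketch tries to show by comparing the layers of two bare opens, one inside a block that uses the pragma and one outside it. :std is left out here so the standard handles stay untouched; the scratch file is an assumption of the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close $path: $!";

my (@inside, @outside);
{
    use open qw< :encoding(UTF-8) >;    # default applies within this block only
    open my $fh, "<", $path or die "open $path: $!";
    @inside = PerlIO::get_layers($fh);
    close $fh or die "close $path: $!";
}

# Outside the block the bare "<" gets the platform default layers.
open my $fh, "<", $path or die "open $path: $!";
@outside = PerlIO::get_layers($fh);
close $fh or die "close $path: $!";

print "inside:  @inside\n";
print "outside: @outside\n";
```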

Binary files#

For binary data, declare raw mode up front:

open my $in,  "<:raw", $in_path   or die "open $in_path: $!";
open my $out, ">:raw", $out_path  or die "open $out_path: $!";

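my $buf;
while (1) {
    my $n = read $in, $buf, 65536;
    defined $n or die "read $in_path: $!";   # undef means a read error
    last if $n == 0;                          # 0 means end of file
    print {$out} $buf or die "write $out_path: $!";
}
close $in  or die "close $in_path: $!";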
close $out or die "close $out_path: $!";

read is the buffered fixed-size reader; readline is the line-oriented one. For binary data you almost always want read.

The explicit :raw on the output is the belt-and-braces move: if someone adds use open ":encoding(UTF-8)" to the module three years from now, a bare ">" will start encoding your bytes as if they were characters. Layers named in the mode string take precedence over the pragma's defaults.
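A small sketch of that failure mode: with a UTF-8 default in effect, the same single byte is written once through an explicit :raw handle and once through a bare ">". The file names and the test byte are assumptions of the example.

```perl
use strict;
use warnings;
use open qw< :encoding(UTF-8) >;    # the default a later edit might introduce
use File::Temp qw(tempfile);

my ($tmp_a, $raw_path)  = tempfile(UNLINK => 1);
my ($tmp_b, $bare_path) = tempfile(UNLINK => 1);
close $tmp_a or die "close $raw_path: $!";
close $tmp_b or die "close $bare_path: $!";

my $byte = "\xE4";                  # one byte of binary data

# Explicit layer: the pragma default is not applied.
open my $raw, ">:raw", $raw_path or die "open $raw_path: $!";
print {$raw} $byte or die "write $raw_path: $!";
close $raw or die "close $raw_path: $!";

# Bare mode: the pragma default encodes the byte as a character.
open my $bare, ">", $bare_path or die "open $bare_path: $!";
print {$bare} $byte or die "write $bare_path: $!";
close $bare or die "close $bare_path: $!";

my $raw_size  = -s $raw_path;
my $bare_size = -s $bare_path;
print "explicit :raw wrote $raw_size byte(s), bare mode wrote $bare_size byte(s)\n";
```

The bare handle writes two bytes because 0xE4 is treated as the character ä and encoded as UTF-8.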

Changing a handle’s layers after open#

binmode without a layer argument strips all encoding layers, reducing the handle to raw bytes. With a layer argument, it pushes the given layer onto the stack:

binmode STDOUT;                       # raw bytes
binmode STDOUT, ":encoding(UTF-8)";   # encode output characters as UTF-8
binmode STDIN,  ":encoding(MacRoman)";

Call binmode before any data moves through the handle. Pushing a layer mid-stream is technically possible but the transitional byte handling is brittle.
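The effect of a bare binmode can be observed on a fresh handle, before any data has moved, by listing the layers before and after the call. This sketch uses a File::Temp scratch file rather than STDOUT so it has no side effects on the program's own output.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close $path: $!";

open my $fh, "<:encoding(UTF-8)", $path or die "open $path: $!";
my @before = PerlIO::get_layers($fh);

binmode $fh or die "binmode: $!";       # no layer argument: back to raw bytes
my @after = PerlIO::get_layers($fh);
close $fh or die "close $path: $!";

print "before: @before\n";
print "after:  @after\n";
```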

Next#

The file-on-disk case is covered. Next: the same open call, but with a running program on the other end. See Pipes.