Encoding and layers
A filehandle in Perl is not a raw pipe to bytes; it is a stack of PerlIO layers. Each layer transforms data on the way in or out: translating between characters and octets, converting newlines, applying compression, buffering. This page is about the handful of layers you will touch in everyday code.
Why this matters
When you read a text file, the bytes on disk are not the characters
your program manipulates. UTF-8 encodes the character ä as two
bytes (0xC3 0xA4). If Perl reads those bytes as if they were
already characters, you get two garbage characters — classic
“mojibake”. If you write characters without saying how they should
be encoded, the reverse happens.
The layer stack is where you tell Perl how to translate. A correct stack for a UTF-8 text file is one line:
open my $fh, "<:encoding(UTF-8)", $path or die "open $path: $!";
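The byte/character distinction can be made concrete with a minimal sketch. It writes the two UTF-8 bytes for ä to a temporary file, then reads the file back once as raw bytes and once through :encoding(UTF-8). The use of File::Temp and all variable names are choices made for this illustration, not part of the original text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two octets that UTF-8 uses for "ä" (U+00E4).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;                                  # raw bytes, no translation
print {$tmp} "\xC3\xA4" or die "write $path: $!";
close $tmp or die "close $path: $!";

# Read as raw bytes: Perl sees two one-byte "characters".
open my $raw, "<:raw", $path or die "open $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw or die "close $path: $!";

# Read through the decoding layer: one character, U+00E4.
open my $txt, "<:encoding(UTF-8)", $path or die "open $path: $!";
my $chars = do { local $/; <$txt> };
close $txt or die "close $path: $!";

printf "raw length: %d, decoded length: %d\n",
    length($bytes), length($chars);
# prints "raw length: 2, decoded length: 1"
```

Printing $bytes to a UTF-8 terminal as if it were text is what produces the two garbage characters described above.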
The two places layers can live
Layers attach to a handle in two ways:
As part of the mode string at open time, combined with the read/write direction:
open my $fh, "<:encoding(UTF-8)", $path or die;    # read
open my $fh, ">:encoding(UTF-8)", $path or die;    # write, truncating
open my $fh, ">>:encoding(UTF-8)", $path or die;   # append
After opening, via binmode:
open my $fh, "<", $path or die;
binmode $fh, ":encoding(UTF-8)" or die "binmode: $!";
The two forms are equivalent for freshly opened handles. binmode
is what you reach for on a handle you did not open — STDIN,
STDOUT, STDERR, and handles passed in from elsewhere.
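One way to check that the two forms end up in the same place is to ask each handle for its layer stack. The sketch below uses PerlIO::get_layers, which returns the names of the layers on a handle. The temporary file and the variable names are assumptions made for the illustration, and the exact names printed vary by platform and Perl build, so no expected output is shown.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# An empty temporary file is enough; only the handles matter here.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "close $path: $!";

# Form 1: layer given in the mode string.
open my $open_fh, "<:encoding(UTF-8)", $path or die "open $path: $!";

# Form 2: plain open, then binmode.
open my $bin_fh, "<", $path or die "open $path: $!";
binmode $bin_fh, ":encoding(UTF-8)" or die "binmode: $!";

# List the layers currently attached to each handle.
my @via_open    = PerlIO::get_layers($open_fh);
my @via_binmode = PerlIO::get_layers($bin_fh);

print "mode string: @via_open\n";
print "binmode:     @via_binmode\n";
```

The same call works on STDIN, STDOUT, and STDERR, which makes it a handy check after a binmode on a handle you did not open yourself.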
The layers you actually need
Four cover ninety-nine percent of cases:
:encoding(UTF-8) gives you characters in Perl-land and UTF-8 bytes on disk. The name of the encoding is anything Perl’s Encode module understands: "UTF-8", "ISO-8859-15", "Windows-1252", "Shift_JIS", and so on.
:utf8 is a looser sibling of :encoding(UTF-8) that skips validation on input. It is faster and more dangerous: malformed bytes become malformed strings that misbehave later. Prefer :encoding(UTF-8) unless you have measured and proven the validation cost matters.
:raw passes bytes through with no translation. It is equivalent to plain binmode with no second argument. Use it for binary formats, for anything you intend to hash or compare byte-exactly, and for data you are about to hand to a format-specific parser.
:crlf converts \r\n to \n on read and back on write. On Linux it is not part of the default stack; you add it explicitly when reading files written on Windows and you want \n-terminated lines in memory.
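The difference between :raw and :crlf is easy to see on a small file with Windows-style line endings. This is a sketch under stated assumptions: the temporary file, its contents, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a file with \r\n line endings, written byte-exactly.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "one\r\ntwo\r\n" or die "write $path: $!";
close $tmp or die "close $path: $!";

# :raw leaves the carriage returns in place.
open my $raw, "<:raw", $path or die "open $path: $!";
my @raw_lines = <$raw>;
close $raw or die "close $path: $!";

# :crlf turns each \r\n pair into a single \n while reading.
open my $crlf, "<:crlf", $path or die "open $path: $!";
my @crlf_lines = <$crlf>;
close $crlf or die "close $path: $!";

printf "raw first line:  %d chars\n", length($raw_lines[0]);
printf "crlf first line: %d chars\n", length($crlf_lines[0]);
# prints 5 for the raw line ("one\r\n") and 4 for the crlf line ("one\n")
```

With :crlf in place, chomp removes the whole line terminator; with :raw, a stray \r stays at the end of every line.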
Layers compose in writing order:
open my $fh, "<:raw:encoding(UTF-8)", $path or die;
Reads as: strip to raw bytes first, then decode them as UTF-8.
The :raw prefix is common when you want a predictable baseline
to layer on top of, independent of whatever the open pragma set
as the default.
A sensible default for text programs
If a program deals with UTF-8 text throughout, the cleanest move is to declare that at the top:
use open qw< :std :encoding(UTF-8) >;
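:std applies the given layer to STDIN, STDOUT, and STDERR.
:encoding(UTF-8) becomes the default layer for every subsequent open call that does not name its own layers. The pragma is lexically scoped, so placed at the top of a file it covers the rest of that file.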
After that pragma, bare three-argument opens decode UTF-8 automatically:
use open qw< :std :encoding(UTF-8) >;
open my $fh, "<", $path or die "open $path: $!";
while (my $line = <$fh>) {
    # $line is a character string, already decoded
}
Do not use a bare "<" without having either set a default encoding
or explicitly planned for raw bytes. With no encoding layer, Perl maps
each byte to one character with a code point from 0 to 255, which
amounts to treating the file as Latin-1. That looks right for ASCII
and turns every multi-byte UTF-8 sequence into mojibake.
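The effect of the default, and of its absence, can be compared side by side. The sketch below reads the same file twice with a bare "<": once inside a block where the open pragma is active, once outside it. It uses the IO => form of the pragma rather than :std so that the standard handles are left alone. The temporary file and variable names are hypothetical, and the sketch assumes no PERL_UNICODE setting or -C switch is changing the defaults.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A file containing "ä\n" as UTF-8: three bytes on disk.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xC3\xA4\n" or die "write $path: $!";
close $tmp or die "close $path: $!";

my ($with_default, $without_default);

{
    # The pragma is lexically scoped: it applies only inside this block.
    use open IO => ":encoding(UTF-8)";
    open my $fh, "<", $path or die "open $path: $!";
    $with_default = <$fh>;
    close $fh or die "close $path: $!";
}

# Outside the block no default is in effect: one character per byte.
open my $fh, "<", $path or die "open $path: $!";
$without_default = <$fh>;
close $fh or die "close $path: $!";

printf "with default: %d chars, without: %d chars\n",
    length($with_default), length($without_default);
# prints "with default: 2 chars, without: 3 chars"
```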
Binary files
For binary data, declare raw mode up front:
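open my $in, "<:raw", $in_path or die "open $in_path: $!";
open my $out, ">:raw", $out_path or die "open $out_path: $!";
my $buf;
while (1) {
    my $n = read $in, $buf, 65536;
    die "read $in_path: $!" unless defined $n;   # undef signals an error
    last if $n == 0;                             # 0 signals end of file
    print $out $buf or die "write $out_path: $!";
}
close $in or die "close $in_path: $!";
close $out or die "close $out_path: $!";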
read is the buffered fixed-size reader; readline is the
line-oriented one. For binary data you almost always want read.
The explicit :raw on the output is the belt-and-braces move: if
someone adds use open ":encoding(UTF-8)" to the module three
years from now, a binary writer that relied on the default would
start mangling bytes. A layer named in the mode string overrides the pragma.
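A small experiment shows the override in action. In this sketch the pragma is active for the whole file, and the same one-byte string is written twice: once through a bare ">" and once through ">:raw". The temporary files and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;
use open IO => ":encoding(UTF-8)";   # stands in for a default added later
use File::Temp qw(tempfile);

my ($tmp1, $plain_path) = tempfile(UNLINK => 1);
my ($tmp2, $raw_path)   = tempfile(UNLINK => 1);
close $tmp1 or die "close $plain_path: $!";
close $tmp2 or die "close $raw_path: $!";

# No layer in the mode string: the pragma default applies, so the
# single value 0xE4 is encoded as two UTF-8 bytes.
open my $plain, ">", $plain_path or die "open $plain_path: $!";
print {$plain} "\xE4" or die "write $plain_path: $!";
close $plain or die "close $plain_path: $!";

# Explicit :raw: the pragma default is not used, the byte goes out as-is.
open my $raw, ">:raw", $raw_path or die "open $raw_path: $!";
print {$raw} "\xE4" or die "write $raw_path: $!";
close $raw or die "close $raw_path: $!";

printf "plain: %d bytes, raw: %d bytes\n",
    (-s $plain_path), (-s $raw_path);
# prints "plain: 2 bytes, raw: 1 bytes"
```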
Changing a handle’s layers after open
binmode without a layer argument strips all encoding layers,
reducing the handle to raw bytes. With a layer argument, it
pushes the given layer onto the stack:
binmode STDOUT; # raw bytes
binmode STDOUT, ":encoding(UTF-8)"; # encode characters as UTF-8 on output
binmode STDIN, ":encoding(MacRoman)";
Call binmode before any data moves through the handle. Pushing a
layer mid-stream is technically possible but the transitional byte
handling is brittle.
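To see what a binmode call does to the bytes that leave a handle, it helps to capture them. The sketch below uses an in-memory filehandle as a stand-in for STDOUT so the output can be inspected afterwards; that substitution and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# An in-memory handle stands in for STDOUT so the bytes can be examined.
my $buffer = "";
open my $out, ">", \$buffer or die "open in-memory handle: $!";

# Push the encoding layer before any data moves through the handle.
binmode $out, ":encoding(UTF-8)" or die "binmode: $!";

print {$out} "\x{E4}\n" or die "print: $!";   # one character plus a newline
close $out or die "close: $!";

printf "%d bytes: %s\n", length($buffer), unpack("H*", $buffer);
# prints "3 bytes: c3a40a"
```

The character U+00E4 left the program as the two bytes 0xC3 0xA4, followed by the newline byte.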
Next
The file-on-disk case is covered. Next: the same open call, but
with a running program on the other end. See Pipes.