I/O#
The filehandle is the boundary between your program’s text world and the outside world of bytes. Getting Unicode right means setting that boundary up correctly at open time (or with binmode), so that ordinary print and readline already speak the right encoding.
PerlIO layers#
Every filehandle carries a stack of layers — small pieces of machinery that transform bytes as they cross the boundary. The relevant ones for Unicode are:
- :raw applies no transformation. Bytes in, bytes out.
- :utf8 treats the bytes read as UTF-8 and marks the string as text, without verifying that they are well-formed. On write it likewise trusts you not to produce invalid UTF-8.
- :encoding(ENCODING) decodes on read and encodes on write, for any named encoding. Malformed input is detected and reported by default rather than passed through silently.
- :crlf translates "\r\n" to "\n" on read and "\n" to "\r\n" on write. On Linux it is not applied by default; on Windows it is the default for text handles.
:encoding(UTF-8) is what you almost always want. Prefer it over the bare :utf8 layer: the :encoding(...) form validates its input, while :utf8 does not.
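To make the effect of a layer concrete, here is a minimal sketch that reads the same two bytes through :raw and through :encoding(UTF-8). It uses an in-memory filehandle (a reference to a scalar in place of a file name) purely for illustration; the variable names are made up for this example.

```perl
use strict;
use warnings;

# Two bytes: the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";

# :raw hands the bytes through untouched.
open my $raw, "<:raw", \$bytes or die "open: $!";
my $as_bytes = <$raw>;
close $raw;

# :encoding(UTF-8) decodes them into a single character.
open my $txt, "<:encoding(UTF-8)", \$bytes or die "open: $!";
my $as_text = <$txt>;
close $txt;

# prints: raw: 2, decoded: 1
printf "raw: %d, decoded: %d\n", length $as_bytes, length $as_text;
```

The data is identical in both cases; only the layer decides whether length counts bytes or characters.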
Opening a file with an encoding#
The three-argument form of open takes a mode that can include layers:
open my $in, "<:encoding(UTF-8)", "input.txt" or die "open: $!";
open my $out, ">:encoding(UTF-8)", "output.txt" or die "open: $!";
while (my $line = <$in>) {
    chomp $line;                 # characters, not bytes
    print $out uc $line, "\n";   # encoded on the way out
}
Every readline from $in returns a text string. Every print to $out is encoded to UTF-8 on the wire. The body of the loop sees only characters: no decode, no encode, no counting bytes by mistake.
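The same loop can be tried without touching the disk. The sketch below swaps the two files for in-memory buffers (an assumption made only so the example is self-contained) and shows that the loop body really does work on characters: the accented letter counts as one character and uc maps it correctly.

```perl
use strict;
use warnings;

# Hypothetical input: "caf" plus U+00E9 plus newline, as UTF-8 bytes.
my $input_bytes  = "caf\xC3\xA9\n";
my $output_bytes = "";

open my $in,  "<:encoding(UTF-8)", \$input_bytes  or die "open: $!";
open my $out, ">:encoding(UTF-8)", \$output_bytes or die "open: $!";

my $chars_seen = 0;
while (my $line = <$in>) {
    chomp $line;
    $chars_seen += length $line;   # 4 characters, although 5 bytes were read
    print $out uc $line, "\n";     # uc maps U+00E9 to U+00C9
}
close $in;
close $out or die "close: $!";

# $output_bytes now holds "CAF" followed by the bytes C3 89 and a newline.
```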
Retrofitting with binmode#
If the filehandle is already open (for example the standard streams, or a handle handed to you by another module), attach the layer with binmode:
binmode STDIN, ":encoding(UTF-8)";
binmode STDOUT, ":encoding(UTF-8)";
binmode STDERR, ":encoding(UTF-8)";
Do this once, at the top of the program, before any I/O on those handles. After that the standard streams behave the same as a file opened with an encoding layer.
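The same call works on any handle that was opened without an encoding. As a small sketch, the handle below is opened on an in-memory buffer with no layer in the mode (standing in for a handle you received from elsewhere) and is retrofitted before the first write.

```perl
use strict;
use warnings;

my $buffer = "";
open my $fh, ">", \$buffer or die "open: $!";   # no encoding layer yet

# Attach the layer before any output is written.
binmode $fh, ":encoding(UTF-8)" or die "binmode: $!";

print $fh "\x{263A}\n";   # U+263A WHITE SMILING FACE, one character
close $fh or die "close: $!";

# $buffer now holds the three bytes E2 98 BA followed by a newline.
```

Without the binmode call, the print would trigger a "Wide character" warning, because the handle would have no rule for turning the character into bytes.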
The use open pragma#
Writing the same :encoding(UTF-8) layer on every open is noisy. The open pragma sets the default layers for all subsequent opens in the enclosing scope:
use open ":std", ":encoding(UTF-8)";
The :std tag also re-layers STDIN, STDOUT, and STDERR on the spot, so you get the effect of three binmode calls and a default for every later open in one line.
This is the standard two-line preamble at the top of a modern Unicode-aware script:
use utf8;
use open ":std", ":encoding(UTF-8)";
- use utf8 declares that literals in this source file are UTF-8.
- use open ":std", ":encoding(UTF-8)" makes stdin/stdout/stderr and every later open default to UTF-8 on the wire.
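A quick way to convince yourself that the pragma is in effect is to open a file with a plain "<" or ">" mode and inspect the result. The sketch below does that with a temporary file from File::Temp (chosen here only to keep the example self-contained) and PerlIO::get_layers.

```perl
use strict;
use warnings;
use open ":std", ":encoding(UTF-8)";
use File::Temp qw(tempfile);

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer in the mode string: the pragma supplies the default.
open my $out, ">", $path or die "open: $!";
print $out "\x{20AC}\n";   # U+20AC EURO SIGN
close $out or die "close: $!";

open my $in, "<", $path or die "open: $!";
my @layers = PerlIO::get_layers($in);   # includes an encoding(...) entry
my $line = <$in>;
close $in;
chomp $line;

print "layers: @layers\n";
print "characters read: ", length $line, "\n";
```

An open that names its own layers, such as "<:raw", still overrides the default, so binary handles are unaffected.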
Binary I/O alongside text I/O#
Some handles carry bytes, not text: image files, compressed streams, network protocols that frame their own encoding. Open those :raw and never attach an encoding:
open my $img, "<:raw", "photo.jpg" or die "open: $!";
Within one program, some handles can be text (:encoding(UTF-8)) and others binary (:raw). Keep the labels straight: a binary handle gives you byte strings, a text handle gives you text strings, and $! is the same either way.
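As an illustration of a binary handle, the following sketch writes the 8-byte PNG signature to a temporary file and reads it back with read on a :raw handle. The temporary file stands in for a real image, and the variable names are invented for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The 8-byte PNG signature, used here as stand-in binary data.
my $signature = "\x89PNG\r\n\x1A\n";

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ":raw";
print $tmp_fh $signature;
close $tmp_fh or die "close: $!";

# Read the bytes back exactly as stored: no decoding, no newline translation.
open my $img, "<:raw", $path or die "open: $!";
my $got = read $img, my $header, 8;
defined $got or die "read: $!";
close $img;

my $is_png = $header eq $signature;
print $is_png ? "looks like a PNG\n" : "not a PNG\n";
```

The signature contains both "\r\n" and a lone "\n", which is exactly the kind of data a :crlf or encoding layer would corrupt.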
Interactive input#
A terminal does not tell you what encoding it is typing in. On modern Linux it is almost always UTF-8; on Windows the default console code page is still commonly a legacy OEM code page such as CP437 or CP850. The usual assumption for cross-platform scripts is UTF-8, with binmode applied at startup:
binmode STDIN, ":encoding(UTF-8)";
If your program must support legacy terminals, decode based on the LANG or LC_CTYPE environment variable, or accept a --encoding flag. There is no portable auto-detection.
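One possible shape for that fallback logic is sketched below. The helper choose_encoding is hypothetical, not part of any module: it prefers an explicit --encoding flag, then the codeset suffix of the first locale variable that is set (also consulting LC_ALL, which takes precedence over the other two under POSIX rules), and otherwise assumes UTF-8. Treating a locale without a codeset, such as "C", as UTF-8 is a simplification made for this sketch.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: pick an encoding name for the terminal.
sub choose_encoding {
    my ($argv, $env) = @_;

    my $from_flag;
    GetOptionsFromArray($argv, "encoding=s" => \$from_flag)
        or die "usage: $0 [--encoding NAME]\n";
    return $from_flag if defined $from_flag;

    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        next unless defined $value && length $value;
        # Locale names look like "de_DE.ISO-8859-1@euro"; keep the codeset.
        return $1 if $value =~ /\.([^@]+)/;
        last;   # first variable that is set wins, even without a codeset
    }
    return "UTF-8";   # assumption: default when nothing better is known
}

# Fixed sample inputs, so the sketch does not depend on the real environment.
my $by_flag   = choose_encoding(["--encoding", "ISO-8859-2"], {});
my $by_locale = choose_encoding([], { LANG => 'de_DE.ISO-8859-1@euro' });
my $fallback  = choose_encoding([], { LANG => 'C' });
print "$by_flag $by_locale $fallback\n";
```

In a real script the call would be choose_encoding(\@ARGV, \%ENV), with the result passed to binmode STDIN, ":encoding($name)" and the return value of binmode checked, since the locale may name an encoding that Perl does not know.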
Checking layers#
To see which layers are on a handle, use PerlIO::get_layers:
use PerlIO;
my @layers = PerlIO::get_layers(\*STDOUT);
print "@layers\n"; # e.g. unix perlio encoding(utf-8-strict)
This is useful when a program is silently producing double-encoded output and you want to find out which handle has the surprise layer on it.
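A typical double-encoding bug, and the check that exposes it, might look like the sketch below. The helper encoding_layers is hypothetical; it only filters the list that PerlIO::get_layers returns. The in-memory buffer is an assumption made so that the resulting bytes can be inspected.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the encoding layers currently on a handle.
sub encoding_layers {
    my ($fh) = @_;
    return grep { /^encoding\(/ } PerlIO::get_layers($fh, output => 1);
}

my $buffer = "";
open my $out, ">:encoding(UTF-8)", \$buffer or die "open: $!";

# The bug: encoding by hand, then printing to a handle that encodes again.
my $already_encoded = encode("UTF-8", "\x{E9}");   # two bytes: C3 A9
print $out $already_encoded;

my @found = encoding_layers($out);   # non-empty: the handle encodes too
close $out or die "close: $!";

# One character went in, four bytes came out: C3 83 C2 A9.
printf "layers: %s, bytes written: %d\n", "@found", length $buffer;
```

Once the layer shows up in the list, the fix is to drop either the manual encode call or the layer, never to keep both.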