I/O

The filehandle is the boundary between your program’s text world and the outside world of bytes. Getting Unicode right means setting that boundary up correctly at open time (or with binmode), so that ordinary print and readline already speak the right encoding.

PerlIO layers

Every filehandle carries a stack of layers — small pieces of machinery that transform bytes as they cross the boundary. The relevant ones for Unicode are:

  • :raw — no transformation. Bytes in, bytes out.

  • :utf8 — treat bytes read as UTF-8 and mark the string as text, without checking that they are valid. Trust the user not to produce invalid UTF-8 on writes.

  • :encoding(ENCODING) — decode on read, encode on write, for any named encoding. Rejects malformed input by default.

  • :crlf — translate "\r\n" to "\n" on read, "\n" to "\r\n" on write. On Linux this is inactive; on Windows it is the default for text handles.

:encoding(UTF-8) is what you almost always want. Prefer it over the bare :utf8 layer — the :encoding(...) form validates, :utf8 does not.
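
To make the effect of a layer concrete, here is a minimal sketch that writes one short string through :encoding(UTF-8) and reads the same file back twice, once through :raw and once through :encoding(UTF-8). The temporary file and the variable names are illustrative assumptions, not part of any particular program.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# "caf\x{e9}" is four characters; in UTF-8 the e-acute takes two bytes.
my $text = "caf\x{e9}";

# Scratch file for the demonstration (removed at program exit).
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write through an encoding layer: characters in, UTF-8 bytes on disk.
open my $out, ">:encoding(UTF-8)", $path or die "open: $!";
print {$out} $text;
close $out or die "close: $!";

# Read the file back with no transformation: we get the bytes.
open my $raw, "<:raw", $path or die "open: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back through the decoding layer: we get the characters.
open my $in, "<:encoding(UTF-8)", $path or die "open: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "raw: %d bytes, decoded: %d characters\n",
    length($bytes), length($chars);
```

The same file yields a five-byte string through :raw and a four-character string through :encoding(UTF-8); only the layer differs.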

Opening a file with an encoding

The three-argument form of open takes a mode that can include layers:

open my $in,  "<:encoding(UTF-8)", "input.txt"  or die "open: $!";
open my $out, ">:encoding(UTF-8)", "output.txt" or die "open: $!";

while (my $line = <$in>) {
    chomp $line;                        # characters, not bytes
    print $out uc $line, "\n";          # encoded on the way out
}

Every readline from $in returns a text string. Every print to $out is encoded to UTF-8 on the wire. The body of the loop sees only characters — no decode, no encode, no counting bytes by mistake.
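
Because the layers do all the conversion, the same loop shape also transcodes between encodings. The sketch below assumes a Latin-1 input file, created here in a temporary directory purely for illustration, and copies it to UTF-8. The file names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);
my $in_path  = "$dir/latin1.txt";
my $out_path = "$dir/utf8.txt";

# Prepare a Latin-1 input file: byte 0xE9 is e-acute in ISO-8859-1.
open my $prep, ">:raw", $in_path or die "open: $!";
print {$prep} "caf\xE9\n";
close $prep or die "close: $!";

# Decode Latin-1 on the way in, encode UTF-8 on the way out.
open my $in,  "<:encoding(ISO-8859-1)", $in_path  or die "open: $!";
open my $out, ">:encoding(UTF-8)",      $out_path or die "open: $!";
while (my $line = <$in>) {
    print {$out} $line;            # the loop body only ever sees characters
}
close $in;
close $out or die "close: $!";

# Inspect the result as raw bytes.
open my $check, "<:raw", $out_path or die "open: $!";
my $utf8_bytes = do { local $/; <$check> };
close $check;

printf "output is %d bytes\n", length($utf8_bytes);
```

The loop contains no encoding logic at all; changing the target encoding means changing one layer name in the open call.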

Retrofitting with binmode

If the filehandle is already open — for example the standard streams, or a handle handed to you by another module — attach the layer with binmode:

binmode STDIN,  ":encoding(UTF-8)";
binmode STDOUT, ":encoding(UTF-8)";
binmode STDERR, ":encoding(UTF-8)";

Do this once, at the top of the program, before any I/O on those handles. After that the standard streams behave the same as a file opened with an encoding layer.
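
The same call works on any handle that arrives without an encoding layer. In this sketch, open_plain is a hypothetical stand-in for a module that hands back an already-open handle, and the temporary file exists only for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 file to read back.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ":raw");
print {$tmp} "na\xC3\xAFve\n";          # "naive" with i-diaeresis, as UTF-8 bytes
close $tmp or die "close: $!";

# Hypothetical stand-in for code you do not control: it opens the
# file without naming any layer.
sub open_plain {
    my ($p) = @_;
    open my $fh, "<", $p or die "open $p: $!";
    return $fh;
}

my $fh = open_plain($path);

# Attach the layer before the first read, and check the result:
# binmode returns false on failure.
binmode($fh, ":encoding(UTF-8)") or die "binmode: $!";

my $line = <$fh>;
chomp $line;
close $fh;

printf "%d characters\n", length($line);
```

Checking the return value of binmode costs little and catches a misspelled layer name early.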

The use open pragma

Writing the same :encoding(UTF-8) layer on every open is noisy. The open pragma sets the default layers for every later open in the enclosing lexical scope that does not name layers of its own:

use open ":std", ":encoding(UTF-8)";

The :std tag also re-layers STDIN, STDOUT, and STDERR on the spot, so you get the effect of three binmode calls and a default for every later open in one line.

These two lines are the usual preamble at the top of a modern Unicode-aware script:

use utf8;
use open ":std", ":encoding(UTF-8)";

  • use utf8 — literals in this source file are UTF-8.

  • use open ":std", ":encoding(UTF-8)" — stdin/stdout/stderr and every later open default to UTF-8 on the wire.
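
One way to confirm the pragma's scope is to compare the layers on two handles, one opened inside a block that uses the pragma and one opened outside it. This is a small sketch. It leaves out :std so the standard streams stay untouched, and the temporary file is an illustrative assumption.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use PerlIO;

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

my (@inside, @outside);

{
    # Default layer for opens in this block only.
    use open ":encoding(UTF-8)";

    open my $fh, "<", $path or die "open: $!";   # no layer named here
    @inside = PerlIO::get_layers($fh);
    close $fh;
}

# Outside the block the pragma is no longer in effect.
open my $fh, "<", $path or die "open: $!";
@outside = PerlIO::get_layers($fh);
close $fh;

print "inside:  @inside\n";
print "outside: @outside\n";
```

The handle opened inside the block reports an encoding(...) layer even though the open call never named one; the handle opened outside does not.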

Binary I/O alongside text I/O

Some handles carry bytes, not text — image files, compressed streams, network protocols that frame their own encoding. Open those :raw and never attach an encoding:

open my $img, "<:raw", "photo.jpg" or die "open: $!";

Within one program, some handles can be text (:encoding(UTF-8)) and others binary (:raw). Keep the labels straight: a binary handle gives you byte strings and a text handle gives you text strings. Only the data differs; failures are reported through $! the same way on both.
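
A typical use of a :raw handle is checking a file signature, where every byte must arrive unchanged. The helper looks_like_jpeg below is hypothetical, and the four-byte temporary file merely stands in for a real photo.jpg.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true if the file starts with the JPEG
# start-of-image marker FF D8 FF.
sub looks_like_jpeg {
    my ($path) = @_;
    open my $fh, "<:raw", $path or die "open $path: $!";
    my $got = read($fh, my $magic, 3);
    die "read $path: $!" unless defined $got;
    close $fh;
    return $got == 3 && $magic eq "\xFF\xD8\xFF";
}

# Stand-in for photo.jpg: only the first four bytes of a JFIF file.
my ($tmp, $jpeg_path) = tempfile(UNLINK => 1);
binmode($tmp, ":raw");
print {$tmp} "\xFF\xD8\xFF\xE0";
close $tmp or die "close: $!";

my $is_jpeg = looks_like_jpeg($jpeg_path);
print $is_jpeg ? "JPEG\n" : "not JPEG\n";
```

The comparison is between byte strings on both sides, which is why no encoding layer belongs on this handle.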

Interactive input

A terminal does not announce which encoding its input uses. On modern Linux it is almost always UTF-8; on Windows the console still commonly defaults to a legacy OEM code page such as CP437 or CP850. The usual assumption for cross-platform scripts is UTF-8, with binmode applied at startup:

binmode STDIN, ":encoding(UTF-8)";

If your program must support legacy terminals, decode based on the LANG or LC_CTYPE environment variable, or accept a --encoding flag. There is no portable auto-detection.
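
A minimal sketch of that fallback chain follows. The helper pick_encoding and its rules are assumptions made for illustration: an explicit --encoding flag wins, then the codeset part of the first locale variable that is set (LC_ALL is consulted first because it overrides the other two), then UTF-8.

```perl
use strict;
use warnings;
use Encode ();
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: choose an encoding name for STDIN from the
# command line and the environment. This is a guess, not detection.
sub pick_encoding {
    my ($argv, $env) = @_;

    my $name;
    GetOptionsFromArray($argv, "encoding=s" => \$name)
        or die "usage: [--encoding NAME]\n";

    if (!defined $name) {
        for my $var (qw(LC_ALL LC_CTYPE LANG)) {
            my $locale = $env->{$var};
            next unless defined $locale && length $locale;
            # "de_DE.ISO-8859-15@euro" -> "ISO-8859-15"
            $name = $1 if $locale =~ /\.([^@]+)/;
            last;                      # the first variable that is set decides
        }
    }

    $name = "UTF-8" unless defined $name;      # the usual assumption
    Encode::find_encoding($name) or die "unknown encoding: $name\n";
    return $name;
}

my $from_flag = pick_encoding(["--encoding", "ISO-8859-1"], { LANG => 'en_US.UTF-8' });
my $from_env  = pick_encoding([], { LANG => 'de_DE.ISO-8859-15@euro' });
my $fallback  = pick_encoding([], {});

binmode(STDIN, ":encoding($fallback)") or die "binmode: $!";
```

In a real script the call would be pick_encoding(\@ARGV, \%ENV). Locale codeset names are not guaranteed to match names that Encode recognises, so the find_encoding check is there to fail loudly instead of guessing further.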

Checking layers

To see which layers are on a handle, use PerlIO::get_layers:

use PerlIO;
my @layers = PerlIO::get_layers(\*STDOUT);
print "@layers\n";                       # e.g. unix perlio encoding(utf-8-strict)

Useful when a program is silently producing double-encoded output and you want to find out which handle has the surprise layer on it.
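
The double-encoding case can be reproduced in a few lines. In this sketch the helper has_encoding_layer is hypothetical and the temporary file is only for demonstration. The bug shown is encoding a string by hand and then printing it to a handle that encodes again.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);
use PerlIO;

# Hypothetical helper: true if the handle already carries an
# :encoding(...) layer.
sub has_encoding_layer {
    my ($fh) = @_;
    my @enc = grep { /^encoding\(/ } PerlIO::get_layers($fh);
    return scalar @enc;
}

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

open my $out, ">:encoding(UTF-8)", $path or die "open: $!";
my $layered = has_encoding_layer($out);

# The bug: the string is encoded here and then once more by the layer.
print {$out} encode("UTF-8", "\x{e9}");
close $out or die "close: $!";

# Look at what actually reached the disk.
open my $raw, "<:raw", $path or die "open: $!";
my $on_disk = do { local $/; <$raw> };
close $raw;

printf "layer present: %s, bytes on disk: %d\n",
    ($layered ? "yes" : "no"), length($on_disk);
```

A single e-acute should occupy two bytes in UTF-8; four bytes on disk together with a positive has_encoding_layer result points at the extra encode call as the culprit.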