Unicode — a tutorial#
Modern programs carry accented letters, currency symbols, CJK ideographs, and emoji through their input and output. Perl handles Unicode well, but only if you are explicit about where text is text and where bytes are bytes. This tutorial is the shortest path to that discipline.
These pages do not teach Unicode itself. You are assumed to know that a character set numbers characters, that an encoding maps those numbers to bytes, and that UTF-8 is one such encoding among several. What follows is the Perl-specific habit you need on top.
The one idea#
A Perl string is either a text string — a sequence of Unicode
code points, where length counts
characters — or a binary string — a sequence of bytes, where
length counts bytes. The same
variable type holds both, but the meaning is different, and the
boundary between them is where encoding and decoding happen.
Almost every bug in this area comes from losing track of which kind of string you have. Almost every fix comes from making the boundary explicit.
my $text = "caf\x{e9}"; # 4 characters: c a f é
my $bytes = "caf\xc3\xa9"; # 5 bytes: c a f 0xc3 0xa9
length $text; # 4
length $bytes; # 5
Both strings “contain” the word café. Only one of them can go through a regex that expects characters; only the other can go down a byte-oriented socket.
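The two strings are related by exactly one encoding step. The following is a minimal sketch, assuming UTF-8 as the encoding and the core Encode module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text  = "caf\x{e9}";       # text string: 4 code points
my $bytes = "caf\xc3\xa9";     # binary string: the same word as 5 UTF-8 bytes

# Decoding turns bytes into characters; encoding does the reverse.
my $decoded = decode('UTF-8', $bytes);
my $encoded = encode('UTF-8', $text);

# The report itself is ASCII-only, so no output layer is needed here.
printf "decoded: %d characters, equals the text string: %s\n",
    length($decoded), ($decoded eq $text ? 'yes' : 'no');
printf "encoded: %d bytes, equals the binary string: %s\n",
    length($encoded), ($encoded eq $bytes ? 'yes' : 'no');
```

Crossing the boundary in either direction is always an explicit call; neither string changes kind by itself.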
The pipeline#
A program that handles text correctly follows one shape:
1. Decode everything coming in. Bytes on the outside become characters on the inside.
2. Process in characters. Regex, length, case conversion, the lot.
3. Encode everything going out. Characters on the inside become bytes on the outside.
Step 2 is where Perl is already comfortable. Steps 1 and 3 are where the discipline lives. This tutorial walks through both.
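The three steps can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended API: the helper name `shout` is hypothetical, and UTF-8 is assumed on both sides of the boundary.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes in, returns UTF-8 bytes out.
sub shout {
    my ($bytes_in) = @_;

    # 1. Decode: bytes on the outside become characters on the inside.
    my $text = decode('UTF-8', $bytes_in);

    # 2. Process in characters: uc() sees one character, not two bytes.
    my $upper = uc $text;

    # 3. Encode: characters on the inside become bytes on the outside.
    return encode('UTF-8', $upper);
}

my $out = shout("caf\xc3\xa9");    # input: the 5 UTF-8 bytes of the word
printf "%vd\n", $out;              # 67.65.70.195.137
```

The output lists the byte values of the result: the last two, 195 and 137 (0xc3 0x89), are the UTF-8 encoding of the uppercase character U+00C9.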
Chapters#
Strings and encodings: use utf8, the Encode module, what the internal representation is and is not.
I/O: opening files with an encoding, standard streams, binmode and the :encoding layer.
Regex and properties: matching Unicode in patterns, \p{…} property classes, the /a, /u, /l, and /aa modifiers.
Pitfalls: “Wide character in print”, double encoding, locale interactions, and how to recognise each one from its symptom.
Conventions#
Expected output appears as an inline # … comment or directly below the code block.
Code points are written with \x{…} hex escapes so the printed text matches what you see.
Examples assume a UTF-8-capable terminal. If yours is not, output containing non-ASCII characters will look different, but the program is still correct.