Unicode - a tutorial

Modern programs carry accented letters, currency symbols, CJK ideographs, and emoji through their input and output. Perl handles Unicode well, but only if you are explicit about where text is text and where bytes are bytes. This tutorial is the shortest path to that discipline.

These pages do not teach Unicode itself. You are assumed to know that a character set numbers characters, that an encoding maps those numbers to bytes, and that UTF-8 is one such encoding among several. What follows is the Perl-specific habit you need on top.

The one idea

A Perl string is either a text string - a sequence of Unicode code points, where length counts characters - or a binary string - a sequence of bytes, where length counts bytes. The same variable type holds both, but the meaning is different, and the boundary between them is where encoding and decoding happen.

Nearly every bug in this area comes from losing track of which kind of string you have. The fix is almost always to make the boundary explicit.

my $text  = "caf\x{e9}";          # 4 characters: c a f é
my $bytes = "caf\xc3\xa9";        # 5 bytes:      c a f 0xc3 0xa9
length $text;                     # 4
length $bytes;                    # 5

Both strings “contain” the word café. Only $text behaves correctly in a regex that expects characters; only $bytes is ready to go down a byte-oriented socket.
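Crossing the boundary in each direction can be sketched as follows. This is a minimal example, assuming the core Encode module and that the outside world speaks UTF-8; the variable names other than $text and $bytes are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket (assumed to be UTF-8).
my $bytes = "caf\xc3\xa9";

# Decode: bytes on the outside become characters on the inside.
my $text = decode('UTF-8', $bytes);

my $len_text      = length $text;                      # 4
my $text_matches  = $text  =~ /\Acaf\w\z/ ? 1 : 0;     # 1: é is one word character
my $bytes_matches = $bytes =~ /\Acaf\w\z/ ? 1 : 0;     # 0: five bytes, not four characters

# Encode: characters on the inside become bytes on the outside.
my $out     = encode('UTF-8', $text);
my $len_out = length $out;                             # 5
```

The round trip gives back the original five bytes, and only the decoded string matches the four-character pattern.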

The pipeline

A program that handles text correctly follows one shape:

1. Decode everything coming in. Bytes on the outside become characters on the inside.
2. Process in characters. Regex, length, case conversion, the lot.
3. Encode everything going out. Characters on the inside become bytes on the outside.

Step 2 is where Perl is already comfortable. Steps 1 and 3 are where the discipline lives. This tutorial walks through both.
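The whole pipeline can be sketched end to end. This is one possible shape rather than the only one: it assumes the data is UTF-8, and it uses in-memory filehandles so that it runs without touching the disk. $incoming and $outgoing are hypothetical stand-ins for a real file or socket.

```perl
use strict;
use warnings;

# Stand-in for external input: the UTF-8 bytes of "café\n" and "naïve\n".
my $incoming = "caf\xc3\xa9\nna\xc3\xafve\n";
my $outgoing = '';

# Decode on the way in: the layer turns bytes into characters.
open my $in, '<:encoding(UTF-8)', \$incoming or die "open in: $!";

# Encode on the way out: the layer turns characters back into bytes.
open my $out, '>:encoding(UTF-8)', \$outgoing or die "open out: $!";

while (my $line = <$in>) {
    chomp $line;
    # Process in characters: uc and length both see é and ï as one character.
    print {$out} uc($line), " ", length($line), "\n";
}

close $in;
close $out or die "close out: $!";

# $outgoing now holds the UTF-8 bytes of "CAFÉ 4\nNAÏVE 5\n".
```

Because both layers are declared once at open, the loop body never mentions an encoding at all.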

Chapters

Strings and encodings - use utf8, the Encode module, what the internal representation is and is not.
I/O - opening files with an encoding, standard streams, binmode and the :encoding layer.
Regex and properties - matching Unicode in patterns, \p{…} property classes, the /a, /u, /l, and /aa modifiers.
Pitfalls - “Wide character in print”, double encoding, locale interactions, and how to recognise each one from its symptom.
Recipes - quick-reference one-liners for case-insensitive sorting with fc, normalising line breaks with \R, and working by grapheme cluster with \X.
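The three recipes named in the last entry look roughly like this. It is a hedged preview with made-up sample data, assuming Perl 5.16 or later for fc; the Recipes chapter may present them differently.

```perl
use strict;
use warnings;
use feature 'fc';                      # fc needs Perl 5.16 or later

# Case-insensitive sort: compare the case-folded forms.
my @sorted = sort { fc($a) cmp fc($b) } ("banana", "apple", "Cherry");
# apple, banana, Cherry

# Normalise any kind of line break to "\n".
(my $unix = "one\r\ntwo\rthree\n") =~ s/\R/\n/g;
# "one\ntwo\nthree\n"

# Work by grapheme cluster: "e" plus a combining acute accent is one cluster.
my $word     = "cafe\x{301}";
my $points   = length $word;           # 5 code points
my $clusters = () = $word =~ /\X/g;    # 4 grapheme clusters
```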

Conventions

Expected output appears as an inline # … comment or directly below the code block.
Code points are written with \x{…} hex escapes, so the examples are plain ASCII and behave the same however the source file is saved.
Examples assume a UTF-8-capable terminal. If yours is not, output containing non-ASCII characters will look different, but the program is still correct.