Regex and properties#

Once your strings are decoded, regex works on characters - and Perl’s regex engine knows about Unicode. \w matches every Unicode word character, case-insensitive matching folds across scripts, and \p{…} gives you access to the full Unicode character database. This chapter covers the habits you need to make that work reliably.

Characters, not bytes#

On a text string, every regex metacharacter speaks in characters:

use utf8;
my $s = "café";                          # 4 characters
$s =~ /^.{4}$/;                          # matches
$s =~ /^.{5}$/;                          # does not

On a binary string, the same metacharacters count bytes. That is usually not what you want - if you find yourself writing a regex against bytes that hold text, decode first.
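To see the difference side by side, here is a minimal sketch that runs the same patterns against a text string and against its UTF-8 encoding. It assumes the core Encode module is available; the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                        # the literal below is text, not bytes
use Encode qw(encode);

my $text  = "café";                      # text string: 4 characters
my $bytes = encode('UTF-8', $text);      # binary string: 5 bytes, é takes two

my $text_len  = length $text;                       # 4
my $bytes_len = length $bytes;                      # 5

my $text_has_4  = $text  =~ /^.{4}$/ ? 1 : 0;       # 1
my $bytes_has_4 = $bytes =~ /^.{4}$/ ? 1 : 0;       # 0 - the engine sees 5 units
my $bytes_has_5 = $bytes =~ /^.{5}$/ ? 1 : 0;       # 1

print "text: $text_len chars, bytes: $bytes_len bytes\n";
```

The pattern is identical in both cases; only the kind of string it is applied to changes what `.` counts.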

`\w`, `\d`, `\s` and friends#

In default (Unicode) mode, these shorthands match their Unicode equivalents:

\w - letters, digits, and underscore from every script.
\d - decimal digits from every script (not just 0–9).
\s - whitespace of every kind, including U+00A0 no-break space.

use utf8;
"café"  =~ /^\w+$/;                      # matches
"café"  =~ /^[a-z]+$/;                   # does not - é is outside a-z
"\x{0660}\x{0661}\x{0662}" =~ /^\d+$/;   # matches Arabic-Indic digits

When you want the old ASCII-only meaning, add the /a modifier - see Modifiers below.
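A small sketch of the shorthands with and without /a may help here. The variable names are made up for illustration. The /u flag on the no-break space test is spelled out on purpose: a string whose code points all sit below 256 can otherwise be matched with Perl's older native rules, depending on how it was built and which features are enabled.

```perl
use strict;
use warnings;

my $arabic = "\x{0660}\x{0661}\x{0662}";   # ARABIC-INDIC DIGIT ZERO, ONE, TWO
my $nbsp   = "a\x{00A0}b";                  # NO-BREAK SPACE between a and b

my $digits_unicode = $arabic =~ /^\d+$/  ? 1 : 0;    # 1 - Unicode decimal digits
my $digits_ascii   = $arabic =~ /^\d+$/a ? 1 : 0;    # 0 - /a restricts \d to 0-9
my $space_unicode  = $nbsp =~ /^a\sb$/u ? 1 : 0;     # 1 - U+00A0 is whitespace
my $space_ascii    = $nbsp =~ /^a\sb$/a ? 1 : 0;     # 0 - not ASCII whitespace

print "digits: $digits_unicode/$digits_ascii, space: $space_unicode/$space_ascii\n";
```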

The `\p{…}` property classes#

\p{PROPERTY} matches any character with a named Unicode property. The property names are drawn from the Unicode character database; the commonly useful ones are:

\p{Letter}, shorthand \p{L} - any letter.
\p{Lu}, \p{Ll}, \p{Lt} - upper, lower, title case letters.
\p{Number}, shorthand \p{N} - any numeric character; \p{Nd} - decimal digits only.
\p{Punct}, shorthand \p{P} - punctuation.
\p{Space} - any whitespace; \p{Zs} - space separators only.
\p{ASCII} - the 128 ASCII code points.
\p{Script=Greek} - every character that belongs to the Greek script. Any script name from the Unicode data works here.
\p{Block=Cyrillic} - every character in the Cyrillic block. (Script and block are different: script is the writing system a character belongs to; block is the code-point range it lives in.)

\P{…} is the negation - any character without that property.

use utf8;
my $s = "Γειά σου, κόσμε";
my @greek_letters = $s =~ /(\p{Script=Greek})/g;
scalar @greek_letters;                   # 12

The full catalogue of properties ships with Perl; perldoc perluniprops enumerates every name the engine accepts.
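As a rough illustration of the properties listed above, the sketch below counts characters by property in a mixed-script string, then checks one character against both a script and a block. The sample string and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;
use utf8;

my $s = "abc αβγ где 123";               # Latin, Greek, Cyrillic, digits

# The "= () =" idiom forces list context and counts the matches.
my $letters     = () = $s =~ /\p{L}/g;                 # 9
my $greek       = () = $s =~ /\p{Script=Greek}/g;      # 3
my $cyrillic    = () = $s =~ /\p{Script=Cyrillic}/g;   # 3
my $digits      = () = $s =~ /\p{Nd}/g;                # 3
my $non_letters = () = $s =~ /\P{L}/g;                 # 6 - three spaces, three digits

# U+0500 belongs to the Cyrillic script but sits in the
# Cyrillic Supplement block, outside the Cyrillic block proper.
my $komi_de   = "\x{0500}";              # CYRILLIC CAPITAL LETTER KOMI DE
my $in_script = $komi_de =~ /\p{Script=Cyrillic}/ ? 1 : 0;   # 1
my $in_block  = $komi_de =~ /\p{Block=Cyrillic}/  ? 1 : 0;   # 0

print "letters=$letters greek=$greek cyrillic=$cyrillic digits=$digits\n";
```

The last two lines are why Script is usually the better choice when the question is "which writing system is this?".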

Modifiers: `/a`, `/u`, `/l`, `/aa`#

Four modifiers change how character classes interpret themselves:

/u - Unicode mode. \w matches Unicode word characters, \d matches Unicode decimal digits, case folding uses Unicode rules. This is the default on text strings.
/a - ASCII mode. \w becomes [A-Za-z0-9_], \d becomes [0-9], \s becomes [ \t\n\r\f\cK] (the vertical tab is included from Perl 5.18 on). Case folding is still Unicode: /k/ai still matches U+212A KELVIN SIGN, because that character folds to k - which is almost never what you want when you reached for /a.
/aa - strict ASCII. /a plus case folding restricted so that an ASCII character never matches a non-ASCII one. /k/aai matches K but not U+212A KELVIN SIGN. Use this when you want an identifier match against a known-ASCII specification (HTTP header names, command keywords).
/l - locale mode. Defers to the current POSIX locale. Almost never what you want in a Unicode-aware program; mentioned only so you can recognise it when another codebase uses it.

use utf8;
"café" =~ /\w+/;                         # matches whole word (Unicode)
"café" =~ /\w+/a;                        # matches "caf" (ASCII only)
"FOO"  =~ /foo/ai;                       # matches (plain ASCII case fold)
"\x{212A}" =~ /k/ai;                     # matches - KELVIN SIGN folds to k
"\x{212A}" =~ /k/aai;                    # does not - strict ASCII

Rule of thumb: default mode is right for text; switch to /aa when you are parsing an ASCII-specified protocol and want to reject smuggled non-ASCII characters.
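The rule of thumb can be sketched as follows. looks_like_header_name is a hypothetical helper written for this example, and its character class is a deliberate simplification rather than the real HTTP token grammar.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";                 # KELVIN SIGN, looks like K

my $k_default = $kelvin =~ /^k$/i   ? 1 : 0;   # 1 - Unicode fold
my $k_a       = $kelvin =~ /^k$/ai  ? 1 : 0;   # 1 - /a leaves folding alone
my $k_aa      = $kelvin =~ /^k$/aai ? 1 : 0;   # 0 - /aa keeps ASCII apart

# Hypothetical helper: accept only ASCII word characters and hyphens.
sub looks_like_header_name {
    my ($name) = @_;
    return $name =~ /\A[\w-]+\z/aa ? 1 : 0;
}

my $ok  = looks_like_header_name("Content-Type");          # 1
my $bad = looks_like_header_name("Content-Typ\x{0435}");   # 0 - CYRILLIC SMALL LETTER IE

print "kelvin: $k_default/$k_a/$k_aa, header: $ok/$bad\n";
```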

Case-insensitive matching#

/i folds both sides of the match using Unicode case-folding tables. This handles the obvious European pairs (é/É, ß/SS), and the less obvious ones (İ/i̇, ﬁ/fi).

use utf8;
"Ångström" =~ /ångström/i;               # matches
"STRASSE"  =~ /straße/i;                 # matches - ß and SS both fold to ss

For ASCII-only case folding (the old meaning of /i, where only A-Z and a-z pair up), use /iaa.
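A short sketch of the multi-character folds, and of /iaa switching them off, assuming a UTF-8 source file. The variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;

my $ligature = "\x{FB01}";               # LATIN SMALL LIGATURE FI

my $strasse_unicode = "STRASSE" =~ /straße/i   ? 1 : 0;   # 1 - ß matches SS
my $strasse_ascii   = "STRASSE" =~ /straße/iaa ? 1 : 0;   # 0 - ß matches no ASCII text
my $fi_unicode      = $ligature =~ /fi/i       ? 1 : 0;   # 1 - ligature matches f + i
my $fi_ascii        = $ligature =~ /fi/iaa     ? 1 : 0;   # 0
my $plain_ascii     = "FOO"     =~ /foo/iaa    ? 1 : 0;   # 1 - ASCII pairs still fold

print "strasse: $strasse_unicode/$strasse_ascii, fi: $fi_unicode/$fi_ascii\n";
```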

Named code points#

\N{...} names a Unicode character, either by its official name or by its code-point number:

use charnames qw(:full);                 # optional from Perl 5.16 on
"\N{LATIN SMALL LETTER E WITH ACUTE}" eq "\x{e9}";   # true
"\N{U+00E9}" eq "\x{e9}";                             # the same character

\N{U+HEX} works without any pragma; the long names go through the charnames module, which Perl 5.16 and later load automatically when they meet a \N{...} escape (on older perls, load it explicitly with use charnames qw(:full)). In a pattern, \N{...} is useful when you want the source to document which character you mean without relying on the reader to recognise a hex escape:

use utf8;
"café" =~ /caf\N{LATIN SMALL LETTER E WITH ACUTE}/;  # matches

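The three spellings of the same character can be compared directly. The sketch below also looks the name up at run time with charnames::viacode, one of the functions the charnames module provides; the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames qw(:full);

my $by_name = "\N{LATIN SMALL LETTER E WITH ACUTE}";
my $by_code = "\N{U+00E9}";
my $by_hex  = "\x{e9}";

# All three spell the same single character.
my $same = ($by_name eq $by_code && $by_code eq $by_hex) ? 1 : 0;   # 1

my $word    = "caf\x{e9}";
my $matched = $word =~ /^caf\N{LATIN SMALL LETTER E WITH ACUTE}$/ ? 1 : 0;   # 1

# Going the other way: from a code point to its official name.
my $name = charnames::viacode(0xE9);

print "U+00E9 is $name\n";
```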
Splitting and joining#

split and substr both count in the units of the string they operate on. On a text string, the count is characters:

use utf8;
my $s = "a,é,b,Ω";
my @parts = split /,/, $s;               # ("a", "é", "b", "Ω")
substr($s, 2, 1);                        # "é"

On a binary string, the same operations count bytes. This is a common source of subtle bugs: in an older codebase, one function now decodes its input, while another, written before the decode step was introduced, still indexes the result as if it were bytes.
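The mismatch is easy to reproduce. The sketch below, which assumes the core Encode module, indexes the same data once as text and once as UTF-8 bytes; the string is written with escapes so that its contents are unambiguous.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "a,\x{e9},b,\x{3a9}";        # a , é , b , Ω  - 7 characters
my $bytes = encode('UTF-8', $text);      # é and Ω take two bytes each - 9 bytes

my $text_len  = length $text;            # 7
my $bytes_len = length $bytes;           # 9

my $char_at_2 = substr($text,  2, 1);    # "\x{e9}" - the whole é
my $byte_at_2 = substr($bytes, 2, 1);    # "\xC3"   - only the first byte of é

# Decoding first makes the same offset mean the same thing again.
my $fixed = substr(decode('UTF-8', $bytes), 2, 1);   # "\x{e9}"

my @parts = split /,/, $text;            # 4 fields

printf "chars=%d bytes=%d fields=%d\n", $text_len, $bytes_len, scalar @parts;
```

An offset computed on one kind of string is only meaningful on that kind of string, so offsets should not be passed across a decode or encode boundary.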