Regex and properties

Once your strings are decoded, regex works on characters — and Perl’s regex engine knows about Unicode. \w matches every Unicode word character, case-insensitive matching folds across scripts, and \p{…} gives you access to the full Unicode character database. This chapter covers the habits you need to make that work reliably.

Characters, not bytes

On a text string, every regex metacharacter speaks in characters:

use utf8;
my $s = "café";                          # 4 characters
$s =~ /^.{4}$/;                          # matches
$s =~ /^.{5}$/;                          # does not

On a binary string, the same metacharacters count bytes. That is usually not what you want — if you find yourself writing a regex against bytes that hold text, decode first.
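The difference is easy to see with the same pattern applied before and after decoding. This is a minimal sketch that assumes the input arrives as UTF-8 bytes and uses the core Encode module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "café", as they might arrive from a file or socket.
my $bytes = "caf\xC3\xA9";

# On the undecoded string, "." counts bytes: five of them, so no match.
my $bytes_match = $bytes =~ /^.{4}$/ ? 1 : 0;

# Decode once at the boundary; after that, "." counts characters.
my $text       = decode('UTF-8', $bytes);
my $text_match = $text =~ /^.{4}$/ ? 1 : 0;

printf "bytes: %d, text: %d\n", $bytes_match, $text_match;   # bytes: 0, text: 1
```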

\w, \d, \s and friends

In default (Unicode) mode, these shorthands match their Unicode equivalents:

  • \w — letters, digits, and underscore from every script.

  • \d — decimal digits from every script (not just 0-9).

  • \s — whitespace of every kind, including U+00A0 no-break space.

use utf8;
"café"  =~ /^\w+$/;                      # matches
"café"  =~ /^[a-z]+$/;                   # does not — é is outside
"\x{0660}\x{0661}\x{0662}" =~ /^\d+$/;   # matches Arabic-Indic digits

When you want the old ASCII-only meaning, add the /a modifier — see Modifiers below.
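The practical consequence shows up in input validation. The sketch below contrasts the two meanings of \d; both helper names are hypothetical, and the /a modifier assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: accept decimal digits from any script.
sub is_unicode_number {
    my ($s) = @_;
    return $s =~ /\A\d+\z/ ? 1 : 0;
}

# Hypothetical helper: accept ASCII digits only, e.g. for a port number.
sub is_ascii_number {
    my ($s) = @_;
    return $s =~ /\A\d+\z/a ? 1 : 0;
}

my $arabic = "\x{0660}\x{0661}\x{0662}";   # ARABIC-INDIC DIGITS ZERO, ONE, TWO

printf "unicode: %d, ascii: %d\n",
    is_unicode_number($arabic), is_ascii_number($arabic);   # unicode: 1, ascii: 0
```

If the value is later handed to code that expects 0-9 (arithmetic, a database column, a protocol field), the /a form is the safer check.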

The \p{…} property classes

\p{PROPERTY} matches any character with a named Unicode property. The property names are drawn from the Unicode character database; the commonly useful ones are:

  • \p{Letter}, shorthand \p{L} — any letter.

  • \p{Lu}, \p{Ll}, \p{Lt} — upper, lower, title case letters.

  • \p{Number}, shorthand \p{N} — any numeric character; \p{Nd} covers decimal digits only.

  • \p{Punct}, \p{P} — punctuation.

  • \p{Space} — any whitespace; \p{Zs} covers space separators only.

  • \p{ASCII} — the 128 ASCII code points.

  • \p{Script=Greek} — every character that belongs to the Greek script. Any script name from the Unicode data works here.

  • \p{Block=Cyrillic} — every character in the Cyrillic block. (Script and block are different: script is the writing system a character belongs to; block is the code-point range it lives in.)

\P{…} is the negation — any character without that property.

use utf8;
my $s = "Γειά σου, κόσμε";
my @greek_letters = $s =~ /(\p{Script=Greek})/g;
scalar @greek_letters;                   # 12

The full catalogue of properties ships with Perl; perldoc perluniprops enumerates every name the engine accepts.
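The script/block distinction is easiest to see on code points where the two disagree. The following sketch tests three characters against both properties; the helper name is hypothetical, and the results assume the Unicode data shipped with a reasonably recent Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the two properties a character has.
sub cyrillic_membership {
    my ($char) = @_;
    return {
        script => ($char =~ /\A\p{Script=Cyrillic}\z/ ? 1 : 0),
        block  => ($char =~ /\A\p{Block=Cyrillic}\z/  ? 1 : 0),
    };
}

# U+0436 CYRILLIC SMALL LETTER ZHE: Cyrillic script, Cyrillic block.
my $zhe = cyrillic_membership("\x{0436}");

# U+0500 CYRILLIC CAPITAL LETTER KOMI DE: Cyrillic script, but it lives
# in the Cyrillic Supplement block.
my $komi_de = cyrillic_membership("\x{0500}");

# U+0485 COMBINING CYRILLIC DASIA PNEUMATA: inside the Cyrillic block,
# but its script is Inherited.
my $dasia = cyrillic_membership("\x{0485}");

printf "%-8s script=%d block=%d\n", @$_
    for [ 'zhe',     @{$zhe}{qw(script block)} ],
        [ 'komi_de', @{$komi_de}{qw(script block)} ],
        [ 'dasia',   @{$dasia}{qw(script block)} ];
```

For matching text in a given writing system, the script property is usually the one you want; blocks are mostly useful when you care about code-point ranges as such.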

Modifiers: /a, /u, /l, /aa

Four modifiers change how character classes interpret themselves:

  • /u: Unicode mode. \w matches Unicode word characters, \d matches Unicode decimal digits, case folding uses Unicode rules. This is the default on text strings.

  • /a: ASCII mode. \w becomes [A-Za-z0-9_], \d becomes [0-9], \s becomes [ \t\n\r\f] (plus vertical tab from Perl 5.18 on). Case folding is still Unicode: /k/ai still matches U+212A KELVIN SIGN, because that character folds to k. This is almost never what you want when you reached for /a.

  • /aa: strict ASCII. /a plus case folding restricted to ASCII, so under /i an ASCII character never matches a non-ASCII one. /k/aai matches K but not U+212A KELVIN SIGN. Use this when you want an identifier match against a known-ASCII specification (HTTP header names, command keywords).

  • /l: locale mode. Defers to the current POSIX locale. Almost never what you want in a Unicode-aware program; mentioned only so you can recognise it when another codebase uses it.

use utf8;
"café" =~ /\w+/;                         # matches whole word (Unicode)
"café" =~ /\w+/a;                        # matches "caf" (ASCII only)
"\x{212A}" =~ /k/ai;                     # matches: KELVIN SIGN folds to k
"\x{212A}" =~ /k/aai;                    # does not: strict ASCII folding

Rule of thumb: default mode is right for text; switch to /aa when you are parsing an ASCII-specified protocol and want to reject smuggled non-ASCII characters.
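As a sketch of that rule of thumb, consider a case-insensitive keyword check for an ASCII-specified protocol. The keyword "ok" and both helper names are invented for illustration; the point is only the difference between /ai and /aai on a look-alike character.

```perl
use strict;
use warnings;

# Hypothetical check using /ai: ASCII classes, but Unicode case folding.
sub is_ok_keyword_loose {
    my ($s) = @_;
    return $s =~ /\Aok\z/ai ? 1 : 0;
}

# Hypothetical check using /aai: ASCII classes and ASCII-only folding.
sub is_ok_keyword_strict {
    my ($s) = @_;
    return $s =~ /\Aok\z/aai ? 1 : 0;
}

my $smuggled = "O\x{212A}";   # "O" followed by U+212A KELVIN SIGN

printf "loose: %d, strict: %d\n",
    is_ok_keyword_loose($smuggled),
    is_ok_keyword_strict($smuggled);   # loose: 1, strict: 0
```

Both helpers accept the plain ASCII spellings "ok", "OK", and "Ok"; only the strict one rejects the Kelvin sign.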

Case-insensitive matching

/i folds both sides of the match using Unicode case-folding tables. This handles the obvious European pairs (é/É, ß/SS), and the less obvious ones (İ, U+0130, against i followed by U+0307 combining dot above; the ﬁ ligature, U+FB01, against fi).

use utf8;
"Ångström" =~ /ångström/i;               # matches
"STRASSE"  =~ /straße/i;                 # matches — ß folds to SS

For ASCII-only case folding (the old meaning of /i), use /iaa.
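A multi-character fold makes the contrast between the two modes concrete. This small sketch matches a string containing the ﬁ ligature against the plain ASCII pattern; the variable names are illustrative.

```perl
use strict;
use warnings;

# U+FB01 LATIN SMALL LIGATURE FI followed by "nd": three characters.
my $ligature_word = "\x{FB01}nd";

# Unicode folding expands the ligature to "fi" before comparing.
my $unicode_fold = $ligature_word =~ /\Afind\z/i ? 1 : 0;

# Under /aa, ASCII pattern characters never match non-ASCII ones.
my $ascii_fold = $ligature_word =~ /\Afind\z/iaa ? 1 : 0;

printf "Unicode fold: %d, ASCII-only fold: %d\n", $unicode_fold, $ascii_fold;
```

Note that the three-character string matches a four-character pattern under Unicode folding, so do not assume matched text has the same length as the pattern.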

Named code points

\N{…} names a Unicode character by its official name:

use utf8;                                       # the literals below are UTF-8 source text
use charnames ':full';                          # explicit load; needed before Perl 5.16
"\N{LATIN SMALL LETTER E WITH ACUTE}" eq "é";   # true
"\N{U+2014}" eq "—";                            # em-dash

\N{U+HEX} works everywhere. The long names need an explicit use charnames qw(:full) on Perl versions before 5.16; from 5.16 on, the module is loaded automatically. use charnames qw(:loose) additionally relaxes case and spacing in the names. In a pattern, \N{…} is useful when you want the source to document which character you mean without relying on the reader to recognise a hex escape.
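As an illustration of \N{…} inside a pattern, the sketch below normalises typographic dashes to an ASCII hyphen. The helper name and the replacement format are assumptions made for the example.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper: replace em and en dashes, with any surrounding
# whitespace, by " - ". The names say which characters are meant.
sub ascii_dashes {
    my ($text) = @_;
    $text =~ s/\s*[\N{EM DASH}\N{EN DASH}]\s*/ - /g;
    return $text;
}

my $out = ascii_dashes("one\N{EM DASH}two \N{EN DASH} three");
print "$out\n";   # one - two - three
```

The equivalent class written as [\x{2014}\x{2013}] behaves the same; the named form just spares the reader a table lookup.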

Splitting and joining

split and substr both count in the units of the string they operate on. On a text string, the count is characters:

use utf8;
my $s = "a,é,b,Ω";
my @parts = split /,/, $s;               # ("a", "é", "b", "Ω")
substr($s, 2, 1);                        # "é"

On a binary string, the same operations count bytes. This is a common source of subtle bugs in older codebases: one function decodes the input, while another, written before the decode step was introduced, still indexes the result as if it were bytes.
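The following sketch shows the same substr call on both forms of the string from the example above. It assumes UTF-8 as the byte encoding and uses the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "a,é,b,Ω" as UTF-8 bytes, then decoded back into a text string.
my $bytes = encode('UTF-8', "a,\x{E9},b,\x{3A9}");
my $text  = decode('UTF-8', $bytes);

# Same offset, different units.
my $byte_at_2 = substr($bytes, 2, 1);   # "\xC3": only the first byte of é
my $char_at_2 = substr($text,  2, 1);   # "\x{E9}": the whole character é

printf "bytes: %d units, text: %d units\n",
    length($bytes), length($text);      # bytes: 9 units, text: 7 units
```

An offset computed on one form is meaningless on the other, so keep offsets and the string they index on the same side of the decode step.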