Regex and properties#
Once your strings are decoded, regex works on characters — and
Perl’s regex engine knows about Unicode. \w matches every Unicode
word character, case-insensitive matching folds across scripts, and
\p{…} gives you access to the full Unicode character database.
This chapter covers the habits you need to make that work reliably.
Characters, not bytes#
On a text string, every regex metacharacter speaks in characters:
use utf8;
my $s = "café"; # 4 characters
$s =~ /^.{4}$/; # matches
$s =~ /^.{5}$/; # does not
On a binary string, the same metacharacters count bytes. That is usually not what you want — if you find yourself writing a regex against bytes that hold text, decode first.
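As a minimal sketch of the difference, the following decodes a hand-written UTF-8 byte sequence for "café" with the core Encode module and compares what length and the dot metacharacter see before and after. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "café", as they might arrive from a file or socket.
my $bytes = "caf\xC3\xA9";
my $text  = decode('UTF-8', $bytes);

my $bytes_len  = length $bytes;                 # counts bytes
my $text_len   = length $text;                  # counts characters
my $bytes_dot4 = $bytes =~ /^.{4}$/ ? 1 : 0;    # 5 bytes, so no match
my $text_dot4  = $text  =~ /^.{4}$/ ? 1 : 0;    # 4 characters, so it matches

print "bytes: length=$bytes_len, four-dot match=$bytes_dot4\n";
print "text:  length=$text_len, four-dot match=$text_dot4\n";
```

The same pattern gives two different answers purely because of what the string holds, which is why decoding at the input boundary matters.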
\w, \d, \s and friends#
In default (Unicode) mode, these shorthands match their Unicode equivalents:
- \w: letters, digits, and underscore from every script.
- \d: decimal digits from every script (not just 0–9).
- \s: whitespace of every kind, including U+00A0 no-break space.
use utf8;
"café" =~ /^\w+$/; # matches
"café" =~ /^[a-z]+$/; # does not — é is outside
"\x{0660}\x{0661}\x{0662}" =~ /^\d+$/; # matches Arabic-Indic digits
When you want the old ASCII-only meaning, add the /a modifier —
see Modifiers below.
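A short sketch of how the choice affects input validation, assuming a hypothetical pair of helpers that check whether a string consists only of digits. The helper names are made up for illustration.

```perl
use v5.14;      # the /a modifier needs Perl 5.14 or later
use warnings;

# Hypothetical helper: digits from any script.
sub is_unicode_digits {
    my ($s) = @_;
    return $s =~ /^\d+$/ ? 1 : 0;
}

# Hypothetical helper: ASCII digits 0-9 only.
sub is_ascii_digits {
    my ($s) = @_;
    return $s =~ /^\d+$/a ? 1 : 0;
}

my $arabic_indic = "\x{0660}\x{0661}\x{0662}";   # ARABIC-INDIC DIGITS ZERO, ONE, TWO

say is_unicode_digits($arabic_indic);   # accepted in default mode
say is_ascii_digits($arabic_indic);     # rejected under /a
say is_ascii_digits("012");             # accepted under /a
```

If the value is headed for arithmetic or an ASCII-only field, the /a variant is probably the safer check.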
The \p{…} property classes#
\p{PROPERTY} matches any character with a named Unicode property.
The property names are drawn from the Unicode character database;
the commonly useful ones are:
- \p{Letter}, shorthand \p{L}: any letter.
- \p{Lu}, \p{Ll}, \p{Lt}: upper, lower, title case letters.
- \p{Number}, \p{N}, \p{Nd}: all numbers, decimal digits.
- \p{Punct}, \p{P}: punctuation.
- \p{Space}, \p{Zs}: whitespace, space separators.
- \p{ASCII}: the 128 ASCII code points.
- \p{Script=Greek}: every character that belongs to the Greek script. Any script name from the Unicode data works here.
- \p{Block=Cyrillic}: every character in the Cyrillic block. (Script and block are different: script is the writing system a character belongs to; block is the code-point range it lives in.)
\P{…} is the negation — any character without that property.
use utf8;
my $s = "Γειά σου, κόσμε";
my @greek_letters = $s =~ /(\p{Script=Greek})/g;
scalar @greek_letters; # 12
The full catalogue of properties ships with Perl; perldoc perluniprops enumerates every name the engine accepts.
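To make the script-versus-block distinction concrete, here is a small sketch that tests two Cyrillic letters against both properties. The helper cyrillic_membership is hypothetical; the two code points were chosen because U+0434 lies inside the Cyrillic block (U+0400 to U+04FF) while U+0500 lies in the separate Cyrillic Supplement block.

```perl
use strict;
use warnings;

# Hypothetical helper: report script and block membership for one character.
sub cyrillic_membership {
    my ($ch) = @_;
    return (
        script => ($ch =~ /\p{Script=Cyrillic}/ ? 1 : 0),
        block  => ($ch =~ /\p{Block=Cyrillic}/  ? 1 : 0),
    );
}

my %de      = cyrillic_membership("\x{0434}");   # CYRILLIC SMALL LETTER DE
my %komi_de = cyrillic_membership("\x{0500}");   # CYRILLIC CAPITAL LETTER KOMI DE

printf "U+0434 script=%d block=%d\n", $de{script},      $de{block};
printf "U+0500 script=%d block=%d\n", $komi_de{script}, $komi_de{block};
```

Both letters belong to the script, but only the first sits in the block, so the script property is usually the one to reach for when the question is "is this Cyrillic text?".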
Modifiers: /a, /u, /l, /aa#
Four modifiers change how character classes interpret themselves:
- /u: Unicode mode. \w matches Unicode word characters, \d matches Unicode decimal digits, and case folding uses Unicode rules. This is the default on text strings.
- /a: ASCII mode. \w becomes [A-Za-z0-9_], \d becomes [0-9], and \s becomes [ \t\n\r\f] (plus the vertical tab from Perl 5.18 on). Case folding is still Unicode: /k/ai still matches U+212A KELVIN SIGN, because that character folds to k, which is almost never what you want when you reached for /a.
- /aa: strict ASCII. /a plus case folding restricted to ASCII, so an ASCII character never matches a non-ASCII one. /k/aai matches K but not U+212A KELVIN SIGN. Use this when you want an identifier match against a known-ASCII specification (HTTP header names, command keywords).
- /l: locale mode. Defers to the current POSIX locale. Almost never what you want in a Unicode-aware program; mentioned only so you can recognise it when another codebase uses it.
use utf8;
"café" =~ /\w+/; # matches whole word (Unicode)
"café" =~ /\w+/a; # matches "caf" (ASCII only)
"FOO" =~ /foo/ai; # matches (ordinary ASCII case fold)
"\x{212A}" =~ /k/ai; # matches: KELVIN SIGN folds to k
"\x{212A}" =~ /k/aai; # does not: strict ASCII
Rule of thumb: default mode is right for text; switch to /aa
when you are parsing an ASCII-specified protocol and want to reject
smuggled non-ASCII characters.
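The rule of thumb can be sketched as follows. Both helpers are hypothetical, and the header-name pattern is a deliberate simplification (word characters and hyphens), not the full token grammar from the HTTP specification.

```perl
use v5.14;      # /a and /aa need Perl 5.14 or later
use warnings;

# Hypothetical helper: simplified header-name check, ASCII only.
sub is_header_name {
    my ($name) = @_;
    return $name =~ /\A[\w-]+\z/aa ? 1 : 0;
}

# Hypothetical helper: case-insensitive keyword match with ASCII-only folding.
sub is_keep_alive {
    my ($value) = @_;
    return $value =~ /\Akeep-alive\z/aai ? 1 : 0;
}

my $spoofed_name  = "Content-Typ\x{0435}";   # ends in CYRILLIC SMALL LETTER IE
my $spoofed_value = "\x{212A}eep-alive";     # starts with KELVIN SIGN

say is_header_name("Content-Type");   # accepted
say is_header_name($spoofed_name);    # rejected under /aa
say is_keep_alive("Keep-Alive");      # accepted
say is_keep_alive($spoofed_value);    # rejected under /aa
```

Without the modifier, both spoofed strings would pass: the Cyrillic letter is a Unicode word character, and the Kelvin sign folds to k under plain /i.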
Case-insensitive matching#
/i folds both sides of the match using Unicode case-folding
tables. This handles the obvious European pairs (é/É,
ß/SS), and the less obvious ones (İ/i̇, ﬁ/fi).
use utf8;
"Ångström" =~ /ångström/i; # matches
"STRASSE" =~ /straße/i; # matches — ß folds to SS
For ASCII-only case folding (the old meaning of /i), use
/iaa.
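Multi-character folds have limits worth knowing about. The sketch below follows the behaviour described in perldoc perlre: a literal fi in the pattern can match the single-character ligature, but not when the two letters are split across groups, and not under /iaa.

```perl
use v5.14;      # /aa needs Perl 5.14 or later
use warnings;

my $ligature = "\x{FB01}";   # LATIN SMALL LIGATURE FI

my $literal = $ligature =~ /fi/i     ? 1 : 0;   # literal "fi" matches the ligature
my $grouped = $ligature =~ /(f)(i)/i ? 1 : 0;   # fold split across groups: no match
my $ascii   = $ligature =~ /fi/iaa   ? 1 : 0;   # ASCII-only folding: no match

say "literal=$literal grouped=$grouped ascii=$ascii";
```

When a pattern has to capture or quantify individual letters, it may be more predictable to fold both strings up front and match without /i.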
Named code points#
\N{…} names a Unicode character by its official name:
use utf8;
use charnames ':full';
"\N{LATIN SMALL LETTER E WITH ACUTE}" eq "é"; # true
"\N{U+2014}" eq "—"; # em-dash
\N{U+HEX} works everywhere. Before Perl 5.16 the long names require
an explicit use charnames ':full'; from 5.16 on the module is loaded
automatically the first time \N{…} is used, and use charnames ':loose'
additionally accepts names written with relaxed case and spacing.
In a pattern, \N{…} is useful when you want the
source to document which character you mean without relying on the
reader to recognise a hex escape.
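As an illustration of named characters inside a pattern, this sketch replaces en and em dashes with an ASCII hyphen. The helper ascii_dashes is hypothetical, and the explicit charnames import is only there so the long names also work on Perls older than 5.16.

```perl
use strict;
use warnings;
use charnames ':full';   # explicit import for Perls before 5.16

# Hypothetical helper: replace en and em dashes with an ASCII hyphen.
sub ascii_dashes {
    my ($text) = @_;
    $text =~ s/[\N{EN DASH}\N{EM DASH}]/-/g;
    return $text;
}

my $out = ascii_dashes("a\x{2013}b\x{2014}c");
print "$out\n";   # a-b-c
```

The character class reads as what it is, whereas [\x{2013}\x{2014}] would send most readers to a code chart.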
Splitting and joining#
split and substr both count in the units of the string they operate
on. On a text string, the count is characters:
use utf8;
my $s = "a,é,b,Ω";
my @parts = split /,/, $s; # ("a", "é", "b", "Ω")
substr($s, 2, 1); # "é"
On a binary string, the same operations count bytes. This is the most common subtle bug: an old codebase decodes input in one function but indexes the result in another that was written before the decode step was introduced, so offsets worked out in bytes get applied to a string of characters.
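A minimal sketch of that mismatch, using the core Encode module to produce both forms of the same data. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The same data as UTF-8 bytes and as decoded text: a , é , b , Ω
my $bytes = encode('UTF-8', "a,\x{E9},b,\x{3A9}");
my $text  = decode('UTF-8', $bytes);

my $from_text  = substr($text,  2, 1);   # the whole character é
my $from_bytes = substr($bytes, 2, 1);   # only the first byte of é (0xC3)

my @parts = split /,/, $text;            # four one-character fields

printf "text:  %d characters, offset 2 is U+%04X\n", length $text,  ord $from_text;
printf "bytes: %d bytes, offset 2 is 0x%02X\n",      length $bytes, ord $from_bytes;
```

The offset 2 is valid in both strings, so nothing fails loudly; the byte version simply hands back half a character.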