---
name: Regex and Unicode properties
---

# Regex and properties

Once your strings are decoded, regex works on characters — and Perl's regex engine knows about Unicode. `\w` matches every Unicode word character, case-insensitive matching folds across scripts, and `\p{…}` gives you access to the full Unicode character database. This chapter covers the habits you need to make that work reliably.

## Characters, not bytes

On a text string, every regex metacharacter speaks in characters:

```perl
use utf8;
my $s = "café";        # 4 characters
$s =~ /^.{4}$/;        # matches
$s =~ /^.{5}$/;        # does not
```

On a binary string, the same metacharacters count bytes. That is usually not what you want — if you find yourself writing a regex against bytes that hold text, decode first.

## `\w`, `\d`, `\s` and friends

In default (Unicode) mode, these shorthands match their Unicode equivalents:

- `\w` — letters, digits, and underscore from every script.
- `\d` — decimal digits from every script (not just `0`–`9`).
- `\s` — whitespace of every kind, including U+00A0 no-break space.

```perl
use utf8;
"café" =~ /^\w+$/;                        # matches
"café" =~ /^[a-z]+$/;                     # does not: é is outside a-z
"\x{0660}\x{0661}\x{0662}" =~ /^\d+$/;    # matches Arabic-Indic digits
```

When you want the old ASCII-only meaning, add the `/a` modifier — see *Modifiers* below.

## The `\p{…}` property classes

`\p{PROPERTY}` matches any character with a named Unicode property. The property names are drawn from the Unicode character database; the commonly useful ones are:

- `\p{Letter}`, shorthand `\p{L}` — any letter.
- `\p{Lu}`, `\p{Ll}`, `\p{Lt}` — upper, lower, title case letters.
- `\p{Number}`, `\p{N}`, `\p{Nd}` — all numbers, decimal digits.
- `\p{Punct}`, `\p{P}` — punctuation.
- `\p{Space}`, `\p{Zs}` — whitespace, space separators.
- `\p{ASCII}` — the 128 ASCII code points.
- `\p{Script=Greek}` — every character that belongs to the Greek script. Any script name from the Unicode data works here.
- `\p{Block=Cyrillic}` — every character in the Cyrillic block.

(Script and block are different: *script* is the writing system a character belongs to; *block* is the code-point range it lives in.)

`\P{…}` is the negation — any character without that property.

```perl
use utf8;
my $s = "Γειά σου, κόσμε";
my @greek_letters = $s =~ /(\p{Script=Greek})/g;
scalar @greek_letters;      # 12
```

The full catalogue of properties ships with Perl; `perldoc perluniprops` enumerates every name the engine accepts.

## Modifiers: `/a`, `/u`, `/l`, `/aa`

Four modifiers change how character classes interpret themselves:

- `/u` — **Unicode mode.** `\w` matches Unicode word characters, `\d` matches Unicode decimal digits, case folding uses Unicode rules. This is the default on text strings.
- `/a` — **ASCII mode.** `\w` becomes `[A-Za-z0-9_]`, `\d` becomes `[0-9]`, `\s` becomes `[ \t\n\r\f]` (plus vertical tab from Perl 5.18 on). Case folding is still Unicode: `/k/ai` still matches U+212A KELVIN SIGN, because that character folds to `k`. That is almost never what you want when you reached for `/a`.
- `/aa` — **strict ASCII.** `/a`, plus case folding restricted to ASCII, so an ASCII character never matches a non-ASCII one. `/k/aai` matches `K` but not U+212A KELVIN SIGN. Use this when you want an identifier match against a known-ASCII specification (HTTP header names, command keywords).
- `/l` — **locale mode.** Defers to the current POSIX locale. Almost never what you want in a Unicode-aware program; mentioned only so you can recognise it when another codebase uses it.

```perl
use utf8;
"café" =~ /\w+/;            # matches whole word (Unicode)
"café" =~ /\w+/a;           # matches "caf" (ASCII only)
"\x{212A}" =~ /k/ai;        # matches: KELVIN SIGN folds to k
"\x{212A}" =~ /k/aai;       # does not: strict ASCII
```

Rule of thumb: default mode is right for text; switch to `/aa` when you are parsing an ASCII-specified protocol and want to reject smuggled non-ASCII characters.

## Case-insensitive matching

`/i` folds both sides of the match using Unicode case-folding tables.
This handles the obvious European pairs (`é`/`É`, `ß`/`SS`), and the less obvious ones (`İ`/`i̇`, `ﬁ`/`fi`).

```perl
use utf8;
"Ångström" =~ /ångström/i;  # matches
"STRASSE"  =~ /straße/i;    # matches: ß folds to ss
```

For ASCII-only case folding (the old meaning of `/i`), use `/iaa`.

## Named code points

`\N{…}` names a Unicode character by its official name:

```perl
use utf8;                   # the literals on the right are UTF-8 source text
use charnames qw(:full);    # explicit load; needed before Perl 5.16
"\N{LATIN SMALL LETTER E WITH ACUTE}" eq "é";   # true
"\N{U+2014}" eq "—";                            # true: em-dash
```

`\N{U+HEX}` works everywhere; the long names require an explicit `use charnames qw(:full)` in Perl versions before 5.16, and `use charnames qw(:loose)` if you want names matched regardless of case and spacing. In a pattern, `\N{…}` is useful when you want the source to document which character you mean without relying on the reader to recognise a hex escape.

## Splitting and joining

[`split`](../../p5/core/perlfunc/split) and [`substr`](../../p5/core/perlfunc/substr) both count in the units of the string they operate on. On a text string, the count is characters:

```perl
use utf8;
my $s = "a,é,b,Ω";
my @parts = split /,/, $s;  # ("a", "é", "b", "Ω")
substr($s, 2, 1);           # "é"
```

On a binary string, the same operations count bytes. This is the most common subtle bug: an old codebase decodes its input in one function but indexes the result in another, written before the decode step was introduced.
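
To make the byte/character difference concrete, here is a minimal sketch that runs the same `length` and `substr` calls on the encoded and decoded forms of one string. It assumes the input is UTF-8 and uses the core `Encode` module; the variable names are illustrative only, and the `encode` call merely stands in for bytes that would normally arrive from a file or socket.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Stand-in for raw input: the UTF-8 bytes of "a,é,b,Ω" (assumed encoding).
my $bytes = encode('UTF-8', "a,é,b,Ω");

# Decode once, at the boundary, and index only the decoded string afterwards.
my $text = decode('UTF-8', $bytes);

my $byte_len  = length $bytes;          # counts bytes: é and Ω take two each
my $char_len  = length $text;           # counts characters
my $byte_at_2 = substr($bytes, 2, 1);   # first byte of the encoded é
my $char_at_2 = substr($text,  2, 1);   # the character é itself
my @parts     = split /,/, $text;       # four single-character fields

printf "bytes: %d, characters: %d\n", $byte_len, $char_len;
printf "offset 2 in bytes: 0x%02X\n", ord $byte_at_2;
printf "offset 2 in text:  U+%04X\n", ord $char_at_2;
printf "fields: %d\n", scalar @parts;
```

The same offset lands on half a character in the byte string and on a whole one in the text string, which is why indexing code should only ever see decoded input.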