---
name: regex and unicode
---

# Unicode

Perl's strings are Unicode strings. Every code point from `\x{0000}` through `\x{10FFFF}` is a valid character, regardless of the underlying byte encoding. Regexps operate on characters, not bytes, so `\w`, `\d`, `\s`, `.`, and `\X` all understand the Unicode world.

This chapter covers how to write Unicode characters in a pattern, how to match by Unicode property, and how the *charset modifiers* `/a`, `/u`, `/l`, `/d` tune the default character-class behaviour.

## Writing Unicode in a pattern

Four ways to denote a specific code point:

| Form            | Example                       | Notes                  |
|-----------------|-------------------------------|------------------------|
| `\xHH`          | `\x41` (A)                    | Two hex digits, 0–255  |
| `\x{…}`         | `\x{263a}` (☺)                | Any code point         |
| `\o{…}`         | `\o{101}` (A)                 | Octal                  |
| `\N{name}`      | `\N{GREEK SMALL LETTER SIGMA}`| Unicode character name |

```perl
/\x{263a}/;                        # WHITE SMILING FACE
/\N{GREEK SMALL LETTER SIGMA}/;    # σ
/\N{U+03C3}/;                      # same code point via U+ form
```

Inside `\p{…}`, case, whitespace, and underscores are ignored. `\N{…}` is stricter: by default the character name must be spelled exactly, and loose spellings such as `\N{greek_small_letter_sigma}` are accepted only under `use charnames ':loose'`.

Inside a character class, the same escapes work:

```perl
/[\x{0370}-\x{03ff}]/;                  # any code point in the Greek and Coptic block
/[\N{SECTION SIGN}\N{PILCROW SIGN}]/;
```

## Matching by property: \p{…} and \P{…}

Unicode classifies every code point by dozens of properties.
The most useful in everyday patterns:

| Property         | Matches                                             |
|------------------|-----------------------------------------------------|
| `\p{L}`          | any letter                                          |
| `\p{Ll}`         | lowercase letter                                    |
| `\p{Lu}`         | uppercase letter                                    |
| `\p{Lt}`         | titlecase letter                                    |
| `\p{N}`          | any numeric character                               |
| `\p{Nd}`         | decimal digit                                       |
| `\p{P}`          | punctuation                                         |
| `\p{M}`          | mark (combining character)                          |
| `\p{S}`          | symbol                                              |
| `\p{Z}`          | separator (including spaces)                        |
| `\p{C}`          | control, format, surrogate, unassigned, private-use |
| `\p{ASCII}`      | code point 0–127                                    |
| `\p{White_Space}`| any Unicode whitespace                              |

`\P{X}` negates `\p{X}`: it matches any code point *not* in property X.

Single-letter properties may drop the braces:

```perl
/\pL/;    # letter — same as \p{L}
/\pN/;    # number
/\pM/;    # mark
```

## Matching by script

Scripts (writing systems) are properties too:

```perl
/\p{Latin}/;
/\p{Greek}/;
/\p{Cyrillic}/;
/\p{Arabic}/;
/\p{Han}/;            # Chinese, Japanese kanji, Korean hanja
/\p{Hiragana}/;
/\p{Katakana}/;
/\p{Hangul}/;
/\p{Script=Greek}/;   # compound form naming the Script property explicitly
```

For linguistic classification, prefer `\p{Script_Extensions=…}`: it handles code points used by multiple scripts more sensibly than `\p{Script=…}`. Since Perl 5.26 the `\p{Greek}` shorthand means `Script_Extensions=Greek`, so it can match a few code points that `\p{Script=Greek}` does not.

## Compound properties

Some properties have multiple values. The compound form `\p{Property=Value}` (or `\p{Property:Value}`) spells them out:

```perl
/\p{General_Category=Uppercase_Letter}/;   # same as \p{Lu}
/\p{Block=Greek_and_Coptic}/;
/\p{Numeric_Type=Decimal}/;
/\p{Bidi_Class=R}/;                        # right-to-left characters
```

Case, spaces, and underscores are ignored in the property name and value, so `\p{gc=lu}` is the same as `\p{General_Category=Uppercase_Letter}`.

## Grapheme clusters: \X

A *grapheme cluster* is what users perceive as a single character. A Danish `Å` can be one code point (`U+00C5`) or two (`A` followed by combining ring `U+030A`). Both look identical on screen.
`\X` matches either form as one atomic user-visible character.

```perl
my $s = "A\x{030A}";    # A + combining ring
length($s);             # 2 — code-point count
$s =~ /./;              # matches just the 'A'
$s =~ /\X/;             # matches the whole cluster
```

Use `\X` when stepping through user-facing text: cursor movement, column counting, truncation.

## The charset modifiers

The flags `/a`, `/u`, `/l`, `/d` control how the shorthand classes (`\d`, `\w`, `\s`) and case folding behave. They are orthogonal to `/i`, `/m`, `/s`, `/x`, but mutually exclusive with each other: only one charset modifier is in effect at a time (`/aa` being a stricter form of `/a`).

### /u — Unicode semantics

`\d`, `\w`, `\s`, `\b`, case folding, and `[[:alpha:]]` use full Unicode definitions. This is the default when `use v5.12` or later is in effect.

```perl
"item\x{0660}" =~ /\w+\d/u;   # matches — U+0660 is an Arabic-Indic zero
"naïve"        =~ /\w+/u;     # matches the whole word
```

### /a — ASCII-only classes

Restricts `\d`, `\w`, `\s`, `[[:…:]]` to the ASCII range. Useful when you *want* English-like text and no more:

```perl
"item\x{0660}" =~ /\w+\d/a;   # does not match — \d is [0-9] under /a
"item0"        =~ /\w+\d/a;   # matches
```

`/aa` (doubled) additionally prevents case-insensitive matches from crossing the ASCII/non-ASCII boundary. Under plain `/ia`, `k` can still match `\x{212A}` (KELVIN SIGN) because the Kelvin sign casefolds to `k`. Under `/iaa` it does not:

```perl
"K"        =~ /k/i;      # matches
"\x{212A}" =~ /k/i;      # matches — Kelvin sign casefolds to 'k'
"\x{212A}" =~ /k/iaa;    # does NOT match — ASCII never matches non-ASCII
```

### /l — locale semantics

`\d`, `\w`, `\s`, case folding, and `[[:alpha:]]` follow the current POSIX locale. Only sensible under `use locale`. Rarely what you want: locale behaviour is implementation-specific and harder to reason about than `/a` or `/u`.

### /d — default pre-5.14 semantics

Pre-Perl-5.14 rules: `\d`, `\w`, `\s` depend on whether the target string (or the pattern) is flagged as UTF-8 internally. This is the source of most "works on my machine" Unicode regexp bugs. Avoid `/d` in new code. The default under `use v5.12` is `/u`.
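The difference between the modifiers is easiest to see by running one character through the same class under each of them. The following is a minimal sketch rather than anything from the original chapter: the helper name `w_matches` is made up for illustration, and it assumes Perl 5.14 or later, where the explicit `/d`, `/u`, and `/a` flags exist.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether \w matches $str under /d, /u and /a.
sub w_matches {
    my ($str) = @_;
    return {
        d => ($str =~ /\w/d ? 1 : 0),
        u => ($str =~ /\w/u ? 1 : 0),
        a => ($str =~ /\w/a ? 1 : 0),
    };
}

my $native = "\xE9";          # é as a single code point below 256
my $upgraded = $native;
utf8::upgrade($upgraded);     # same character, now stored as UTF-8 internally

my $native_result   = w_matches($native);
my $upgraded_result = w_matches($upgraded);

printf "native:   d=%d u=%d a=%d\n", @{$native_result}{qw(d u a)};
printf "upgraded: d=%d u=%d a=%d\n", @{$upgraded_result}{qw(d u a)};
```

The two strings compare equal with `eq`, yet only the `/d` result should change between them; `/u` and `/a` give the same answer for both, which is the practical argument for choosing one of those explicitly.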
## Pragma interactions

Several pragmas affect default regexp behaviour:

```perl
use v5.12;                        # implicit feature 'unicode_strings'
use feature 'unicode_strings';    # explicit
use re '/u';                      # set default modifier for this scope
```

`use feature 'unicode_strings'` (included in `use v5.12`) makes all strings Unicode for character-class purposes, regardless of whether they are marked UTF-8 internally. Turn it on at the top of every file.

## Case folding

Under `/i` with `/u` rules (the default charset in modern Perl), case-insensitive matching uses the full Unicode case-folding tables:

```perl
"Grüße"    =~ /grüsse/iu;     # matches — ß folds to 'ss'
"Σίγμα"    =~ /σίγμα/iu;      # matches — upper Σ ↔ lower σ
"İstanbul" =~ /istanbul/iu;   # does NOT match — İ (U+0130) folds to 'i' + U+0307, not plain 'i'
```

`/a` restricts the character classes only; it does not change case folding, so `/ia` still folds with the full Unicode rules. `/iaa` adds the restriction shown in the Kelvin/K example above: an ASCII character never matches a non-ASCII one.

## The \b{…} word-boundary variants

Beyond plain `\b`, Perl offers semantic variants that handle real-world text better (`\b{wb}`, `\b{sb}`, and `\b{gcb}` since Perl 5.22, `\b{lb}` since 5.24):

- `\b{wb}` — word boundary by Unicode "word break" rules. Handles apostrophes, hyphens, numerics-with-commas, and script-specific rules.
- `\b{sb}` — sentence boundary.
- `\b{lb}` — line break (suitable for wrapping).
- `\b{gcb}` — grapheme cluster boundary (the zero-width positions between what `\X` matches).

```perl
"don't" =~ /.+?\b{wb}/x;   # matches whole word — apostrophe is inside
"don't" =~ /.+?\b/x;       # stops at the apostrophe — plain \b splits
```

For natural-language processing, `\b{wb}` and `\b{sb}` are usually a better fit than plain `\b`.

## Encoding vs. character

A Perl string is a sequence of code points; how those code points are stored as bytes on disk is the I/O layer's concern. Use `use utf8` for source-code literals, `:encoding(UTF-8)` on filehandles, and `decode`/`encode` from Encode when crossing the boundary. A pattern matches whatever characters the string holds, so input that was never decoded is seen as one character per byte.
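To make the boundary concrete, here is a small sketch (not from the original chapter) that matches the same data before and after decoding. The two-byte string is hand-written to stand in for undecoded input; in a real program it would come from a socket or a filehandle without an encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 encoding of σ (U+03C3) is the two bytes CF 83.
my $bytes = "\xCF\x83";
my $chars = decode('UTF-8', $bytes);

my $bytes_len = length $bytes;    # one "character" per byte
my $chars_len = length $chars;    # a single code point

# The pattern sees U+00CF and U+0083 in $bytes, but U+03C3 in $chars.
my $bytes_is_greek = $bytes =~ /\p{Greek}/ ? 1 : 0;
my $chars_is_greek = $chars =~ /\p{Greek}/ ? 1 : 0;

# Encoding the decoded string gives back the original bytes.
my $round_trip = encode('UTF-8', $chars) eq $bytes ? 1 : 0;

print "bytes: length=$bytes_len greek=$bytes_is_greek\n";
print "chars: length=$chars_len greek=$chars_is_greek\n";
```

Only the decoded string should report a Greek character. Setting the encoding layer on the filehandle performs the same decoding step automatically for every line read.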
```perl
use utf8;                                # source code is UTF-8
use feature 'unicode_strings';
use open ':std', ':encoding(UTF-8)';     # STDIN/STDOUT/STDERR

while (my $line = <>) {
    $line =~ /\p{Lu}/ and print "has uppercase letter\n";
}
```

Get the I/O right once, and patterns just work on characters for the rest of the program.

## See also

- [`perlre`](../../p5/core/perlre) — full property reference.
- [`quotemeta`](../../p5/core/perlfunc/quotemeta) — Unicode-aware metacharacter escaping.
- The [character classes](character-classes) chapter — non-Unicode class basics.
- The [modifiers](modifiers) chapter — `/i`, `/x`, other modifiers that combine with charset modifiers.