# Unicode

Perl’s strings are Unicode strings. Every code point from `\x{0000}` through `\x{10FFFF}` is a valid character, regardless of the underlying byte encoding. Regexps operate on characters, not bytes — so `\w`, `\d`, `\s`, `.`, and `\X` all understand the Unicode world.

This chapter covers how to write Unicode characters in a pattern, how to match by Unicode property, and how the *charset modifiers* `/a`, `/u`, `/l`, `/d` tune the default character-class behaviour. It closes with *Script Runs*, the mechanism for rejecting strings that mix scripts when they shouldn’t.

## Writing Unicode in a pattern

Four ways to denote a specific code point:

| Form       | Example                        | Notes                  |
|------------|--------------------------------|------------------------|
| `\xHH`     | `\x41` (A)                     | Two hex digits, 0–255  |
| `\x{…}`    | `\x{263a}` (☺)                 | Any code point         |
| `\o{…}`    | `\o{101}` (A)                  | Octal                  |
| `\N{name}` | `\N{GREEK SMALL LETTER SIGMA}` | Unicode character name |

```perl
/\x{263a}/;                        # WHITE SMILING FACE
/\N{GREEK SMALL LETTER SIGMA}/;    # σ
/\N{U+03C3}/;                      # same code point via U+ form
```

Case, whitespace, and underscores inside `\p{…}` are ignored. Names inside `\N{…}` are matched exactly by default; `use charnames ':loose'` relaxes that, so that `\N{greek_small_letter_sigma}` is accepted alongside `\N{GREEK SMALL LETTER SIGMA}`.

Inside a character class, the same escapes work:

```perl
/[\x{0370}-\x{03ff}]/;                   # any Greek block character
/[\N{SECTION SIGN}\N{PILCROW SIGN}]/;
```

`\N{NAME}` (named character) is *not* the same as `\N` (any character except `\n` — see the [character classes](character-classes.md) chapter). The presence or absence of the brace selects between them.

## Matching by property: `\p{…}` and `\P{…}`

Unicode classifies every code point by dozens of properties.
The most useful in everyday patterns:

| Property          | Matches                                  |
|-------------------|------------------------------------------|
| `\p{L}`           | any letter                               |
| `\p{Ll}`          | lowercase letter                         |
| `\p{Lu}`          | uppercase letter                         |
| `\p{Lt}`          | titlecase letter                         |
| `\p{N}`           | any numeric character                    |
| `\p{Nd}`          | decimal digit                            |
| `\p{P}`           | punctuation                              |
| `\p{M}`           | mark (combining character)               |
| `\p{S}`           | symbol                                   |
| `\p{Z}`           | separator (including spaces)             |
| `\p{C}`           | control, format, unassigned, private-use |
| `\p{ASCII}`       | code point 0–127                         |
| `\p{White_Space}` | any Unicode whitespace                   |

`\P{X}` is the negation of `\p{X}` — matches any code point *not* in property X.

Short single-letter properties drop the braces:

```perl
/\pL/;    # letter — same as \p{L}
/\pN/;    # number
/\pM/;    # mark
```

### Compound properties

Some properties have multiple values. The compound form `\p{Property=Value}` (or `\p{Property:Value}`) spells them out:

```perl
/\p{General_Category=Uppercase_Letter}/;   # same as \p{Lu}
/\p{Block=Greek_and_Coptic}/;
/\p{Numeric_Type=Decimal}/;
/\p{Bidi_Class=R}/;                        # right-to-left characters
```

Case, spaces, and underscores are ignored in the property name and value, so `\p{gc=lu}` is the same as `\p{General_Category=Uppercase_Letter}`.

For the full list of properties and values, see [`perluniprops`](https://perldoc.perl.org/perluniprops) — that list is generated from Unicode data and is not duplicated here.

## Matching by script

Scripts (writing systems) are properties too:

```perl
/\p{Latin}/;
/\p{Greek}/;
/\p{Cyrillic}/;
/\p{Arabic}/;
/\p{Han}/;             # Chinese, Japanese kanji, Korean hanja
/\p{Hiragana}/;
/\p{Katakana}/;
/\p{Hangul}/;
/\p{Script=Greek}/;    # compound form, same meaning
```

For linguistic classification, prefer `\p{Script_Extensions=…}` — it handles code points used by multiple scripts more sensibly than `\p{Script=…}`. Perl’s `\p{Greek}` shorthand actually uses `Script_Extensions` under the hood.

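
To make the property and script escapes concrete, here is a minimal sketch that counts how many characters of a string fall into a few scripts. The helper name `script_counts`, the `%script_re` table, and the choice of scripts are illustrative assumptions, not part of any API.

```perl
use strict;
use warnings;

# Hypothetical lookup table: one precompiled pattern per script of interest.
my %script_re = (
    Latin    => qr/\p{Script_Extensions=Latin}/,
    Greek    => qr/\p{Script_Extensions=Greek}/,
    Cyrillic => qr/\p{Script_Extensions=Cyrillic}/,
);

# Hypothetical helper: count the characters of $text that belong to each script.
sub script_counts {
    my ($text) = @_;
    my %count;
    for my $script (keys %script_re) {
        my $re = $script_re{$script};
        # "() =" forces list context; the outer scalar assignment counts the matches.
        my $n = () = $text =~ /$re/g;
        $count{$script} = $n;
    }
    return \%count;
}

# "paypal" with a Cyrillic U+0430 in place of the first Latin "a".
my $counts = script_counts("p\x{0430}ypal");
printf "%-8s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

On the sample string this reports five Latin characters, one Cyrillic character, and no Greek ones, which is the kind of per-script tally that makes a mixed-script string stand out.
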
## Grapheme clusters: `\X`

A *grapheme cluster* is what users perceive as a single character. A Danish `å` can be one code point (`U+00E5`) or two (`a` followed by combining ring `U+030A`). Both look identical on screen. `\X` matches either — one atomic user-visible character.

```perl
my $s = "a\x{030A}";    # a + combining ring
length($s);             # 2 — code-point count
$s =~ /./;              # matches just the 'a'
$s =~ /\X/;             # matches the whole cluster
```

Use `\X` when stepping through user-facing text: cursor movement, column counting, truncation, anywhere «characters» means «what the user sees».

## Extended bracketed character classes — `(?[…])`

The standard `[…]` form takes unions of characters; `(?[…])` adds set-algebra operators that combine Unicode properties:

```perl
# Letters that are Greek:
/(?[ \p{Letter} & \p{Greek} ])/;

# Letters that are not Latin:
/(?[ \p{Letter} - \p{Latin} ])/;

# Hex digits without 'a' through 'f':
/(?[ [0-9A-Fa-f] - [a-f] ])/;
```

Operators: `+` (or `|`) for union, `&` for intersection, `-` for difference, `^` for symmetric difference, `!` for complement.

The construct is itself a character class — it matches one character — and quantifies normally. The [character classes](character-classes.md) chapter covers it in full.

## The charset modifiers

The flags `/a`, `/u`, `/l`, `/d` control how the shorthand classes (`\d`, `\w`, `\s`) and case folding behave. They are orthogonal to `/i`, `/m`, `/s`, `/x`. At most one can be in effect at a time; `/aa` is a refinement of `/a`.

### `/u` — Unicode semantics

`\d`, `\w`, `\s`, `\b`, case folding, and `[[:alpha:]]` use full Unicode definitions. This is the default when `use v5.12` or later is in effect.

```perl
"item\x{0660}" =~ /\w+\d/u;    # matches — U+0660 is an Arabic-Indic zero
"naïve" =~ /\w+/u;             # matches the whole word
```

Modern code should use `/u`. `use v5.12+` selects it automatically.

### `/a` — ASCII-only classes

Restricts `\d`, `\w`, `\s`, `[[:…:]]` to the ASCII range.
Useful when you *want* English-like text and no more:

```perl
"item\x{0660}" =~ /\w+\d/a;    # does not match — \d is [0-9] under /a
"item0" =~ /\w+\d/a;           # matches
```

### `/aa` — strict ASCII

`/aa` (doubled) additionally prevents case-insensitive matches crossing the ASCII/non-ASCII boundary. Under plain `/ia`, `K` could match `\x{212A}` (KELVIN SIGN) because the Kelvin sign casefolds to `k`. Under `/iaa` it does not:

```perl
"K" =~ /k/i;                # matches
"\x{212A}" =~ /k/i;         # matches — Kelvin sign casefolds to 'k'
"\x{212A}" =~ /k/iaa;       # does NOT match — ASCII-only i
```

`/aa` is the strict form when the input is genuinely ASCII-only and you want the engine to refuse Unicode matches even under `/i`.

### `/l` — locale semantics

`\d`, `\w`, `\s`, case folding, and `[[:alpha:]]` follow the current POSIX locale. Only sensible after `use locale`.

Rarely what you want — locale behaviour is implementation-specific and harder to reason about than `/a` or `/u`. Provided for legacy code that genuinely needs locale-dependent matching.

### `/d` — dual-mode semantics

Under `/d`, `\d`, `\w`, `\s` depend on whether the target string is flagged as UTF-8 internally. This is the source of most «works on my machine» Unicode regexp bugs.

**Avoid `/d` in new code.** The default under `use v5.12` is `/u`. If you inherit code where `/d` matters, the failure mode is: the same input string produces different match results depending on whether some upstream operation flagged the string UTF-8. The fix is `use feature 'unicode_strings'` (or `use v5.12+`), which makes the rules independent of the UTF-8 flag.

### Which modifier is in effect?

Resolution order:

1. Explicit modifier on the regex (`/a`, `/u`, `/l`, `/d`).
2. `use re '/u'` (or other) in the lexical scope.
3. `use locale` selects `/l`.
4. `use feature 'unicode_strings'` (and `use v5.12+`) selects `/u`.
5. Otherwise, `/d` (the dual-mode fallback; avoid).

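
The difference between the fallback `/d` and an explicit modifier is easiest to see on a single non-ASCII character. The sketch below deliberately leaves out `use v5.12`, so unmodified patterns fall through to `/d`; the variable names are illustrative only.

```perl
use strict;
use warnings;
# Deliberately no "use v5.12" here, so patterns without a modifier use /d.

my $e = "\xE9";                        # e-acute, not UTF-8 flagged internally

my $d_before = $e =~ /\w/  ? 1 : 0;    # /d on an unflagged string: ASCII rules
my $u_before = $e =~ /\w/u ? 1 : 0;    # explicit /u: Unicode rules
my $a_before = $e =~ /\w/a ? 1 : 0;    # explicit /a: ASCII rules

utf8::upgrade($e);                     # same character, now UTF-8 flagged

my $d_after  = $e =~ /\w/  ? 1 : 0;    # /d now follows Unicode rules
my $u_after  = $e =~ /\w/u ? 1 : 0;    # unchanged
my $a_after  = $e =~ /\w/a ? 1 : 0;    # unchanged

print "/d: $d_before -> $d_after\n";
print "/u: $u_before -> $u_after\n";
print "/a: $a_before -> $a_after\n";
```

Only the `/d` result changes across the `utf8::upgrade` call; the explicit `/u` and `/a` patterns give the same answer whichever way the string is stored.
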

Put `use v5.12;` (or later) at the top of each file; this gives you `/u`, which is what you want.

## Pragma interactions

```perl
use v5.12;                        # implicit feature 'unicode_strings'
use feature 'unicode_strings';    # explicit
use re '/u';                      # set default modifier for this scope
```

`use feature 'unicode_strings'` (included in `use v5.12`) makes all strings Unicode for character-class purposes, regardless of whether they are marked UTF-8 internally. Turn it on at the top of every file.

## Case folding

Under `/iu` (the default under `use v5.12`), case-insensitive matching uses the full Unicode case-folding tables:

```perl
"Grüße" =~ /grüsse/iu;                # matches — ß folds to 'ss'
"Σίγμα" =~ /σίγμα/iu;                 # matches — upper Σ ↔ lower σ
"İstanbul" =~ /i\x{307}stanbul/iu;    # matches — İ (U+0130) folds to 'i' + U+0307, not to a bare 'i'
```

Under `/ia`, folding still follows the Unicode rules; only the shorthand and POSIX classes are restricted to ASCII. Under `/iaa`, an ASCII character never matches a non-ASCII one, which is the Kelvin/K rule above.

The folding happens at compile time — pperl does not lowercase the input string at match time.

## The `\b{…}` word-boundary variants

Beyond plain `\b`, Perl offers semantic variants that handle real-world text better:

| Boundary   | Meaning                                                   |
|------------|-----------------------------------------------------------|
| `\b{wb}`   | Unicode word boundary — handles `don't`                   |
| `\b{sb}`   | sentence boundary                                         |
| `\b{lb}`   | line break (suitable for line wrapping)                   |
| `\b{gcb}`  | grapheme cluster boundary (same segmentation as `\X`)     |

```perl
"don't" =~ /.+?\b{wb}/x;    # matches whole word — apostrophe is inside
"don't" =~ /.+?\b/x;        # stops at the apostrophe — plain \b splits
```

For natural-language processing, `\b{wb}` and `\b{sb}` are almost always what you want. Plain `\b` is for ASCII identifier contexts.

The [anchors and assertions](anchors-and-assertions.md) chapter has the broader boundary discussion; this chapter is the home for the Unicode-aware variants because their definitions are Unicode-driven.

## Script runs

Different writing systems have visually-similar characters.
The Latin `a`, the Cyrillic `а` (U+0430), and the Greek `α` (U+03B1) can look alike, and the first two are indistinguishable in most fonts. Used together — typically in a URL or identifier — they enable a *homograph attack*:

> paypal.com

This could be all Latin (legitimate), or it could hide a Cyrillic `а` in otherwise-Latin text (a phishing attack pointing to a different domain). The URL renders identically in the browser; the underlying code points differ.

A *script run* is a sequence of characters all from the same Unicode script (per `Script_Extensions` and Unicode UTS 39). Most legitimate words are script runs; mixed-script strings are rare outside specific multilingual contexts and almost always suspicious.

Perl provides two constructs, each with a short alias, to require that a matched sub-pattern be a script run:

| Form                                      | Meaning                                      |
|-------------------------------------------|----------------------------------------------|
| `(*script_run:PAT)` / `(*sr:PAT)`         | match `PAT`, then check it is a script run   |
| `(*atomic_script_run:PAT)` / `(*asr:PAT)` | same, but atomic (faster, less backtracking) |

```perl
# Match a domain label only if it is a script run.
# The anchors matter: unanchored, \w+ can backtrack to a shorter,
# single-script piece of a mixed-script label and still match.
$label =~ /\A (*sr: \w+ ) \z/x
    or warn "mixed-script label";

# Atomic version: faster on adversarial input.
$label =~ /\A (*asr: \w+ ) \z/x
    or warn "mixed-script label";
```

The atomic form is what you want in most production code — it prevents the script-run check from being defeated by catastrophic backtracking on a malicious input. `(*sr:…)` is the non-atomic equivalent (`(*sr:(?>PAT))` is the same as `(*asr:PAT)`).

### Script run rules

A sequence is a script run if and only if all of:

1. **No code point in the sequence has the `Script_Extensions` property `Unknown`.** This excludes private-use and surrogate code points; you cannot smuggle them through.
2. **All characters come from the Common script, the Inherited script, and at most one other script.**
3. **All decimal digits come from the same set of ten.** Many scripts have their own digits (Arabic-Indic, Devanagari, etc.); a string mixing `1` (ASCII) with `١` (Arabic-Indic one) is *not* a script run, even if other characters are compatible.

Three pseudo-scripts get special handling:

- **Common** — punctuation, ASCII digits 0–9, mathematical symbols, emoji, full-width digits. These can appear in any script run *except* that all decimal digits in the sequence must still come from the same set of ten.
- **Inherited** — combining marks (accents, diacritics). These attach to the previous character and inherit its script.
- **Unknown** — unassigned code points and similar. A single-character string of unknown is allowed; a longer string containing one is not.

The relaxation for Common and Inherited is what allows real-world text to be a script run despite punctuation and combining marks. The strict-equality rule for digits is what defeats homograph attacks that mix digits from look-alike scripts.

### When script runs help

- **URL parsing**: refuse domain labels that are not script runs.
- **Identifier validation**: programming-language identifiers should normally be script runs.
- **Form input sanitisation**: most real names are script runs; mixed-script names are usually attacks or test data.

Script runs arrived in Perl 5.28 (2018). They are what Perl’s regex engine adds to a problem the rest of the engine cannot solve directly: a property of the *whole* match, not of any single character.

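
As a small sketch of how the rules above play out, the helper below checks whole labels with an anchored atomic script run. The name `is_single_script_label` and the sample labels are assumptions made for illustration; the code expects Perl 5.28 or later (5.28 and 5.30 emit an experimental-feature warning for script runs).

```perl
use v5.28;        # script runs need Perl 5.28+; also enables strict
use warnings;

# Hypothetical helper: true if $label consists only of word characters
# that together form a single script run.
sub is_single_script_label {
    my ($label) = @_;
    # \A and \z pin the run to the whole string; without them a shorter,
    # single-script substring would be enough for the match to succeed.
    return $label =~ /\A (*asr: \w+ ) \z/x ? 1 : 0;
}

my @labels = (
    "paypal",           # all Latin
    "p\x{0430}ypal",    # Latin with a Cyrillic U+0430
    "abc123",           # Latin plus Common (ASCII digits)
    "1\x{0661}",        # ASCII digit next to an Arabic-Indic digit
);

for my $label (@labels) {
    # Show code points rather than the raw string to keep the output ASCII.
    my $shown = join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $label);
    printf "%-42s %s\n", $shown,
        is_single_script_label($label) ? "script run" : "mixed";
}
```

The first and third labels pass. The second fails rule 2 (two scripts besides Common and Inherited), and the last fails rule 3 (digits from two different sets of ten).
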
## Cross-engine: Unicode coverage

| Feature                    | Perl 5.42   | PCRE2                  | Emacs (limited) | POSIX | RE2 / Go         |
|----------------------------|-------------|------------------------|-----------------|-------|------------------|
| `\p{General_Category}`     | yes         | yes                    | partial         | no    | yes              |
| `\p{Script}`               | yes         | yes                    | partial         | no    | yes              |
| `\p{Property=Value}`       | yes         | yes                    | partial         | no    | no               |
| `\X` grapheme cluster      | yes         | yes                    | no              | no    | no               |
| `\b{wb}`, `\b{sb}`         | yes         | no                     | no              | no    | no               |
| Script Runs `(*sr:…)`      | yes         | yes (10.33+)           | no              | no    | no               |
| Default to Unicode `\d/\w` | under `/u`  | no (needs `PCRE2_UCP`) | no              | no    | no (use `\p{…}`) |

Perl is a leader here: the `\b{wb}` family is unique to Perl among the comparison set, and script runs are shared only with PCRE2 (10.33 and later), which adopted Perl’s syntax. RE2 / Go and PCRE2 both default to ASCII for `\d`, `\w`, `\s`; PCRE2 switches them to Unicode with the `PCRE2_UCP` option (or `(*UCP)` at the start of a pattern), while RE2 / Go keep them ASCII and leave Unicode matching to `\p{…}`. PCRE2 has property support but lacks the boundary variants. Emacs has property support that varies by build.

The full table is in the [cross-engine](cross-engine.md) chapter.

## Encoding vs. character

A Perl string is always a sequence of code points. The bytes on disk are the I/O layer’s concern — use `use utf8` for source-code literals, `:encoding(UTF-8)` on filehandles, and `decode`/`encode` from `Encode` when crossing the boundary. Patterns never operate on the encoded bytes; they always operate on code points.

```perl
use utf8;                               # source code is UTF-8
use feature 'unicode_strings';
use open ':std', ':encoding(UTF-8)';    # STDIN/STDOUT/STDERR

while (my $line = <>) {
    $line =~ /\p{Lu}/ and print "has uppercase letter\n";
}
```

Get the I/O right once, and patterns just work on characters for the rest of the program.

## See also

- The [character classes](character-classes.md) chapter — non-Unicode class basics, `(?[…])`.
- The [anchors and assertions](anchors-and-assertions.md) chapter — `\b`, `\B`, plus the `\b{…}` family in their boundary role.
- The [modifiers](modifiers.md) chapter — `/i`, `/x`, `/m`, `/s`, and how they combine with charset modifiers.
- The [cross-engine](cross-engine.md) chapter — Unicode coverage across regex engines.
- [`quotemeta`](../../p5/core/perlfunc/quotemeta.md) — Unicode-aware metacharacter escaping.