Unicode#

Perl’s strings are Unicode strings. Every code point from \x{0000} through \x{10FFFF} is a valid character, regardless of the underlying byte encoding. Regexps operate on characters, not bytes - so \w, \d, \s, ., and \X all understand the Unicode world.

This chapter covers how to write Unicode characters in a pattern, how to match by Unicode property, and how the charset modifiers /a, /u, /l, /d tune the default character-class behaviour. It closes with Script Runs, the mechanism for rejecting strings that mix scripts when they shouldn’t.

Writing Unicode in a pattern#

Four ways to denote a specific code point:

Form	Example	Notes
`\xHH`	`\x41` (A)	Two hex digits, 0–255
`\x{…}`	`\x{263a}` (☺)	Any code point
`\o{…}`	`\o{101}` (A)	Octal
`\N{name}`	`\N{GREEK SMALL LETTER SIGMA}`	Unicode character name

/\x{263a}/;                 # WHITE SMILING FACE
/\N{GREEK SMALL LETTER SIGMA}/;  # σ
/\N{U+03C3}/;               # same codepoint via U+ form

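Inside \p{…}, case, whitespace, and underscores are ignored. Character names inside \N{…} must be spelled exactly by default; after use charnames ':loose' the same loose matching applies, so you can also write \N{greek small letter sigma} or \N{greek_small_letter_sigma}.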

Inside a character class, the same escapes work:

/[\x{0370}-\x{03ff}]/;       # any Greek block character
/[\N{SECTION SIGN}\N{PILCROW SIGN}]/;

\N{NAME} (named character) is not the same as \N (any character except \n - see the character classes chapter). The presence or absence of the brace is what selects.
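As a quick check that the notations really are interchangeable, the sketch below compiles five spellings of U+0041 and tests each against the string "A". The variable names are illustrative only, and the snippet assumes Perl 5.16 or later, where \N{…} names load without an explicit use charnames.

```perl
use v5.16;      # \N{...} names are loaded automatically from 5.16 on
use warnings;

# Five spellings of the same code point, U+0041.
my @forms = (
    qr/\A\x41\z/,                         # two hex digits
    qr/\A\x{41}\z/,                       # braced hex
    qr/\A\o{101}\z/,                      # braced octal
    qr/\A\N{LATIN CAPITAL LETTER A}\z/,   # character name
    qr/\A\N{U+41}\z/,                     # U+ form
);

# True only if no pattern in the list fails to match "A".
my $all_match = !grep { "A" !~ $_ } @forms;

print "all five spellings match 'A': ", ($all_match ? "yes" : "no"), "\n";
```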

Matching by property: `\p{…}` and `\P{…}`#

Unicode classifies every code point by dozens of properties. The most useful in everyday patterns:

Property	Matches
`\p{L}`	any letter
`\p{Ll}`	lowercase letter
`\p{Lu}`	uppercase letter
`\p{Lt}`	titlecase letter
`\p{N}`	any numeric character
`\p{Nd}`	decimal digit
`\p{P}`	punctuation
`\p{M}`	mark (combining character)
`\p{S}`	symbol
`\p{Z}`	separator (including spaces)
`\p{C}`	control, format, unassigned, private-use
`\p{ASCII}`	code point 0–127
`\p{White_Space}`	any Unicode whitespace

\P{X} is the negation of \p{X} - matches any code point not in property X.

Short single-letter properties drop the braces:

/\pL/;    # letter - same as \p{L}
/\pN/;    # number
/\pM/;    # mark
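To see the general categories at work, here is a small sketch that counts characters by property. The sample string and variable names are made up for illustration; the text is written with escapes so the result does not depend on the source file encoding.

```perl
use strict;
use warnings;

# "Café no. ٤2!" spelled with escapes: 12 code points in total.
my $text = "Caf\x{E9} no. \x{0664}2!";

# The "= () =" idiom counts matches without keeping them.
my $letters     = () = $text =~ /\p{L}/g;    # C a f é n o
my $digits      = () = $text =~ /\p{Nd}/g;   # ARABIC-INDIC DIGIT FOUR and ASCII 2
my $punctuation = () = $text =~ /\p{P}/g;    # "." and "!"
my $non_letters = () = $text =~ /\P{L}/g;    # everything \p{L} rejects
my $short_form  = () = $text =~ /\pL/g;      # brace-less spelling of \p{L}

printf "letters=%d digits=%d punctuation=%d non-letters=%d\n",
    $letters, $digits, $punctuation, $non_letters;
```

Because \P{L} is the exact complement of \p{L}, the two counts always add up to length($text).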

Compound properties#

Some properties have multiple values. The compound form \p{Property=Value} (or \p{Property:Value}) spells them out:

/\p{General_Category=Uppercase_Letter}/;    # same as \p{Lu}
/\p{Block=Greek_and_Coptic}/;
/\p{Numeric_Type=Decimal}/;
/\p{Bidi_Class=R}/;                          # right-to-left characters

Case, spaces, and underscores are ignored in the property name and value, so \p{gc=lu} is the same as \p{General_Category=Uppercase_Letter}.
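The equivalence of the spellings can be checked directly. This sketch, with illustrative names, runs four ways of writing "uppercase letter" against two Latin and two Greek characters and records which patterns match.

```perl
use strict;
use warnings;

# Four spellings of General_Category=Uppercase_Letter.
my @spellings = (
    qr/\A\p{Lu}\z/,
    qr/\A\p{gc=lu}\z/,
    qr/\A\p{GC:Lu}\z/,
    qr/\A\p{General_Category=Uppercase_Letter}\z/,
);

# For each test character, build a string such as "1111" or "0000":
# one digit per spelling, 1 meaning "matched".
my %verdict;
for my $char ("A", "a", "\x{0394}", "\x{03B4}") {    # A, a, Δ, δ
    my $key = sprintf "U+%04X", ord $char;
    $verdict{$key} = join "", map { $char =~ $_ ? 1 : 0 } @spellings;
}

print "$_ => $verdict{$_}\n" for sort keys %verdict;
```

If the spellings are truly equivalent, every value is either all ones or all zeros.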

For the full list of properties and values, see perluniprops - that list is generated from Unicode data and is not duplicated here.

Matching by script#

Scripts (writing systems) are properties too:

/\p{Latin}/;
/\p{Greek}/;
/\p{Cyrillic}/;
/\p{Arabic}/;
/\p{Han}/;            # Chinese, Japanese kanji, Korean hanja
/\p{Hiragana}/;
/\p{Katakana}/;
/\p{Hangul}/;
/\p{Script=Greek}/;   # compound form, same meaning

For linguistic classification, prefer \p{Script_Extensions=…} - it handles code points used by multiple scripts more sensibly than \p{Script=…}. Perl’s \p{Greek} shorthand actually uses Script_Extensions under the hood.
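The difference between the two properties shows up on characters shared by several scripts. The sketch below uses U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK), which, as far as the Unicode data files go, has Script=Common but lists both Hiragana and Katakana in Script_Extensions. Variable names are illustrative.

```perl
use strict;
use warnings;

my $mark = "\x{30FC}";    # KATAKANA-HIRAGANA PROLONGED SOUND MARK

# Script gives every code point exactly one value.
my $script_is_common   = $mark =~ /\A\p{Script=Common}\z/   ? 1 : 0;
my $script_is_katakana = $mark =~ /\A\p{Script=Katakana}\z/ ? 1 : 0;

# Script_Extensions lists every script the character is used with.
my $scx_has_katakana = $mark =~ /\A\p{Script_Extensions=Katakana}\z/ ? 1 : 0;
my $scx_has_hiragana = $mark =~ /\A\p{Script_Extensions=Hiragana}\z/ ? 1 : 0;

print "Script: Common=$script_is_common Katakana=$script_is_katakana\n";
print "Script_Extensions: Katakana=$scx_has_katakana Hiragana=$scx_has_hiragana\n";
```

A pattern meant to accept Japanese kana text would wrongly reject this mark with \p{Script=…}, which is the practical reason to prefer Script_Extensions.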

Grapheme clusters: `\X`#

A grapheme cluster is what users perceive as a single character. A Danish å can be one code point (U+00E5) or two (a followed by the combining ring U+030A). Both look identical on screen. \X matches either - one atomic user-visible character.

my $s = "a\x{030A}";     # a + combining ring
length($s);              # 2  - code-point count
$s =~ /./;               # matches just the 'a'
$s =~ /\X/;              # matches the whole cluster

Use \X when iterating visibly across user-facing text: cursor movement, column counting, truncation, anywhere "characters" means "what the user sees".
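Truncation is a good illustration. The following is a minimal sketch, not a library routine: truncate_graphemes is a hypothetical helper that keeps at most a given number of grapheme clusters, compared against a plain substr on the same decomposed string.

```perl
use strict;
use warnings;

# "ångström" with decomposed accents: 10 code points, 8 visible characters.
my $text = "a\x{030A}ngstro\x{0308}m";

my @clusters = $text =~ /\X/g;    # one list element per grapheme cluster

# Hypothetical helper: keep at most $max user-visible characters.
sub truncate_graphemes {
    my ($string, $max) = @_;
    my ($head) = $string =~ /\A(\X{0,$max})/;
    return $head;
}

my $by_code_point = substr($text, 0, 2);            # "a" plus the ring: one visible character
my $by_grapheme   = truncate_graphemes($text, 2);   # "å" plus "n": two visible characters

printf "code points: %d, clusters: %d\n", length($text), scalar @clusters;
printf "substr keeps %d cluster(s), \\X keeps %d\n",
    scalar(() = $by_code_point =~ /\X/g),
    scalar(() = $by_grapheme   =~ /\X/g);
```

Cutting by code point can also split a cluster in the middle, leaving a base letter without its accent; cutting with \X cannot.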

Extended bracketed character classes - `(?[…])`#

The standard […] form takes unions of characters; (?[…]) adds set-algebra operators that combine Unicode properties:

# Letters that are Greek:
/(?[ \p{Letter} & \p{Greek} ])/;

# Letters that are not Latin:
/(?[ \p{Letter} - \p{Latin} ])/;

# Hex digits without 'a' through 'f':
/(?[ [0-9A-Fa-f] - [a-f] ])/;

Operators: + (or |) for union, & for intersection, - for difference, ^ for symmetric difference, ! for complement. The construct is itself a character class - it matches one character - and quantifies normally. The character classes chapter covers it in full.
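The three patterns above can be exercised as follows. This is a sketch with illustrative variable names; note that, to my knowledge, (?[…]) was still flagged as experimental before Perl 5.36, so older interpreters print a warning when compiling these patterns.

```perl
use strict;
use warnings;

my $greek_letter     = qr/\A(?[ \p{Letter} & \p{Greek} ])\z/;
my $non_latin_letter = qr/\A(?[ \p{Letter} - \p{Latin} ])\z/;
my $upper_hex        = qr/\A(?[ [0-9A-Fa-f] - [a-f] ])+\z/;   # quantified like any class

my $sigma = "\x{03C3}";    # GREEK SMALL LETTER SIGMA

my %result = (
    sigma_is_greek_letter => ($sigma   =~ $greek_letter     ? 1 : 0),
    a_is_greek_letter     => ("a"      =~ $greek_letter     ? 1 : 0),
    sigma_is_non_latin    => ($sigma   =~ $non_latin_letter ? 1 : 0),
    a_is_non_latin        => ("a"      =~ $non_latin_letter ? 1 : 0),
    digit_is_non_latin    => ("1"      =~ $non_latin_letter ? 1 : 0),   # not a letter at all
    DEAD01_is_upper_hex   => ("DEAD01" =~ $upper_hex        ? 1 : 0),
    dead_is_upper_hex     => ("dead"   =~ $upper_hex        ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```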

The charset modifiers#

The flags /a, /u, /l, /d control how the shorthand classes (\d, \w, \s) and case folding behave. They are orthogonal to /i, /m, /s, /x. At most one can be in effect at a time; /aa is a refinement of /a.

`/u` - Unicode semantics#

\d, \w, \s, \b, case folding, and [[:alpha:]] use full Unicode definitions. This is the default when use v5.12 or later is in effect.

"item\x{0660}" =~ /\w+\d/u;   # matches - U+0660 is an Arabic-Indic zero
"naïve" =~ /\w+/u;            # matches the whole word

Modern code should use /u. use v5.12+ selects it automatically.

`/a` - ASCII-only classes#

Restricts \d, \w, \s, [[:…:]] to the ASCII range. Useful when you want English-like text and no more:

"item\x{0660}" =~ /\w+\d/a;   # does not match - \d is [0-9] under /a
"item0"        =~ /\w+\d/a;   # matches

`/aa` - strict ASCII#

/aa (doubled) additionally prevents case-insensitive matches crossing the ASCII/non-ASCII boundary. Under plain /ia, K could match \x{212A} (KELVIN SIGN) because the Kelvin sign casefolds to k. Under /iaa it does not:

"K"        =~ /k/i;      # matches
"\x{212A}" =~ /k/i;      # matches - Kelvin sign casefolds to 'k'
"\x{212A}" =~ /k/iaa;    # does NOT match - ASCII-only i

/aa is the strict form when the input is genuinely ASCII-only and you want the engine to refuse Unicode matches even under /i.
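The choice between Unicode and ASCII classes matters most when a match serves as validation for a later step. A minimal sketch, assuming a hypothetical $field that is meant to be used as a number afterwards:

```perl
use strict;
use warnings;

my $field = "\x{0661}\x{0662}";    # ARABIC-INDIC DIGIT ONE, DIGIT TWO

my $passes_unicode = $field =~ /\A\d+\z/u ? 1 : 0;   # \d accepts any decimal digit
my $passes_ascii   = $field =~ /\A\d+\z/a ? 1 : 0;   # \d is [0-9] only

# Perl's string-to-number conversion only understands ASCII digits,
# so a field that passed the /u check still converts to 0.
my $as_number = do { no warnings 'numeric'; $field + 0 };

print "/u accepts: $passes_unicode, /a accepts: $passes_ascii, numeric value: $as_number\n";
```

When the matched text feeds arithmetic, a database integer column, or a protocol that expects ASCII digits, /a (or an explicit [0-9]) states the intent more precisely than \d under /u.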

`/l` - locale semantics#

\d, \w, \s, case folding, and [[:alpha:]] follow the current POSIX locale. Only sensible after use locale. Rarely what you want - locale behaviour is implementation-specific and harder to reason about than /a or /u. Provided for legacy code that genuinely needs locale-dependent matching.

`/d` - dual-mode semantics#

Under /d, \d, \w, \s depend on whether the target string (or the pattern) is flagged as UTF-8 internally. This is the source of most "works on my machine" Unicode regexp bugs. Avoid /d in new code. The default under use v5.12 is /u.

If you inherit code where /d matters, the failure mode is: the same input string produces different match results depending on whether some upstream operation flagged the string UTF-8. The fix is use feature 'unicode_strings' (or use v5.12+), which makes the rules independent of the UTF-8 flag.
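The failure mode can be reproduced in a few lines. In this sketch utf8::upgrade stands in for "some upstream operation": it changes only the internal representation, so the two strings compare equal, yet /d treats them differently.

```perl
use strict;
use warnings;

my $native   = "\xE9";        # é stored as one byte, UTF-8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);     # same character, UTF-8 flag on

my $same_string = $native eq $upgraded ? 1 : 0;

# Explicit /d: the answer depends on the internal flag.
my $d_native   = $native   =~ /\A\w\z/d ? 1 : 0;
my $d_upgraded = $upgraded =~ /\A\w\z/d ? 1 : 0;

# Explicit /u: the answer depends only on the character.
my $u_native   = $native   =~ /\A\w\z/u ? 1 : 0;
my $u_upgraded = $upgraded =~ /\A\w\z/u ? 1 : 0;

print "equal strings: $same_string\n";
print "/d: native=$d_native upgraded=$d_upgraded\n";
print "/u: native=$u_native upgraded=$u_upgraded\n";
```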

Which modifier is in effect?#

Resolution order:

Explicit modifier on the regex (/a, /u, /l, /d).
use re '/u' (or other) in the lexical scope.
use locale selects /l.
use feature 'unicode_strings' (and use v5.12+) selects /u.
Otherwise, /d (the dual-mode fallback; avoid).

Programs should put use v5.12; (or later) at the top of each file; this gives you /u, which is what you want.
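One way to see which modifier a pattern ended up with is to stringify a compiled qr// object: Perl embeds the charset modifier in the (?^…:) prefix. The sketch below walks through part of the resolution order; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($fallback, $feature, $scoped, $explicit);

$fallback = "" . qr/\d/;          # no pragma in scope: /d, shown as a bare "^"

{
    use feature 'unicode_strings';
    $feature = "" . qr/\d/;       # the feature selects /u
}

{
    use feature 'unicode_strings';
    use re '/a';
    $scoped   = "" . qr/\d/;      # use re outranks the feature
    $explicit = "" . qr/\d/u;     # an explicit modifier outranks use re
}

print "$_\n" for $fallback, $feature, $scoped, $explicit;
```

The same trick is handy when debugging inherited code: printing a qr// object shows at a glance whether it was compiled under /d.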

Pragma interactions#

use v5.12;                          # implicit feature 'unicode_strings'
use feature 'unicode_strings';       # explicit
use re '/u';                        # set default modifier for this scope

use feature 'unicode_strings' (included in use v5.12) makes all strings Unicode for character-class purposes, regardless of whether they are marked UTF-8 internally. Turn it on at the top of every file.

Case folding#

Under /iu (the default), case-insensitive matching uses the full Unicode case-folding tables:

"Grüße" =~ /grüsse/iu;        # matches - ß folds to 'ss'
"Σίγμα" =~ /σίγμα/iu;          # matches - upper Σ ↔ lower σ
"İstanbul" =~ /istanbul/iu;   # does NOT match - İ (U+0130) folds to 'i' + U+0307, not to plain 'i'

Under /ia, case folding still follows Unicode rules; /a only restricts the character classes. Under /iaa, ASCII and non-ASCII characters never fold to each other (the Kelvin/K rule above).

The folding happens at compile time - pperl does not lowercase the input string at match time.
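Outside a pattern, the same folding tables are available through the fc function (Perl 5.16 and later), which is useful for comparing or indexing strings case-insensitively. A small sketch with illustrative variable names:

```perl
use v5.16;      # enables the fc feature
use warnings;

# ß folds to "ss", so both spellings fold to the same string.
my $eszett_equal = fc("Gr\x{FC}\x{DF}e") eq fc("GR\x{DC}SSE") ? 1 : 0;

# Capital sigma, small sigma, and final sigma all share one folded form.
my $sigma_equal = (fc("\x{03A3}") eq fc("\x{03C3}")
                && fc("\x{03C2}") eq fc("\x{03C3}")) ? 1 : 0;

# KELVIN SIGN folds to an ordinary ASCII "k".
my $kelvin_is_k = fc("\x{212A}") eq "k" ? 1 : 0;

# The regex engine applies the same multi-character fold under /i.
my $regex_agrees = "Gr\x{FC}\x{DF}e" =~ /\Agr\x{FC}sse\z/i ? 1 : 0;

print "eszett=$eszett_equal sigma=$sigma_equal kelvin=$kelvin_is_k regex=$regex_agrees\n";
```

Comparing fc($a) eq fc($b) is the string-level counterpart of matching with /i, and avoids the pitfalls of lc or uc for characters whose fold differs from their lowercase form.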

The `\b{…}` word-boundary variants#

Beyond plain \b, Perl offers semantic variants that handle real-world text better:

Boundary	Meaning
`\b{wb}`	Unicode word boundary - keeps `don't` together; `state-of-the-art` still splits at its hyphens
`\b{sb}`	sentence boundary
`\b{lb}`	line break (suitable for line wrapping)
`\b{gcb}`	grapheme cluster boundary (the same boundaries `\X` uses)

"don't" =~ /.+?\b{wb}/x;      # matches whole word - apostrophe is inside
"don't" =~ /.+?\b/x;          # stops at the apostrophe - plain \b splits

For natural-language processing, \b{wb} and \b{sb} are almost always what you want. Plain \b is for ASCII-identifier contexts.

The anchors and assertions chapter has the broader boundary discussion; this chapter is the home for the Unicode-aware variants because their definitions are Unicode-driven.
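A common use of \b{wb} is pulling the words out of running text. The sketch below extracts words twice, once with each kind of boundary, so the difference is visible side by side. It assumes Perl 5.22 or later, where \b{wb} is available; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "Don't panic";

# A word starts at a boundary with a word character and runs, lazily,
# to the next boundary.
my @unicode_words = $text =~ /\b{wb}(\w.*?)\b{wb}/g;
my @plain_words   = $text =~ /\b(\w.*?)\b/g;

print "with \\b{wb}: ", join(" | ", @unicode_words), "\n";
print "with \\b:     ", join(" | ", @plain_words),   "\n";
```

With \b{wb} the contraction stays in one piece; with plain \b the apostrophe counts as a non-word character and the word is split around it.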

Script runs#

Different writing systems have visually similar characters. The Latin a and the Cyrillic а (U+0430) are indistinguishable in most fonts, and the Greek α (U+03B1) comes close. Used together - typically in a URL or identifier - they enable a homograph attack:

paypal.com

It could be all Latin (legitimate). It could contain a Cyrillic а embedded in otherwise-Latin text (a phishing attack pointing to a different domain). Both render identically in the browser; the underlying code points differ.

A script run is a sequence of characters all from the same Unicode script (per Script_Extensions and Unicode UTS 39). Most legitimate words are script runs; mixed-script strings are rare outside specific multilingual contexts and almost always suspicious.

Perl provides four constructs to require that a matched sub-pattern be a script run:

Form	Meaning
`(*script_run:PAT)` / `(*sr:PAT)`	match `PAT`, then check it is a script run
`(*atomic_script_run:PAT)` / `(*asr:PAT)`	same, but atomic (faster, less backtracking)

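# Accept a domain label only if the whole label is one script run.
# The anchors matter: unanchored, \w+ could still match a single-script
# fragment inside a mixed-script label.
$label =~ /\A (*sr: \w+ ) \z/x or warn "mixed-script label";

# Atomic version: faster on adversarial input.
$label =~ /\A (*asr: \w+ ) \z/x or warn "mixed-script label";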

The atomic form is what you want in most production code - it prevents the script-run check from being defeated by catastrophic backtracking on malicious input. (*sr:…) is the non-atomic form; (*asr:PAT) is equivalent to (*sr:(?>PAT)).
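Applied to the paypal example, a check might look like the sketch below. is_single_script_label is a hypothetical helper name, and the pattern is anchored at both ends so that the whole label, not just a fragment of it, has to be a script run. Script runs need Perl 5.28 or later; to my knowledge, releases before 5.32 also emit an "experimental" warning for them.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the entire label is one script run of word characters.
sub is_single_script_label {
    my ($label) = @_;
    return $label =~ /\A (*asr: \w+ ) \z/x ? 1 : 0;
}

my $genuine = "paypal";             # all Latin
my $spoofed = "p\x{0430}ypal";      # second letter is CYRILLIC SMALL LETTER A

my $genuine_ok = is_single_script_label($genuine);
my $spoofed_ok = is_single_script_label($spoofed);

print "genuine label accepted: $genuine_ok\n";
print "spoofed label accepted: $spoofed_ok\n";
```

Both labels consist entirely of \w characters, so a plain /\A\w+\z/ would accept them equally; only the script-run wrapper tells them apart.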

Script run rules#

A sequence is a script run if and only if all of:

No code point in the sequence has the Script_Extensions property Unknown. This excludes private-use and surrogate code points; you cannot smuggle them through.
All characters come from the Common script, the Inherited script, and at most one other script.
All decimal digits come from the same set of ten. Many scripts have their own digits (Arabic-Indic, Devanagari, etc.); a string mixing 1 (ASCII) with ١ (Arabic-Indic one) is not a script run, even if other characters are compatible.

Three pseudo-scripts get special handling:

Common - punctuation, ASCII digits 0–9, mathematical symbols, emoji, full-width digits. These can appear in any script run except that all decimal digits in the sequence must still come from the same set of ten.
Inherited - combining marks (accents, diacritics). These attach to the previous character and inherit its script.
Unknown - unassigned code points and similar. A single-character string of unknown is allowed; a longer string containing one is not.

The relaxation for Common and Inherited is what allows real-world text to be a script run despite punctuation and combining marks. The strict-equality rule for digits is what defeats homograph attacks that mix digits from look-alike scripts.
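The rules can be tried out on a handful of short strings. This sketch wraps an arbitrary non-empty string in an anchored (*sr:…) and records the verdict for each case; the labels and variable names are illustrative, and Perl 5.28 or later is assumed.

```perl
use strict;
use warnings;

my $whole_run = qr/\A(*sr:.+)\z/s;    # the entire string must be one script run

my @cases = (
    [ "Latin, ASCII digits, hyphen"  => "abc-123" ],            # Latin + Common
    [ "Latin with combining acute"   => "e\x{0301}" ],          # Latin + Inherited
    [ "Arabic-Indic digits only"     => "\x{0661}\x{0662}" ],   # one digit set
    [ "ASCII and Arabic-Indic digit" => "1\x{0661}" ],          # two digit sets
    [ "Latin and Cyrillic letters"   => "a\x{0430}" ],          # two real scripts
);

my %is_run;
for my $case (@cases) {
    my ($label, $string) = @$case;
    $is_run{$label} = $string =~ $whole_run ? 1 : 0;
    printf "%-30s %s\n", $label, $is_run{$label} ? "script run" : "not a script run";
}
```

The fourth case is the interesting one: each digit on its own is acceptable, and the string fails only because the two digits come from different sets of ten.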

When script runs help#

URL parsing: refuse domain labels that are not script runs.
Identifier validation: programming-language identifiers should normally be script runs.
Form input sanitisation: real names are almost always script runs; mixed-script names are usually attacks or test data.

Script runs arrived in Perl 5.28 (2018). They are what Perl’s regex engine adds to a problem the rest of the engine cannot solve directly: a property of the whole match, not of any single character.

Cross-engine: Unicode coverage#

Feature	Perl 5.42	PCRE2	Emacs (limited)	POSIX	RE2 / Go
`\p{General_Category}`	yes	yes	partial	no	yes
`\p{Script}`	yes	yes	partial	no	yes
`\p{Property=Value}`	yes	yes	partial	no	yes
`\X` grapheme cluster	yes	yes	no	no	no
`\b{wb}`, `\b{sb}`	yes	no	no	no	no
Script Runs `(*sr:…)`	yes	yes (recent releases)	no	no	no
Default to Unicode `\d/\w`	under `/u`	no (Unicode under `(*UCP)`)	no	no	no (ASCII only; use `\p{…}`)

Perl is a leader here: \b{wb} is unique to Perl within the comparison set, and Script Runs are otherwise found only in recent PCRE2 releases. RE2 / Go and PCRE2 both default to ASCII for \d, \w, \s; PCRE2 switches them to Unicode with the PCRE2_UCP option or (*UCP), while RE2 / Go keep them ASCII and rely on \p{…} for Unicode classes. PCRE2 has property support but lacks the boundary variants. Emacs has property support that varies by build.

The full table is in the cross-engine chapter.

Encoding vs. character#

A Perl string is always a sequence of code points. The bytes on disk are the I/O layer’s concern - put use utf8 at the top of files that contain non-ASCII source-code literals, put :encoding(UTF-8) on filehandles, and call decode/encode from Encode when crossing the boundary. Patterns never operate on the encoded bytes; they always operate on code points.

use utf8;            # source code is UTF-8
use feature 'unicode_strings';
use open ':std', ':encoding(UTF-8)';  # STDIN/STDOUT/STDERR

while (my $line = <>) {
    $line =~ /\p{Lu}/ and print "has uppercase letter\n";
}

Get the I/O right once, and patterns just work on characters for the rest of the program.
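For data that arrives as raw bytes (a socket, a binary file, an environment variable), the boundary crossing is explicit. A minimal sketch using the core Encode module, with illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC3\xA9";                  # the two UTF-8 bytes that encode é
my $chars = decode('UTF-8', $bytes);     # one character, U+00E9

# Before decoding, the pattern sees two unrelated characters (Ã and ©);
# after decoding, it sees a single lowercase letter.
my $bytes_is_one_letter = $bytes =~ /\A\p{Ll}\z/ ? 1 : 0;
my $chars_is_one_letter = $chars =~ /\A\p{Ll}\z/ ? 1 : 0;

# Encoding on the way out restores the original bytes.
my $round_trip = encode('UTF-8', $chars) eq $bytes ? 1 : 0;

printf "bytes: length %d, one letter: %d\n", length($bytes), $bytes_is_one_letter;
printf "chars: length %d, one letter: %d\n", length($chars), $chars_is_one_letter;
print  "round trip ok: $round_trip\n";
```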

See also#

The character classes chapter - non-Unicode class basics, (?[…]).
The anchors and assertions chapter - \b, \B, plus the \b{…} family in their boundary role.
The modifiers chapter - /i, /x, /m, /s, and how they combine with charset modifiers.
The cross-engine chapter - Unicode coverage across regex engines.
quotemeta - Unicode-aware metacharacter escaping.