Unicode#
Perl’s strings are Unicode strings. Every code point from
\x{0000} through \x{10FFFF} is a valid character, regardless of
the underlying byte encoding. Regexps operate on characters, not
bytes — so \w, \d, \s, ., and \X all understand the
Unicode world.
This chapter covers how to write Unicode characters in a pattern,
how to match by Unicode property, and how the charset modifiers
/a, /u, /l, /d tune the default character-class behaviour.
Writing Unicode in a pattern#
Four ways to denote a specific code point:
| Form | Example | Notes |
|---|---|---|
| \xHH | \xE9 | Two hex digits, 0–255 |
| \x{H…} | \x{263A} | Any code point |
| \o{…} or \NNN | \o{351} | Octal |
| \N{NAME} | \N{GREEK SMALL LETTER SIGMA} | Unicode character name |
/\x{263a}/; # WHITE SMILING FACE
/\N{GREEK SMALL LETTER SIGMA}/; # σ
/\N{U+03C3}/; # same codepoint via U+ form
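Property names inside \p{…} are matched loosely: case, whitespace,
and underscores are ignored. Character names inside \N{…} must be
spelled exactly by default; with use charnames ':loose' in effect,
case, spaces, and underscores are relaxed there too, so
\N{greek small letter sigma} also works.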
Inside a character class, the same escapes work:
/[\x{0370}-\x{03ff}]/; # any Greek block character
/[\N{SECTION SIGN}\N{PILCROW SIGN}]/;
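As a quick check that these notations are interchangeable, the sketch below writes one character in each form and matches it against the same one-character string. The character (é, U+00E9) is chosen here purely for illustration, and the code assumes Perl 5.16 or later, where \N{…} names load without an explicit use charnames.

```perl
use strict;
use warnings;

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, picked only as an illustration.
my $e_acute = "\x{E9}";

my @patterns = (
    qr/\A\xE9\z/,                                 # two hex digits
    qr/\A\x{00E9}\z/,                             # braced hex, any code point
    qr/\A\o{351}\z/,                              # braced octal (0351 == 0xE9)
    qr/\A\N{LATIN SMALL LETTER E WITH ACUTE}\z/,  # character name
    qr/\A\N{U+00E9}\z/,                           # U+ form
);

# Count how many of the notations match the same character.
my $matched = grep { $e_acute =~ $_ } @patterns;
print "$matched of ", scalar(@patterns), " notations matched\n";
```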
Matching by property: \p{…} and \P{…}#
Unicode classifies every code point by dozens of properties. The most useful in everyday patterns:
| Property | Matches |
|---|---|
| \p{L} | any letter |
| \p{Ll} | lowercase letter |
| \p{Lu} | uppercase letter |
| \p{Lt} | titlecase letter |
| \p{N} | any numeric character |
| \p{Nd} | decimal digit |
| \p{P} | punctuation |
| \p{M} | mark (combining character) |
| \p{S} | symbol |
| \p{Z} | separator (including spaces) |
| \p{C} | control, format, unassigned, private-use |
| \p{ASCII} | code point 0–127 |
| \p{Space} | any Unicode whitespace |
\P{X} is the negation of \p{X} — matches any code point not
in property X.
Short single-letter properties drop the braces:
/\pL/; # letter — same as \p{L}
/\pN/; # number
/\pM/; # mark
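To make the property classes concrete, here is a small sketch that counts a few categories in a mixed Latin/Greek string. The sample text and variable names are illustrative assumptions; the string is written with escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# "Zo<e-diaeresis> paid 42<euro> in <Athina in Greek letters>", as escapes.
my $text = "Zo\x{EB} paid 42\x{20AC} in \x{391}\x{3B8}\x{3AE}\x{3BD}\x{3B1}";

# The "= () =" idiom forces list context and counts the matches.
my $letters   = () = $text =~ /\p{L}/g;       # letters of any script
my $upper     = () = $text =~ /\p{Lu}/g;      # uppercase letters only
my $digits    = () = $text =~ /\p{Nd}/g;      # decimal digits
my $symbols   = () = $text =~ /\p{S}/g;       # symbols (the euro sign)
my $non_ascii = () = $text =~ /\P{ASCII}/g;   # negated property

print "letters=$letters upper=$upper digits=$digits ",
      "symbols=$symbols non_ascii=$non_ascii\n";
# prints: letters=14 upper=2 digits=2 symbols=1 non_ascii=7
```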
Matching by script#
Scripts (writing systems) are properties too:
/\p{Latin}/;
/\p{Greek}/;
/\p{Cyrillic}/;
/\p{Arabic}/;
/\p{Han}/; # Chinese, Japanese kanji, Korean hanja
/\p{Hiragana}/;
/\p{Katakana}/;
/\p{Hangul}/;
/\p{Script=Greek}/; # compound form, same meaning
For linguistic classification, prefer \p{Script_Extensions=…} —
it handles code points used by multiple scripts more sensibly than
\p{Script=…}. Perl’s \p{Greek} shorthand actually uses
Script_Extensions under the hood.
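The difference between the two properties shows up on shared characters. The sketch below uses U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK), which Unicode assigns to the Common script but lists under both Hiragana and Katakana in Script_Extensions. The choice of test character is mine, not the chapter's.

```perl
use strict;
use warnings;

# U+30FC is used in both kana scripts, so its Script value is Common.
my $mark = "\x{30FC}";

my $by_script     = $mark =~ /\p{Script=Hiragana}/            ? 1 : 0;
my $by_ext_hira   = $mark =~ /\p{Script_Extensions=Hiragana}/ ? 1 : 0;
my $by_ext_kana   = $mark =~ /\p{Script_Extensions=Katakana}/ ? 1 : 0;
my $script_common = $mark =~ /\p{Script=Common}/              ? 1 : 0;

print "Script=Hiragana:            $by_script\n";      # 0
print "Script_Extensions=Hiragana: $by_ext_hira\n";    # 1
print "Script_Extensions=Katakana: $by_ext_kana\n";    # 1
print "Script=Common:              $script_common\n";  # 1
```

A pattern such as /\p{Script=Hiragana}+/ would therefore stop at the sound mark inside a Hiragana word, while the Script_Extensions form would not.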
Compound properties#
Some properties have multiple values. The compound form
\p{Property=Value} (or \p{Property:Value}) spells them out:
/\p{General_Category=Uppercase_Letter}/; # same as \p{Lu}
/\p{Block=Greek_and_Coptic}/;
/\p{Numeric_Type=Decimal}/;
/\p{Bidi_Class=R}/; # right-to-left characters
Case, spaces, and underscores are ignored in the property name and
value, so \p{gc=lu} is the same as \p{General_Category=Uppercase_Letter}.
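One way to convince yourself that the spellings are interchangeable is to run several of them over the same sample and compare the verdicts. The sample code points below are arbitrary picks for illustration.

```perl
use strict;
use warnings;

# Different spellings of "uppercase letter".
my @spellings = (
    qr/\p{Lu}/,
    qr/\p{gc=lu}/,
    qr/\p{GC:Lu}/,
    qr/\p{General_Category=Uppercase_Letter}/,
    qr/\p{general_category=uppercase_letter}/,
);

# A, a, GREEK CAPITAL SIGMA, greek small sigma, digit 1, and a titlecase letter.
my @sample = map { chr } 0x41, 0x61, 0x3A3, 0x3C3, 0x31, 0x1C5;

my ($disagreements, $uppercase) = (0, 0);
for my $ch (@sample) {
    my @verdicts = map { $ch =~ $_ ? 1 : 0 } @spellings;
    $disagreements++ if grep { $_ != $verdicts[0] } @verdicts;
    $uppercase++     if $verdicts[0];
}

print "disagreements=$disagreements uppercase=$uppercase\n";
```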
Grapheme clusters: \X#
A grapheme cluster is what users perceive as a single character.
A Danish Å can be one code point (U+00C5) or two (A followed
by the combining ring above, U+030A). Both look identical on screen. \X
matches either form as one atomic user-visible character.
my $s = "A\x{030A}"; # A + combining ring
length($s); # 2 — code-point count
$s =~ /./; # matches just the 'A'
$s =~ /\X/; # matches the whole cluster
Use \X when iterating visibly across user-facing text: cursor
movement, column counting, truncation.
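Truncation is the case where the difference bites most often. The following sketch, using a made-up sample string, compares cutting by code point with cutting by grapheme cluster.

```perl
use strict;
use warnings;

# "Angstrom" with a decomposed first letter (A + combining ring) and o-diaeresis.
my $name = "A\x{030A}ngstr\x{F6}m";

my @clusters   = $name =~ /\X/g;   # one element per user-visible character
my $codepoints = length $name;     # counts code points, not clusters
my $visible    = scalar @clusters;

# Keep the first three visible characters.
my ($by_cluster)  = $name =~ /\A(\X{0,3})/;   # A+ring, n, g
my $by_codepoint  = substr($name, 0, 3);      # A, ring, n: only two visible

print "code points: $codepoints, clusters: $visible\n";
print "cluster cut keeps ", length($by_cluster), " code points\n";
```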
The charset modifiers#
The flags /a, /u, /l, /d control how the shorthand classes
(\d, \w, \s) and case folding behave. They are orthogonal to
/i, /m, /s, /x.
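A typical place where the choice matters is input validation. The sketch below, with an illustrative input of three Arabic-Indic digits, shows the same pattern accepting or rejecting the string depending only on the charset modifier.

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT ONE, TWO, THREE.
my $input = "\x{0661}\x{0662}\x{0663}";

my $unicode_ok = $input =~ /\A\d+\z/u ? 1 : 0;   # \d covers every decimal digit
my $ascii_ok   = $input =~ /\A\d+\z/a ? 1 : 0;   # \d is [0-9] only
my $plain_ok   = "123"  =~ /\A\d+\z/a ? 1 : 0;

print "/u accepts: $unicode_ok, /a accepts: $ascii_ok, ASCII digits under /a: $plain_ok\n";
# prints: /u accepts: 1, /a accepts: 0, ASCII digits under /a: 1
```

If the matched text is later handed to code that expects [0-9], /a (or an explicit [0-9]) is usually the safer choice.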
/u — Unicode semantics#
\d, \w, \s, \b, case folding, and [[:alpha:]] use full
Unicode definitions. This is the default when use v5.12 or
later is in effect.
"item\x{0660}" =~ /\w+\d/u; # matches — U+0660 is an Arabic-Indic zero
"naïve" =~ /\w+/u; # matches the whole word
/a — ASCII-only classes#
Restricts \d, \w, \s, [[:…:]] to the ASCII range. Useful
when you want English-like text and no more:
"item\x{0660}" =~ /\w+\d/a; # does not match — \d is [0-9] under /a
"item0" =~ /\w+\d/a; # matches
/aa (doubled) additionally prevents case-insensitive matches
crossing the ASCII/non-ASCII boundary. Under plain /ia, K could
match \x{212A} (KELVIN SIGN) because the Kelvin sign casefolds to
k. Under /iaa it does not:
"K" =~ /k/i; # matches
"\x{212A}" =~ /k/i; # matches — Kelvin sign casefolds to 'k'
"\x{212A}" =~ /k/iaa; # does NOT match — ASCII-only i
/l — locale semantics#
\d, \w, \s, case folding, and [[:alpha:]] follow the
current POSIX locale. Only sensible after use locale. Rarely
what you want — locale behaviour is implementation-specific and
harder to reason about than /a or /u.
/d — default pre-5.14 semantics#
Pre-Perl-5.14 rules: \d, \w, \s depend on whether the
target string is flagged as UTF-8 internally. This is the source
of most “works on my machine” Unicode regexp bugs. Avoid /d in
new code. The default under use v5.12 is /u.
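The flag dependency can be reproduced in a few lines. This sketch takes one string, makes a copy that differs only in its internal representation (via the core utf8::upgrade function), and shows that /d gives different answers for the two while /u does not.

```perl
use strict;
use warnings;

my $native   = "\xE9";      # e-acute, stored as a single native byte
my $upgraded = $native;
utf8::upgrade($upgraded);   # same character, now stored as UTF-8 internally

my $same       = $native eq $upgraded ? 1 : 0;   # the strings compare equal

my $d_native   = $native   =~ /\w/d ? 1 : 0;     # native rules: not a word char
my $d_upgraded = $upgraded =~ /\w/d ? 1 : 0;     # Unicode rules: a word char
my $u_native   = $native   =~ /\w/u ? 1 : 0;
my $u_upgraded = $upgraded =~ /\w/u ? 1 : 0;

print "equal=$same  /d: $d_native vs $d_upgraded  /u: $u_native vs $u_upgraded\n";
# prints: equal=1  /d: 0 vs 1  /u: 1 vs 1
```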
Pragma interactions#
Several pragmas affect default regexp behaviour:
use v5.12; # implicit feature 'unicode_strings'
use feature 'unicode_strings'; # explicit
use re '/u'; # set default modifier for this scope
use feature 'unicode_strings' (included in use v5.12) makes
all strings Unicode for character-class purposes, regardless of
whether they are marked UTF-8 internally. Turn it on at the top
of every file.
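These pragmas are lexically scoped, which the following sketch tries to show. The first block switches unicode_strings off explicitly so the result does not depend on whatever the enclosing file has enabled; the second block sets /u as the default with use re.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";   # not UTF-8 flagged internally
my ($without, $with);

{
    no feature 'unicode_strings';   # force the old /d default in this block
    $without = $e_acute =~ /\w/ ? 1 : 0;
}
{
    use re '/u';                    # /u becomes the default in this block only
    $with = $e_acute =~ /\w/ ? 1 : 0;
}

print "default /d: $without, under use re '/u': $with\n";
# prints: default /d: 0, under use re '/u': 1
```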
Case folding#
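Under /i with /u semantics (the default charset in modern Perl),
case-insensitive matching uses the full Unicode case-folding tables:
"Grüße" =~ /grüsse/iu; # matches: ß folds to 'ss'
"Σίγμα" =~ /σίγμα/iu; # matches: upper Σ ↔ lower σ
"İstanbul" =~ /istanbul/iu; # does NOT match: İ (U+0130) folds to 'i' + U+0307, not plain 'i'
Under /ia, case folding still follows the Unicode rules; /a restricts
only the character classes. Under /iaa, an ASCII character never
matches a non-ASCII one, so the Kelvin sign no longer matches k and
ß no longer matches ss.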
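The folding rules can be checked side by side. The sketch below writes the non-ASCII characters as escapes so it does not rely on use utf8; the test strings are illustrative.

```perl
use strict;
use warnings;

# U+00DF is sharp s, U+03A3/U+03C3 are capital/small sigma,
# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE.
my $sharp_s_u  = "stra\x{DF}e"      =~ /\Astrasse\z/iu   ? 1 : 0;  # full folding
my $sharp_s_aa = "stra\x{DF}e"      =~ /\Astrasse\z/iaa  ? 1 : 0;  # no ASCII/non-ASCII mix
my $sigma_a    = "\x{3A3}"          =~ /\A\x{3C3}\z/ia   ? 1 : 0;  # /a still folds
my $dotted_i   = "\x{130}stanbul"   =~ /\Aistanbul\z/iu  ? 1 : 0;  # folds to i + U+0307

print "sharp s /iu=$sharp_s_u /iaa=$sharp_s_aa  sigma /ia=$sigma_a  dotted I /iu=$dotted_i\n";
# prints: sharp s /iu=1 /iaa=0  sigma /ia=1  dotted I /iu=0
```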
The \b{…} word-boundary variants#
Beyond plain \b, Perl offers semantic variants that handle
real-world text better:
- \b{wb}: word boundary by Unicode “word break” rules. Handles apostrophes, hyphens, numerics-with-commas, and script-specific rules.
- \b{sb}: sentence boundary.
- \b{lb}: line break (suitable for wrapping).
- \b{gcb}: grapheme cluster boundary (same effect as \X).
"don't" =~ /.+?\b{wb}/x; # matches whole word — apostrophe is inside
"don't" =~ /.+?\b/x; # stops at the apostrophe — plain \b splits
For natural-language processing, \b{wb} and \b{sb} are almost
always what you want.
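A rough word tokenizer falls out of \b{wb} directly. The sketch below assumes Perl 5.22 or later (where \b{wb} is available) and uses a made-up sentence; it cuts the text at every word-break boundary and keeps the pieces that contain a word character.

```perl
use strict;
use warnings;

my $text = "Don't panic, it's 3,14.";

# Every piece runs from one word-break boundary to the next,
# so spaces and punctuation come out as pieces of their own.
my @pieces = $text =~ /(.+?\b{wb})/g;

# Keep only the pieces that contain at least one word character.
my @words = grep { /\w/ } @pieces;

print join("|", @words), "\n";
# prints: Don't|panic|it's|3,14
```

With plain \b the same approach would cut "Don't" at the apostrophe and "3,14" at the comma.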
Encoding vs. character#
A Perl string is always a sequence of code points. The bytes on
disk are the I/O layer’s concern: declare use utf8 for source-code
literals, put :encoding(UTF-8) on filehandles, and call decode/encode
from Encode when crossing the boundary. Once input is decoded, patterns
operate on code points, never on the encoded bytes.
use utf8; # source code is UTF-8
use feature 'unicode_strings';
use open ':std', ':encoding(UTF-8)'; # STDIN/STDOUT/STDERR
while (my $line = <>) {
    $line =~ /\p{Lu}/ and print "has uppercase letter\n";
}
Get the I/O right once, and patterns just work on characters for the rest of the program.
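To see what goes wrong when the decoding step is skipped, the sketch below runs the same pattern over raw UTF-8 bytes and over the decoded string. It uses decode and encode from the core Encode module; the sample word is illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 encoding of "naive" with i-diaeresis: the letter takes two bytes.
my $bytes = "na\xC3\xAFve";
my $chars = decode('UTF-8', $bytes);

# Undecoded, each byte is treated as its own code point, so \w+ stops
# in the middle of the encoded letter.
my ($from_bytes) = $bytes =~ /\A(\w+)/u;
my ($from_chars) = $chars =~ /\A(\w+)/u;

printf "bytes: %d long, \\w+ matched %d\n", length $bytes, length $from_bytes;
printf "chars: %d long, \\w+ matched %d\n", length $chars, length $from_chars;

# Encoding on the way out restores the original bytes.
my $round_trip = encode('UTF-8', $chars) eq $bytes ? 1 : 0;
```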
See also#
- perlre: full property reference.
- quotemeta: Unicode-aware metacharacter escaping.
- The character classes chapter: non-Unicode class basics.
- The modifiers chapter: /i, /x, and other modifiers that combine with charset modifiers.