Unicode

Perl strings are sequences of Unicode code points. Any code point from \x{0000} through \x{10FFFF} can appear in a string, regardless of the underlying byte encoding. Regexps operate on characters, not bytes, so \w, \d, \s, ., and \X all work in terms of Unicode characters.

This chapter covers how to write Unicode characters in a pattern, how to match by Unicode property, and how the charset modifiers /a, /u, /l, /d tune the default character-class behaviour.

Writing Unicode in a pattern

Four ways to denote a specific code point:

| Form | Example | Notes |
| --- | --- | --- |
| \xHH | \x41 (A) | Two hex digits, 0–255 |
| \x{…} | \x{263a} (☺) | Any code point |
| \o{…} | \o{101} (A) | Octal |
| \N{name} | \N{GREEK SMALL LETTER SIGMA} | Unicode character name |

/\x{263a}/;                 # WHITE SMILING FACE
/\N{GREEK SMALL LETTER SIGMA}/;  # σ
/\N{U+03C3}/;               # same codepoint via U+ form

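Inside \p{…}, case, whitespace, and underscores are ignored. \N{…} is stricter: by default the name must be spelled exactly as Unicode defines it, so \N{GREEK SMALL LETTER SIGMA} works but \N{greek_small_letter_sigma} does not. Loose name matching is available with use charnames ':loose'.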

Inside a character class, the same escapes work:

/[\x{0370}-\x{03ff}]/;       # any Greek block character
/[\N{SECTION SIGN}\N{PILCROW SIGN}]/;
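As a quick check, the sketch below confirms that the different notations denote the same character, and shows the opt-in for loose name matching. The variable names are illustrative only, and it assumes Perl 5.16 or later so that \N{…} works without loading charnames explicitly.

```perl
use strict;
use warnings;

# Five spellings of U+0041 (LATIN CAPITAL LETTER A).
my @forms = (
    qr/\x41/,
    qr/\x{41}/,
    qr/\o{101}/,
    qr/\N{LATIN CAPITAL LETTER A}/,
    qr/\N{U+0041}/,
);

# Count the spellings that fail to match; zero means they all agree.
my $failures  = grep { "A" !~ $_ } @forms;
my $all_match = $failures == 0 ? 1 : 0;
print "all forms match 'A': $all_match\n";

# Loose name matching is opt-in and lexically scoped.
my $loose_ok;
{
    use charnames ':loose';
    $loose_ok = "\x{3C3}" =~ /\N{greek small letter sigma}/ ? 1 : 0;
}
print "loose name matched sigma: $loose_ok\n";
```

Outside the block that loads charnames ':loose', the lowercase name would be rejected at compile time.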

Matching by property: \p{…} and \P{…}

Unicode classifies every code point by dozens of properties. The most useful in everyday patterns:

| Property | Matches |
| --- | --- |
| \p{L} | any letter |
| \p{Ll} | lowercase letter |
| \p{Lu} | uppercase letter |
| \p{Lt} | titlecase letter |
| \p{N} | any numeric character |
| \p{Nd} | decimal digit |
| \p{P} | punctuation |
| \p{M} | mark (combining character) |
| \p{S} | symbol |
| \p{Z} | separator (including spaces) |
| \p{C} | control, format, unassigned, private-use |
| \p{ASCII} | code point 0–127 |
| \p{White_Space} | any Unicode whitespace |

\P{…} negates \p{…}: for example, \P{L} matches any code point that is not a letter.

Single-letter properties can drop the braces:

/\pL/;    # letter — same as \p{L}
/\pN/;    # number
/\pM/;    # mark
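To see the general categories at work, here is a minimal sketch that tallies a short string by property. The helper count_matches and the sample text are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times $re matches in $string.
sub count_matches {
    my ($string, $re) = @_;
    my $n = () = $string =~ /$re/g;
    return $n;
}

# "Cafe" + COMBINING ACUTE ACCENT, a space,
# ARABIC-INDIC DIGIT FOUR and TWO, then "!".
my $text = "Cafe\x{301} \x{664}\x{662}!";

my %count = (
    letters     => count_matches($text, qr/\p{L}/),
    marks       => count_matches($text, qr/\pM/),
    digits      => count_matches($text, qr/\p{Nd}/),
    punctuation => count_matches($text, qr/\p{P}/),
    non_ascii   => count_matches($text, qr/\P{ASCII}/),
);

printf "%-12s %d\n", $_, $count{$_} for sort keys %count;
```

The combining accent is counted as a mark, not a letter, which is why \p{L} alone is not enough to match accented text in decomposed form.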

Matching by script

Scripts (writing systems) are properties too:

/\p{Latin}/;
/\p{Greek}/;
/\p{Cyrillic}/;
/\p{Arabic}/;
/\p{Han}/;            # Chinese, Japanese kanji, Korean hanja
/\p{Hiragana}/;
/\p{Katakana}/;
/\p{Hangul}/;
/\p{Script=Greek}/;   # compound form, same meaning

For linguistic classification, prefer \p{Script_Extensions=…}: it handles code points shared by several scripts more sensibly than \p{Script=…}. Since Perl 5.26, the \p{Greek} shorthand means Script_Extensions=Greek; in earlier versions it meant Script=Greek.
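The difference between the two properties shows up on shared punctuation. The sketch below uses U+3001 IDEOGRAPHIC COMMA, which Unicode assigns to the Common script but lists under several East Asian scripts in Script_Extensions; the variable names are illustrative.

```perl
use strict;
use warnings;

# U+3001 IDEOGRAPHIC COMMA is used with Han, Hiragana, Katakana, and others.
my $comma = "\x{3001}";

my $script_han     = $comma =~ /\p{Script=Han}/            ? 1 : 0;
my $script_common  = $comma =~ /\p{Script=Common}/         ? 1 : 0;
my $extensions_han = $comma =~ /\p{Script_Extensions=Han}/ ? 1 : 0;

print "Script=Han:            $script_han\n";
print "Script=Common:         $script_common\n";
print "Script_Extensions=Han: $extensions_han\n";
```

A pattern that collects runs of Han text with \p{Script=Han} would therefore stop at this comma, while the Script_Extensions form keeps it with the surrounding text.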

Compound properties

Some properties have multiple values. The compound form \p{Property=Value} (or \p{Property:Value}) spells them out:

/\p{General_Category=Uppercase_Letter}/;    # same as \p{Lu}
/\p{Block=Greek_and_Coptic}/;
/\p{Numeric_Type=Decimal}/;
/\p{Bidi_Class=R}/;                          # strong right-to-left (e.g. Hebrew letters)

Case, spaces, and underscores are ignored in the property name and value, so \p{gc=lu} is the same as \p{General_Category=Uppercase_Letter}.
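A small sketch, with illustrative variable names, that checks several spellings of the same property against one character and contrasts two numeric properties:

```perl
use strict;
use warnings;

my $sigma = "\x{3A3}";    # GREEK CAPITAL LETTER SIGMA

# Four spellings of General_Category=Uppercase_Letter.
my @spellings = (
    qr/\p{Lu}/,
    qr/\p{gc=lu}/,
    qr/\p{General_Category=Uppercase_Letter}/,
    qr/\p{General_Category: Uppercase_Letter}/,
);
my $hits  = grep { $sigma =~ $_ } @spellings;
my $agree = $hits == @spellings ? 1 : 0;

my $in_block = $sigma =~ /\p{Block=Greek_and_Coptic}/ ? 1 : 0;

# U+00BD VULGAR FRACTION ONE HALF is numeric but not a decimal digit.
my $half            = "\x{BD}";
my $half_is_number  = $half =~ /\p{N}/                    ? 1 : 0;
my $half_is_decimal = $half =~ /\p{Numeric_Type=Decimal}/ ? 1 : 0;

print "spellings agree: $agree, in Greek block: $in_block\n";
print "half: number=$half_is_number decimal=$half_is_decimal\n";
```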

Grapheme clusters: \X

A grapheme cluster is what users perceive as a single character. A Danish Å can be one code point (U+00C5) or two (A followed by combining ring U+030A). Both look identical on screen. \X matches either form as one user-visible character.

my $s = "A\x{030A}";     # A + combining ring
length($s);              # 2  — code-point count
$s =~ /./;               # matches just the 'A'
$s =~ /\X/;              # matches the whole cluster

Use \X when stepping through user-facing text one visible character at a time: cursor movement, column counting, truncation.
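Counting and truncating by grapheme cluster might look like the following sketch. The helper truncate_graphemes is hypothetical, not a built-in, and the sample word is spelled with combining marks on purpose.

```perl
use strict;
use warnings;

# "Angstrom" with combining marks: A + U+030A, o + U+0308.
my $word = "A\x{30A}ngstro\x{308}m";

my $code_points = length $word;
my $clusters    = () = $word =~ /\X/g;

# Hypothetical helper: keep at most $max user-visible characters.
sub truncate_graphemes {
    my ($string, $max) = @_;
    my ($head) = $string =~ /\A(\X{0,$max})/;
    return $head;
}

my $naive = substr($word, 0, 1);           # "A" only: the ring is cut off
my $safe  = truncate_graphemes($word, 1);  # "A" + U+030A stay together

print "code points: $code_points, clusters: $clusters\n";
print "naive length: ", length($naive), ", safe length: ", length($safe), "\n";
```

The code-point count and the cluster count differ by two here, one for each combining mark.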

The charset modifiers

The flags /a, /u, /l, /d control how the shorthand classes (\d, \w, \s), the POSIX classes, and case folding behave. They are mutually exclusive with one another, but combine freely with /i, /m, /s, /x.

/u: Unicode semantics

\d, \w, \s, \b, case folding, and [[:alpha:]] use full Unicode definitions. This is the default when use v5.12 or later is in effect.

"item\x{0660}" =~ /\w+\d/u;   # matches — U+0660 is an Arabic-Indic zero
"naïve" =~ /\w+/u;            # matches the whole word

/a: ASCII-only classes

Restricts \d, \w, \s, [[:…:]] to the ASCII range. Useful when you want English-like text and no more:

"item\x{0660}" =~ /\w+\d/a;   # does not match — \d is [0-9] under /a
"item0"        =~ /\w+\d/a;   # matches

/aa (doubled) additionally prevents case-insensitive matches crossing the ASCII/non-ASCII boundary. Under plain /ia, K could match \x{212A} (KELVIN SIGN) because the Kelvin sign casefolds to k. Under /iaa it does not:

"K"           =~ /k/i;      # matches
"\x{212A}"    =~ /k/i;      # matches — Kelvin sign casefolds to 'k'
"\x{212A}"    =~ /k/iaa;    # does NOT match — ASCII-only i

/l: locale semantics

\d, \w, \s, case folding, and [[:alpha:]] follow the current POSIX locale. Only sensible after use locale. Rarely what you want — locale behaviour is implementation-specific and harder to reason about than /a or /u.

/d: default pre-5.14 semantics

These are the pre-Perl-5.14 rules: for code points 128–255, \w, \s, and case folding depend on whether the target string or the pattern happens to be stored as UTF-8 internally. This is the source of most “works on my machine” Unicode regexp bugs. Avoid /d in new code. The default under use v5.12 is /u.
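The hazard is easy to reproduce. In this sketch two strings compare equal yet behave differently under /d, because one of them has been upgraded to UTF-8 internal storage with utf8::upgrade. The variable names are illustrative, and the explicit modifiers assume Perl 5.14 or later.

```perl
use strict;
use warnings;

my $native   = "\xE9";      # LATIN SMALL LETTER E WITH ACUTE, one-byte storage
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, UTF-8 internal storage

my $same = $native eq $upgraded ? 1 : 0;

my %is_word = (
    d_native   => ($native   =~ /\w/d ? 1 : 0),
    d_upgraded => ($upgraded =~ /\w/d ? 1 : 0),
    u_native   => ($native   =~ /\w/u ? 1 : 0),
    u_upgraded => ($upgraded =~ /\w/u ? 1 : 0),
    a_native   => ($native   =~ /\w/a ? 1 : 0),
    a_upgraded => ($upgraded =~ /\w/a ? 1 : 0),
);

print "strings equal: $same\n";
print "$_: $is_word{$_}\n" for sort keys %is_word;
```

Only the /d rows disagree with each other; /u and /a give the same answer for both storage formats.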

Pragma interactions

Several pragmas affect default regexp behaviour:

use v5.12;                # implicit feature 'unicode_strings'
use feature 'unicode_strings';   # explicit
use re '/u';              # set default modifier for this scope

use feature 'unicode_strings' (included in use v5.12) makes all strings Unicode for character-class purposes, regardless of whether they are marked UTF-8 internally. Turn it on at the top of every file.
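Because use re is lexically scoped, different parts of a program can choose different defaults. A minimal sketch, assuming Perl 5.14 or later; the two function names are hypothetical:

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT ONE, TWO, THREE
my $arabic_indic = "\x{661}\x{662}\x{663}";

sub is_ascii_number {
    use re '/a';    # default for patterns compiled inside this sub
    return $_[0] =~ /\A\d+\z/ ? 1 : 0;
}

sub is_unicode_number {
    use re '/u';
    return $_[0] =~ /\A\d+\z/ ? 1 : 0;
}

printf "%-8s ascii=%d unicode=%d\n", "123",
    is_ascii_number("123"), is_unicode_number("123");
printf "%-8s ascii=%d unicode=%d\n", "U+0661..",
    is_ascii_number($arabic_indic), is_unicode_number($arabic_indic);
```

The ASCII variant is usually the safer choice when the digits will later be treated as a number, since Perl's numeric conversion only understands 0-9.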

Case folding

Under /i with /u in effect (the default under use v5.12 or later), case-insensitive matching uses the full Unicode case-folding tables:

"Grüße" =~ /grüsse/iu;        # matches: ß folds to 'ss'
"Σίγμα" =~ /σίγμα/iu;          # matches: upper Σ and lower σ fold together
"İstanbul" =~ /istanbul/iu;   # does NOT match: İ (U+0130) folds to 'i' + U+0307

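Under /ia, case folding still follows Unicode rules; /a restricts only the character classes. Under /iaa, ASCII characters never match non-ASCII ones, which is the Kelvin/K rule above.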
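The same folding rules are available outside patterns through fc(). The sketch below assumes Perl 5.16 or later (use v5.16 enables both fc and unicode_strings); the variable names are illustrative.

```perl
use v5.16;      # enables strict, unicode_strings, and fc()
use warnings;

# Sharp s (U+00DF) folds to "ss", in fc() and in /i matching alike.
my $sharp_s_fc = fc("Stra\x{DF}e") eq fc("STRASSE") ? 1 : 0;
my $sharp_s_re = "Stra\x{DF}e" =~ /\Astrasse\z/i    ? 1 : 0;

# KELVIN SIGN (U+212A) folds to "k", unless /aa forbids the crossing.
my $kelvin_i   = "\x{212A}" =~ /\Ak\z/i   ? 1 : 0;
my $kelvin_iaa = "\x{212A}" =~ /\Ak\z/iaa ? 1 : 0;

# Dotted capital I (U+0130) folds to "i" + U+0307, not to a bare "i".
my $dotted_fc = fc("\x{130}") eq "i\x{307}" ? 1 : 0;
my $dotted_re = "\x{130}" =~ /\Ai\z/i       ? 1 : 0;

say "sharp s: fc=$sharp_s_fc regex=$sharp_s_re";
say "kelvin:  /i=$kelvin_i /iaa=$kelvin_iaa";
say "dotted I: fc=$dotted_fc regex=$dotted_re";
```

Comparing fc($x) eq fc($y) is a reasonable alternative to a case-insensitive pattern when the whole string must match.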

The \b{…} word-boundary variants

Beyond plain \b, Perl offers semantic variants that handle real-world text better:

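  • \b{wb}: word boundary by Unicode “word break” rules. Handles apostrophes inside words, decimal points and commas inside numbers, and script-specific rules.

  • \b{sb}: sentence boundary.

  • \b{lb}: line break opportunity (suitable for wrapping). Requires Perl 5.24 or later; the other variants require 5.22.

  • \b{gcb}: grapheme cluster boundary. A zero-width assertion that holds at the edges of what \X matches.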

"don't" =~ /.+?\b{wb}/x;      # matches whole word — apostrophe is inside
"don't" =~ /.+?\b/x;          # stops at the apostrophe — plain \b splits

For natural-language processing, \b{wb} and \b{sb} are almost always what you want.
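As a rough illustration, \b{wb} can drive a simple tokenizer. This is a sketch under the assumption of Perl 5.22 or later; the sample sentence and variable names are made up for the example.

```perl
use v5.22;      # \b{wb} needs Perl 5.22 or later
use warnings;

# Plain \b splits at the apostrophe; \b{wb} keeps the contraction whole.
my ($plain)   = "don't" =~ /(.+?)\b/;
my ($unicode) = "don't" =~ /(.+?)\b{wb}/;

# Split at every word boundary, then keep only pieces containing a word character.
my $text  = "I can't believe it's 3.14";
my @words = grep { /\w/ } split /\b{wb}/, $text;

say "plain \\b:  $plain";
say "\\b{wb}:    $unicode";
say "words:     ", join("|", @words);
```

The grep step discards the whitespace runs that split returns between words.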

Encoding vs. character

A Perl string is always a sequence of code points. The bytes on disk are the I/O layer’s concern: put use utf8 at the top of the file for source-code literals, set :encoding(UTF-8) on filehandles, and call decode/encode from Encode when crossing the boundary. Patterns operate on whatever code points the string holds; if the input was never decoded, those code points are the raw bytes.

use utf8;            # source code is UTF-8
use feature 'unicode_strings';
use open ':std', ':encoding(UTF-8)';  # STDIN/STDOUT/STDERR

while (my $line = <>) {
    $line =~ /\p{Lu}/ and print "has uppercase letter\n";
}

Get the I/O right once, and patterns just work on characters for the rest of the program.
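What goes wrong without decoding can be shown with a two-byte example. The sketch below uses decode and encode from the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 encoding of U+00E5 (LATIN SMALL LETTER A WITH RING ABOVE).
my $bytes = "\xC3\xA5";
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;    # two code points, one per byte
my $char_len = length $chars;    # one code point

# Anchored match for exactly one lowercase letter.
my $bytes_match = $bytes =~ /\A\p{Ll}\z/ ? 1 : 0;
my $chars_match = $chars =~ /\A\p{Ll}\z/ ? 1 : 0;

# Encoding on the way out restores the original bytes.
my $round_trip = encode('UTF-8', $chars) eq $bytes ? 1 : 0;

print "lengths: bytes=$byte_len chars=$char_len\n";
print "one lowercase letter? bytes=$bytes_match chars=$chars_match\n";
print "round trip ok: $round_trip\n";
```

The undecoded string still matches patterns, just not the ones intended: the pattern sees two unrelated Latin-1 characters instead of one letter.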

See also

  • perlre: full regexp reference; perluniprops lists every supported property.

  • quotemeta — Unicode-aware metacharacter escaping.

  • The character classes chapter — non-Unicode class basics.

  • The modifiers chapter — /i, /x, other modifiers that combine with charset modifiers.