---
name: regex and unicode
---
# Unicode

Perl's strings are Unicode strings. Every code point from
`\x{0000}` through `\x{10FFFF}` is a valid character, regardless of
the underlying byte encoding. Regexps operate on characters, not
bytes, so `\w`, `\d`, `\s`, `.`, and `\X` can all match characters
outside ASCII, subject to the charset modifiers described below.

This chapter covers how to write Unicode characters in a pattern,
how to match by Unicode property, and how the *charset modifiers*
`/a`, `/u`, `/l`, `/d` tune the default character-class behaviour.

## Writing Unicode in a pattern

Four ways to denote a specific code point:

| Form            | Example                       | Notes                  |
|-----------------|-------------------------------|------------------------|
| `\xHH`          | `\x41` (A)                    | Two hex digits, 0–255  |
| `\x{…}`         | `\x{263a}` (☺)                | Any code point         |
| `\o{…}`         | `\o{101}` (A)                 | Octal                  |
| `\N{name}`      | `\N{GREEK SMALL LETTER SIGMA}`| Unicode character name |

```perl
/\x{263a}/;                 # WHITE SMILING FACE
/\N{GREEK SMALL LETTER SIGMA}/;  # σ
/\N{U+03C3}/;               # same codepoint via U+ form
```

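Character names inside `\N{…}` must be spelled exactly as Unicode
lists them, in uppercase with single spaces, unless
`use charnames ':loose'` is in effect; with it,
`\N{greek_small_letter_sigma}` is accepted too. Property names
inside `\p{…}` are always matched loosely: case, whitespace, and
underscores are ignored.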

Inside a character class, the same escapes work:

```perl
/[\x{0370}-\x{03ff}]/;       # any Greek block character
/[\N{SECTION SIGN}\N{PILCROW SIGN}]/;
```
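As a quick check, the sketch below confirms that the different spellings all match the same character. It assumes Perl 5.16 or later, where `\N{…}` names load automatically; the variable names are illustrative only.

```perl
use v5.16;
use warnings;

# Several spellings of U+0041 LATIN CAPITAL LETTER A.
my @spellings = (
    qr/\x41/,
    qr/\x{41}/,
    qr/\o{101}/,
    qr/\N{LATIN CAPITAL LETTER A}/,
    qr/\N{U+0041}/,
);

# Count the spellings that fail to match a literal "A".
my $failures = grep { "A" !~ $_ } @spellings;
say $failures == 0 ? "all spellings match" : "$failures spelling(s) differ";

# The same escapes inside a bracketed class: U+00A7 is SECTION SIGN.
my $in_class = "\x{A7}" =~ /[\N{SECTION SIGN}\N{PILCROW SIGN}]/ ? 1 : 0;
say "class matched: $in_class";
```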

## Matching by property: \p{…} and \P{…}

Unicode classifies every code point by dozens of properties. The
most useful in everyday patterns:

| Property         | Matches                                |
|------------------|----------------------------------------|
| `\p{L}`          | any letter                             |
| `\p{Ll}`         | lowercase letter                       |
| `\p{Lu}`         | uppercase letter                       |
| `\p{Lt}`         | titlecase letter                       |
| `\p{N}`          | any numeric character                  |
| `\p{Nd}`         | decimal digit                          |
| `\p{P}`          | punctuation                            |
| `\p{M}`          | mark (combining character)             |
| `\p{S}`          | symbol                                 |
| `\p{Z}`          | separator (including spaces)           |
| `\p{C}`          | control, format, unassigned, private-use |
| `\p{ASCII}`      | code point 0–127                       |
| `\p{White_Space}`| any Unicode whitespace                 |

`\P{X}` is the negation of `\p{X}` — matches any code point *not*
in property X.

Single-letter property names may drop the braces:

```perl
/\pL/;    # letter — same as \p{L}
/\pN/;    # number
/\pM/;    # mark
```
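To see the general categories at work, here is a minimal sketch that labels a handful of characters. The `classify` helper and its labels are hypothetical, chosen only for illustration.

```perl
use v5.16;
use warnings;

# Hypothetical helper: return a coarse label for a single character.
sub classify {
    my ($ch) = @_;
    return 'uppercase letter' if $ch =~ /\p{Lu}/;
    return 'lowercase letter' if $ch =~ /\p{Ll}/;
    return 'decimal digit'    if $ch =~ /\p{Nd}/;
    return 'mark'             if $ch =~ /\pM/;
    return 'punctuation'      if $ch =~ /\pP/;
    return 'other';
}

# A, GREEK SMALL LETTER SIGMA, ARABIC-INDIC DIGIT ZERO,
# COMBINING ACUTE ACCENT, "!", WHITE SMILING FACE (a symbol).
for my $ch ("A", "\x{03C3}", "\x{0660}", "\x{0301}", "!", "\x{263A}") {
    printf "U+%04X: %s\n", ord($ch), classify($ch);
}
```

The order of the tests matters only where categories overlap; the general categories used here are mutually exclusive, so any order gives the same labels.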

## Matching by script

Scripts (writing systems) are properties too:

```perl
/\p{Latin}/;
/\p{Greek}/;
/\p{Cyrillic}/;
/\p{Arabic}/;
/\p{Han}/;            # Chinese, Japanese kanji, Korean hanja
/\p{Hiragana}/;
/\p{Katakana}/;
/\p{Hangul}/;
/\p{Script=Greek}/;   # compound form: the Script property, named explicitly
```

For linguistic classification, prefer `\p{Script_Extensions=…}`:
it handles code points used by multiple scripts more sensibly than
`\p{Script=…}`. Since Perl 5.26 the `\p{Greek}` shorthand means
`Script_Extensions=Greek`; on older Perls it means `Script=Greek`.
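The difference shows up on code points shared between scripts. A small sketch, using U+30FC (the prolonged sound mark written in both Hiragana and Katakana text) as the sample; the variable names are illustrative:

```perl
use v5.16;
use warnings;

my $mark = "\x{30FC}";   # KATAKANA-HIRAGANA PROLONGED SOUND MARK

# The Script property assigns this code point to Common, not Hiragana.
my $by_script = $mark =~ /\p{Script=Hiragana}/ ? 1 : 0;

# Script_Extensions lists it under both Hiragana and Katakana.
my $by_extensions = $mark =~ /\p{Script_Extensions=Hiragana}/ ? 1 : 0;

say "Script=Hiragana:            $by_script";
say "Script_Extensions=Hiragana: $by_extensions";
```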

## Compound properties

Some properties have multiple values. The compound form
`\p{Property=Value}` (or `\p{Property:Value}`) spells them out:

```perl
/\p{General_Category=Uppercase_Letter}/;    # same as \p{Lu}
/\p{Block=Greek_and_Coptic}/;
/\p{Numeric_Type=Decimal}/;
/\p{Bidi_Class=R}/;                          # strong right-to-left (Arabic letters are class AL)
```

Case, spaces, and underscores are ignored in the property name and
value, so `\p{gc=lu}` is the same as `\p{General_Category=Uppercase_Letter}`.
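One way to convince yourself that the spellings are interchangeable is to compare them over a range of code points. The sketch below does that for U+0000 through U+024F; the range and variable names are arbitrary choices for the example.

```perl
use v5.16;
use warnings;

my @spellings = (
    qr/\p{Lu}/,
    qr/\p{gc=lu}/,
    qr/\p{gc:Lu}/,
    qr/\p{General_Category=Uppercase_Letter}/,
    qr/\p{general_category=uppercase_letter}/,
);

# Compare every spelling against the first one, code point by code point.
my $disagreements = 0;
for my $cp (0 .. 0x24F) {
    my $ch       = chr $cp;
    my $expected = $ch =~ $spellings[0] ? 1 : 0;
    for my $re (@spellings[1 .. $#spellings]) {
        my $got = $ch =~ $re ? 1 : 0;
        $disagreements++ if $got != $expected;
    }
}
say "disagreements: $disagreements";
```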

## Grapheme clusters: \X

A *grapheme cluster* is what users perceive as a single character.
A Danish `å` can be one code point (`U+00E5`) or two (`a` followed
by combining ring above, `U+030A`). Both look identical on screen.
`\X` matches either form as one atomic user-visible character.

```perl
my $s = "a\x{030A}";     # a + combining ring above
length($s);              # 2: code-point count
$s =~ /./;               # matches just the 'a'
$s =~ /\X/;              # matches the whole cluster
```

Use `\X` when stepping through user-facing text one visible
character at a time: cursor movement, column counting, truncation.
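As an illustration of the truncation case, the following sketch compares cutting by cluster with cutting by code point. `truncate_clusters` is a hypothetical helper, not a built-in.

```perl
use v5.16;
use warnings;

# Hypothetical helper: keep at most $max user-perceived characters.
sub truncate_clusters {
    my ($text, $max) = @_;
    my ($head) = $text =~ /\A(\X{0,$max})/;
    return $head;
}

my $word = "a\x{030A}ben";                        # decomposed a-ring, then "ben"
my $cluster_count = () = $word =~ /\X/g;          # number of grapheme clusters
my $by_cluster    = truncate_clusters($word, 2);  # a-ring plus "b"
my $by_codepoint  = substr($word, 0, 2);          # only 'a' and its ring

printf "clusters: %d, code points: %d\n", $cluster_count, length($word);
printf "cluster cut keeps %d code points, substr keeps %d\n",
    length($by_cluster), length($by_codepoint);
```

Cutting with `substr` at an arbitrary code-point offset can also land between a base letter and its combining mark, which the `\X`-based version avoids.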

## The charset modifiers

The flags `/a`, `/u`, `/l`, `/d` control how the shorthand classes
(`\d`, `\w`, `\s`) and case folding behave. They are orthogonal to
`/i`, `/m`, `/s`, `/x`.

### /u — Unicode semantics

`\d`, `\w`, `\s`, `\b`, case folding, and `[[:alpha:]]` use full
Unicode definitions. This is the default when `use v5.12` or
later is in effect.

```perl
"item\x{0660}" =~ /\w+\d/u;   # matches: U+0660 is ARABIC-INDIC DIGIT ZERO
"naïve" =~ /\w+/u;            # matches the whole word (given use utf8 for the literal)
```

### /a — ASCII-only classes

Restricts `\d`, `\w`, `\s`, `[[:…:]]` to the ASCII range. Useful
when you *want* English-like text and no more:

```perl
"item\x{0660}" =~ /\w+\d/a;   # does not match — \d is [0-9] under /a
"item0"        =~ /\w+\d/a;   # matches
```

`/aa` (doubled) additionally prevents case-insensitive matches
crossing the ASCII/non-ASCII boundary. Under plain `/ia`, `K` could
match `\x{212A}` (KELVIN SIGN) because the Kelvin sign casefolds to
`k`. Under `/iaa` it does not:

```perl
"K"           =~ /k/i;      # matches
"\x{212A}"    =~ /k/i;      # matches — Kelvin sign casefolds to 'k'
"\x{212A}"    =~ /k/iaa;    # does NOT match — ASCII-only i
```
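A typical place where `/a` matters is input validation: a field that must hold ASCII digits should not accept digits from other scripts. A minimal sketch, with made-up variable names:

```perl
use v5.16;
use warnings;

my $input = "\x{0661}\x{0662}\x{0663}";   # ARABIC-INDIC DIGITs ONE, TWO, THREE

my $unicode_digits = $input =~ /\A\d+\z/u ? 1 : 0;   # every character is \p{Nd}
my $ascii_digits   = $input =~ /\A\d+\z/a ? 1 : 0;   # none is in [0-9]
my $plain_ascii    = "123"  =~ /\A\d+\z/a ? 1 : 0;

say "under /u: $unicode_digits";
say "under /a: $ascii_digits";
say "'123' under /a: $plain_ascii";
```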

### /l — locale semantics

`\d`, `\w`, `\s`, case folding, and `[[:alpha:]]` follow the
current POSIX locale. Only sensible after `use locale`. Rarely
what you want — locale behaviour is implementation-specific and
harder to reason about than `/a` or `/u`.

### /d — default pre-5.14 semantics

Pre-Perl-5.14 rules: how `\w`, `\s`, and case folding treat code
points 128–255 depends on whether the target string (or the
pattern) is stored as UTF-8 internally. This is the source
of most "works on my machine" Unicode regexp bugs. Avoid `/d` in
new code. The default under `use v5.12` is `/u`.
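The dependence on internal storage can be made visible with `utf8::upgrade`, which changes how a string is stored without changing its characters. This sketch deliberately avoids `use v5.12` so that nothing but the explicit modifiers is in play:

```perl
use strict;
use warnings;

my $s = "\xE9";                              # e-acute as a single native byte

my $before = $s =~ /\w/d ? 1 : 0;            # native storage: ASCII-style rules
utf8::upgrade($s);                           # same character, UTF-8 storage
my $after  = $s =~ /\w/d ? 1 : 0;            # now Unicode rules apply

my $same_text = $s eq "\xE9" ? 1 : 0;        # the string still compares equal
my $with_u    = "\xE9" =~ /\w/u ? 1 : 0;     # /u does not care about storage

print "/d before upgrade: $before\n";
print "/d after upgrade:  $after\n";
print "same text: $same_text, /u on native string: $with_u\n";
```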

## Pragma interactions

Several pragmas affect default regexp behaviour:

```perl
use v5.12;                # implicit feature 'unicode_strings'
use feature 'unicode_strings';   # explicit
use re '/u';              # set default modifier for this scope
```

`use feature 'unicode_strings'` (included in `use v5.12`) makes
all strings Unicode for character-class purposes, regardless of
whether they are marked UTF-8 internally. Turn it on at the top
of every file.
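Both pragmas are lexically scoped, which the following sketch tries to show by matching the same string inside and outside small blocks. It assumes the file as a whole has no `use v5.12` or similar in effect.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";   # e-acute stored as a single native byte

# No charset pragma in scope: the pattern compiles with /d rules.
my $bare = $e_acute =~ /\w/ ? 1 : 0;

my ($with_feature, $with_re);
{
    use feature 'unicode_strings';
    $with_feature = $e_acute =~ /\w/ ? 1 : 0;
}
{
    use re '/u';
    $with_re = $e_acute =~ /\w/ ? 1 : 0;
}

# Outside both blocks the pragmas no longer apply.
my $outside = $e_acute =~ /\w/ ? 1 : 0;

print "bare=$bare feature=$with_feature re=$with_re outside=$outside\n";
```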

## Case folding

Under `/i` with `/u` semantics (the default charset under
`use v5.12` or later), case-insensitive matching uses the full
Unicode case-folding tables:

```perl
"Grüße" =~ /grüsse/iu;        # matches: ß folds to 'ss'
"Σίγμα" =~ /σίγμα/iu;          # matches: upper Σ ↔ lower σ
"İstanbul" =~ /istanbul/iu;   # does NOT match: İ (U+0130) folds to i + U+0307
```

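`/a` does not restrict case folding: under `/ia`, matching still
follows the Unicode folding rules, which is why the Kelvin sign
matches `k`. Under `/iaa`, an ASCII character never matches a
non-ASCII one, so the Kelvin sign no longer matches `k`.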
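The folding rules can be inspected directly with `fc`, which returns the full case fold of a string. The sketch below assumes Perl 5.16 or later (for `fc`) and writes the non-ASCII characters as escapes so it does not depend on `use utf8`; the hash keys are illustrative labels.

```perl
use v5.16;      # enables fc() and the unicode_strings feature
use warnings;

my $kelvin  = "\x{212A}";           # KELVIN SIGN
my $gruesse = "Gr\x{FC}\x{DF}e";    # u-umlaut and sharp s, written as escapes
my $dotted  = "\x{0130}stanbul";    # starts with LATIN CAPITAL LETTER I WITH DOT ABOVE

my %matched = (
    kelvin_i   => ($kelvin  =~ /k/i           ? 1 : 0),
    kelvin_iaa => ($kelvin  =~ /k/iaa         ? 1 : 0),
    eszett_i   => ($gruesse =~ /gr\x{FC}sse/i ? 1 : 0),
    dotted_i   => ($dotted  =~ /istanbul/i    ? 1 : 0),
);
say "$_: $matched{$_}" for sort keys %matched;

# fc() exposes the fold that /i compares against: U+0130 becomes two code points.
my $fold = join ' ', map { sprintf('U+%04X', ord $_) } split //, fc("\x{0130}");
say "fold of U+0130: $fold";
```

Because the fold of U+0130 is `i` followed by U+0307, a pattern that expects `s` straight after the `i` cannot match.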

## The \b{…} word-boundary variants

Beyond plain `\b`, Perl 5.22 and later offer semantic variants that
handle real-world text better (`\b{lb}` arrived in 5.24):

- `\b{wb}` — word boundary by Unicode "word break" rules. Handles
  apostrophes, hyphens, numerics-with-commas, and script-specific
  rules.
- `\b{sb}` — sentence boundary.
- `\b{lb}` — line break (suitable for wrapping).
- `\b{gcb}` — grapheme cluster boundary (the positions between successive `\X` matches).

```perl
"don't" =~ /.+?\b{wb}/x;      # matches whole word — apostrophe is inside
"don't" =~ /.+?\b/x;          # stops at the apostrophe — plain \b splits
```

For natural-language processing, `\b{wb}` and `\b{sb}` are almost
always what you want.
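A simple tokenizer illustrates the difference. This is a sketch assuming Perl 5.22 or later; filtering on `/\w/` to drop the whitespace pieces is a choice made for the example, not part of the boundary rules.

```perl
use v5.22;
use warnings;

my $text = "don't stop";

# Split at Unicode word boundaries, then keep pieces containing a word character.
my @wb_words = grep { /\w/ } split /\b{wb}/, $text;

# The same with plain \b, which also breaks on either side of the apostrophe.
my @b_words = grep { /\w/ } split /\b/, $text;

say "wb: ", join('|', @wb_words);
say "b:  ", join('|', @b_words);
```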

## Encoding vs. character

A Perl string is always a sequence of code points. The bytes on
disk are the I/O layer's concern: enable `use utf8` for source-code
literals, put `:encoding(UTF-8)` on filehandles, and call
`decode`/`encode` from Encode when crossing the boundary. Patterns
never see the encoded bytes as such; a string that was never
decoded simply looks like a run of code points 0–255, one per byte.

```perl
use utf8;            # source code is UTF-8
use feature 'unicode_strings';
use open ':std', ':encoding(UTF-8)';  # STDIN/STDOUT/STDERR

while (my $line = <>) {
    $line =~ /\p{Lu}/ and print "has uppercase letter\n";
}
```

Get the I/O right once, and patterns just work on characters for
the rest of the program.
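What goes wrong without decoding can be shown in a few lines. The sketch below uses the core Encode module on a hand-written byte string; the variable names are illustrative.

```perl
use v5.16;
use warnings;
use Encode qw(decode);

my $bytes = "caf\xC3\xA9";              # UTF-8 encoding of "cafe" with e-acute
my $chars = decode('UTF-8', $bytes);    # four characters, the last is U+00E9

# Undecoded, the last two bytes look like U+00C3 (a letter) and U+00A9 (a symbol).
my $bytes_all_letters = $bytes =~ /\A\p{L}+\z/ ? 1 : 0;
my $chars_all_letters = $chars =~ /\A\p{L}+\z/ ? 1 : 0;

printf "bytes: length %d, all letters: %d\n", length($bytes), $bytes_all_letters;
printf "chars: length %d, all letters: %d\n", length($chars), $chars_all_letters;
```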

## See also

- [`perlre`](../../p5/core/perlre) — full property reference.
- [`quotemeta`](../../p5/core/perlfunc/quotemeta) — Unicode-aware
  metacharacter escaping.
- The [character classes](character-classes) chapter — non-Unicode
  class basics.
- The [modifiers](modifiers) chapter — `/i`, `/x`, other modifiers
  that combine with charset modifiers.