# Unicode

Perl’s strings are Unicode strings. Every code point from
`\x{0000}` through `\x{10FFFF}` is a valid character, regardless
of the underlying byte encoding. Regexps operate on characters,
not bytes — so `\w`, `\d`, `\s`, `.`, and `\X` all understand
the Unicode world.

This chapter covers how to write Unicode characters in a
pattern, how to match by Unicode property, and how the *charset
modifiers* `/a`, `/u`, `/l`, `/d` tune the default
character-class behaviour. It closes with *Script Runs*, the
mechanism for rejecting strings that mix scripts when they
shouldn’t.

## Writing Unicode in a pattern

Four ways to denote a specific code point:

| Form       | Example                        | Notes                  |
|------------|--------------------------------|------------------------|
| `\xHH`     | `\x41` (A)                     | Two hex digits, 0–255  |
| `\x{…}`    | `\x{263a}` (☺)                 | Any code point         |
| `\o{…}`    | `\o{101}` (A)                  | Octal                  |
| `\N{name}` | `\N{GREEK SMALL LETTER SIGMA}` | Unicode character name |

```perl
/\x{263a}/;                 # WHITE SMILING FACE
/\N{GREEK SMALL LETTER SIGMA}/;  # σ
/\N{U+03C3}/;               # same codepoint via U+ form
```

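Case, whitespace, and underscores inside `\p{…}` are ignored.
`\N{…}` is stricter in stock Perl: the name must be spelled
exactly unless `use charnames ':loose'` is in effect, in which
case `\N{greek small letter sigma}` and
`\N{greek_small_letter_sigma}` are accepted as well.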

Inside a character class, the same escapes work:

```perl
/[\x{0370}-\x{03ff}]/;       # any Greek block character
/[\N{SECTION SIGN}\N{PILCROW SIGN}]/;
```

`\N{NAME}` (named character) is *not* the same as `\N` (any
character except `\n`; see the
[character classes](character-classes.md) chapter). The braces
are what distinguish the two.
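As a quick sanity check that the brace notations denote the same character, the sketch below compiles one pattern per form and tests each against a literal sigma. The hash keys and variable names are illustrative only, and the `\N{name}` form without an explicit `use charnames` assumes Perl 5.16 or later.

```perl
use strict;
use warnings;

# U+03C3 GREEK SMALL LETTER SIGMA written in four equivalent notations.
my %pattern = (
    hex    => qr/\x{3C3}/,
    octal  => qr/\o{1703}/,                      # 0x3C3 is 01703 in octal
    name   => qr/\N{GREEK SMALL LETTER SIGMA}/,
    u_plus => qr/\N{U+03C3}/,
);

my $sigma = "\x{3C3}";
my %matched;
for my $form (sort keys %pattern) {
    $matched{$form} = $sigma =~ $pattern{$form} ? 1 : 0;
    printf "%-6s %s\n", $form, $matched{$form} ? 'matches' : 'does not match';
}
```

All four patterns should report a match, since each compiles to the same single code point.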

## Matching by property: `\p{…}` and `\P{…}`

Unicode classifies every code point by dozens of properties.
The most useful in everyday patterns:

| Property          | Matches                                  |
|-------------------|------------------------------------------|
| `\p{L}`           | any letter                               |
| `\p{Ll}`          | lowercase letter                         |
| `\p{Lu}`          | uppercase letter                         |
| `\p{Lt}`          | titlecase letter                         |
| `\p{N}`           | any numeric character                    |
| `\p{Nd}`          | decimal digit                            |
| `\p{P}`           | punctuation                              |
| `\p{M}`           | mark (combining character)               |
| `\p{S}`           | symbol                                   |
| `\p{Z}`           | separator (including spaces)             |
| `\p{C}`           | control, format, unassigned, private-use |
| `\p{ASCII}`       | code point 0–127                         |
| `\p{White_Space}` | any Unicode whitespace                   |

`\P{X}` is the negation of `\p{X}` — matches any code point
*not* in property X.

Short single-letter properties drop the braces:

```perl
/\pL/;    # letter — same as \p{L}
/\pN/;    # number
/\pM/;    # mark
```

### Compound properties

Some properties have multiple values. The compound form
`\p{Property=Value}` (or `\p{Property:Value}`) spells them out:

```perl
/\p{General_Category=Uppercase_Letter}/;    # same as \p{Lu}
/\p{Block=Greek_and_Coptic}/;
/\p{Numeric_Type=Decimal}/;
/\p{Bidi_Class=R}/;                          # right-to-left characters
```

Case, spaces, and underscores are ignored in the property name
and value, so `\p{gc=lu}` is the same as
`\p{General_Category=Uppercase_Letter}`.

For the full list of properties and values, see
[`perluniprops`](https://perldoc.perl.org/perluniprops) — that
list is generated from Unicode data and is not duplicated here.
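To make the property tables concrete, here is a small sketch that counts characters by property in a mixed string. The sample text and variable names are made up for illustration; the string is written with escapes so the result does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# "Año 2024: café ١٢٣" spelled with escapes.
my $text = "A\x{F1}o 2024: caf\x{E9} \x{661}\x{662}\x{663}";

# The "= () =" idiom forces list context and yields the match count.
my $letters        = () = $text =~ /\p{L}/g;       # A ñ o c a f é
my $decimal_digits = () = $text =~ /\p{Nd}/g;      # 2 0 2 4 plus three Arabic-Indic digits
my $ascii_digits   = () = $text =~ /[0-9]/g;       # 2 0 2 4
my $non_ascii      = () = $text =~ /\P{ASCII}/g;   # ñ é and the Arabic-Indic digits

print "letters=$letters decimal=$decimal_digits ",
      "ascii_digits=$ascii_digits non_ascii=$non_ascii\n";
# prints "letters=7 decimal=7 ascii_digits=4 non_ascii=5"
```

Because `\p{…}` always uses Unicode rules, the counts are the same whichever charset modifier is in effect.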

## Matching by script

Scripts (writing systems) are properties too:

```perl
/\p{Latin}/;
/\p{Greek}/;
/\p{Cyrillic}/;
/\p{Arabic}/;
/\p{Han}/;            # Chinese, Japanese kanji, Korean hanja
/\p{Hiragana}/;
/\p{Katakana}/;
/\p{Hangul}/;
/\p{Script=Greek}/;   # compound form, same meaning
```

For linguistic classification, prefer
`\p{Script_Extensions=…}` — it handles code points used by
multiple scripts more sensibly than `\p{Script=…}`. Perl’s
`\p{Greek}` shorthand actually uses `Script_Extensions` under
the hood.
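One way to put script properties to work is to report which scripts occur in a string. The following is a minimal sketch under that assumption; `scripts_in` and `%SCRIPT_TEST` are hypothetical names, and only three scripts are checked to keep it short.

```perl
use strict;
use warnings;

# One precompiled test per script of interest.
my %SCRIPT_TEST = (
    Latin    => qr/\p{Script_Extensions=Latin}/,
    Greek    => qr/\p{Script_Extensions=Greek}/,
    Cyrillic => qr/\p{Script_Extensions=Cyrillic}/,
);

# Return the names of the tested scripts that occur in $string, sorted.
sub scripts_in {
    my ($string) = @_;
    my @found = grep { $string =~ $SCRIPT_TEST{$_} } sort keys %SCRIPT_TEST;
    return @found;
}

my @plain = scripts_in("paypal");
my @mixed = scripts_in("p\x{430}ypal");    # second letter is Cyrillic U+0430

print "plain: @plain\n";   # prints "plain: Latin"
print "mixed: @mixed\n";   # prints "mixed: Cyrillic Latin"
```

A per-character test like this only tells you which scripts are present; deciding whether the mixture is acceptable is what script runs (later in this chapter) are for.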

## Grapheme clusters: `\X`

A *grapheme cluster* is what users perceive as a single
character. A Danish `Å` can be one code point (`U+00C5`) or two
(`A` followed by the combining ring above, `U+030A`). Both look
identical on screen. `\X` matches either as one atomic
user-visible character.

```perl
my $s = "A\x{030A}";     # A + combining ring
length($s);              # 2  — code-point count
$s =~ /./;               # matches just the 'A'
$s =~ /\X/;              # matches the whole cluster
```

Use `\X` when iterating visibly across user-facing text: cursor
movement, column counting, truncation, anywhere «characters»
means «what the user sees».
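The difference between counting code points and counting clusters can be sketched as follows. The sample word and variable names are illustrative; the truncation to three clusters is just one plausible use.

```perl
use strict;
use warnings;

# "Ångström" with both accented letters written as base + combining mark.
my $word = "A\x{030A}ngstro\x{0308}m";

my $code_points = length $word;            # counts every code point
my @clusters    = $word =~ /(\X)/g;        # one element per user-visible character
my ($prefix)    = $word =~ /\A(\X{0,3})/;  # first three clusters, marks kept attached

printf "code points=%d clusters=%d prefix length=%d\n",
    $code_points, scalar @clusters, length $prefix;
# prints "code points=10 clusters=8 prefix length=4"
```

Truncating with `substr($word, 0, 1)` would instead split the first cluster and leave a bare `A`.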

## Extended bracketed character classes — `(?[…])`

The standard `[…]` form takes unions of characters; `(?[…])`
adds set-algebra operators that combine Unicode properties:

```perl
# Letters that are Greek:
/(?[ \p{Letter} & \p{Greek} ])/;

# Letters that are not Latin:
/(?[ \p{Letter} - \p{Latin} ])/;

# Hex digits without 'a' through 'f':
/(?[ [0-9A-Fa-f] - [a-f] ])/;
```

Operators: `+` (or `|`) for union, `&` for intersection, `-`
for difference, `^` for symmetric difference, `!` for
complement. The construct is itself a character class — it
matches one character — and quantifies normally. The
[character classes](character-classes.md) chapter covers it in
full.
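Two further set-algebra examples, sketched with hypothetical variable names. The `no warnings` line is there on the assumption that the code may run on a Perl older than 5.36, where `(?[…])` still warned as experimental.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';   # only needed before Perl 5.36

# A decimal digit that is not one of the ASCII digits.
my $non_ascii_digit = qr/\A(?[ \p{Nd} - [0-9] ])\z/;

# An uppercase letter that is also Greek.
my $greek_upper = qr/\A(?[ \p{Greek} & \p{Lu} ])\z/;

my %check = (
    arabic_indic_one => ("\x{661}" =~ $non_ascii_digit ? 1 : 0),   # 1
    ascii_seven      => ("7"       =~ $non_ascii_digit ? 1 : 0),   # 0
    capital_sigma    => ("\x{3A3}" =~ $greek_upper     ? 1 : 0),   # 1
    small_sigma      => ("\x{3C3}" =~ $greek_upper     ? 1 : 0),   # 0
    latin_s          => ("S"       =~ $greek_upper     ? 1 : 0),   # 0
);

print "$_ => $check{$_}\n" for sort keys %check;
```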

## The charset modifiers

The flags `/a`, `/u`, `/l`, `/d` control how the shorthand
classes (`\d`, `\w`, `\s`) and case folding behave. They are
orthogonal to `/i`, `/m`, `/s`, `/x`. At most one can be in
effect at a time; `/aa` is a refinement of `/a`.

### `/u` — Unicode semantics

`\d`, `\w`, `\s`, `\b`, case folding, and `[[:alpha:]]` use
full Unicode definitions. This is the default when `use v5.12`
or later is in effect.

```perl
"item\x{0660}" =~ /\w+\d/u;   # matches — U+0660 is an Arabic-Indic zero
"naïve" =~ /\w+/u;            # matches the whole word
```

Modern code should use `/u`. `use v5.12` (or any later version)
selects it automatically.

### `/a` — ASCII-only classes

Restricts `\d`, `\w`, `\s`, `[[:…:]]` to the ASCII range. Useful
when you *want* English-like text and no more:

```perl
"item\x{0660}" =~ /\w+\d/a;   # does not match — \d is [0-9] under /a
"item0"        =~ /\w+\d/a;   # matches
```
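A common place where the `/a` and `/u` difference matters is input validation. The sketch below uses two hypothetical helper functions to show it; it assumes the validated string will later be used as a number, which Perl only does for ASCII digits.

```perl
use strict;
use warnings;

# Accepts only strings made of the ASCII digits 0-9.
sub looks_like_ascii_integer {
    my ($input) = @_;
    return $input =~ /\A\d+\z/a ? 1 : 0;
}

# Accepts decimal digits from any script.
sub looks_like_any_integer {
    my ($input) = @_;
    return $input =~ /\A\d+\z/u ? 1 : 0;
}

my $arabic_indic = "\x{664}\x{662}";    # Arabic-Indic digits four and two

my $ascii_ok      = looks_like_ascii_integer("42");            # 1
my $ascii_rejects = looks_like_ascii_integer($arabic_indic);   # 0
my $any_accepts   = looks_like_any_integer($arabic_indic);     # 1

print "ascii_ok=$ascii_ok ascii_rejects=$ascii_rejects any_accepts=$any_accepts\n";
```

If the value is destined for arithmetic, the `/a` version is the safer gate, because a string of non-ASCII digits passes `/u` validation but does not convert to the number a reader would expect.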

### `/aa` — strict ASCII

`/aa` (doubled) additionally prevents case-insensitive matches
crossing the ASCII/non-ASCII boundary. Under plain `/ia`, `K`
could match `\x{212A}` (KELVIN SIGN) because the Kelvin sign
casefolds to `k`. Under `/iaa` it does not:

```perl
"K"        =~ /k/i;      # matches
"\x{212A}" =~ /k/i;      # matches — Kelvin sign casefolds to 'k'
"\x{212A}" =~ /k/iaa;    # does NOT match — ASCII-only i
```

`/aa` is the right choice when the input is supposed to be pure
ASCII and you want the engine to refuse any ASCII/non-ASCII
match, even under `/i`.

### `/l` — locale semantics

`\d`, `\w`, `\s`, case folding, and `[[:alpha:]]` follow the
current POSIX locale. Only sensible after `use locale`. Rarely
what you want — locale behaviour is implementation-specific and
harder to reason about than `/a` or `/u`. Provided for legacy
code that genuinely needs locale-dependent matching.

### `/d` — dual-mode semantics

Under `/d`, `\d`, `\w`, `\s` depend on whether the target string
is flagged as UTF-8 internally. This is the source of most «works
on my machine» Unicode regexp bugs. **Avoid `/d` in new code.**
The default under `use v5.12` is `/u`.

If you inherit code where `/d` matters, the failure mode is that
the same input string produces different match results
depending on whether some upstream operation set the string's
internal UTF-8 flag. The fix is `use feature 'unicode_strings'`
(or `use v5.12` or later), which makes the rules independent of
the UTF-8 flag.
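The failure mode can be reproduced in a few lines. This sketch writes `/d` explicitly so that it behaves the same regardless of which pragmas are in scope; the variable names are illustrative, and `utf8::upgrade` / `utf8::downgrade` are used only to force the two internal representations.

```perl
use strict;
use warnings;

my $native   = "\xE9";           # é, a code point below 256
my $upgraded = $native;
utf8::downgrade($native);        # force the byte representation
utf8::upgrade($upgraded);        # force the internal UTF-8 representation

my $same_string   = $native eq $upgraded       ? 1 : 0;   # 1: equal as strings
my $d_on_native   = $native   =~ /\A\w\z/d     ? 1 : 0;   # 0: ASCII rules apply
my $d_on_upgraded = $upgraded =~ /\A\w\z/d     ? 1 : 0;   # 1: Unicode rules apply
my $u_on_native   = $native   =~ /\A\w\z/u     ? 1 : 0;   # 1
my $u_on_upgraded = $upgraded =~ /\A\w\z/u     ? 1 : 0;   # 1

print "same=$same_string /d: $d_on_native,$d_on_upgraded /u: $u_on_native,$u_on_upgraded\n";
# prints "same=1 /d: 0,1 /u: 1,1"
```

The two strings compare equal, yet `/d` gives different answers for them; `/u` does not.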

### Which modifier is in effect?

Resolution order:

1. Explicit modifier on the regex (`/a`, `/u`, `/l`, `/d`).
2. `use re '/u'` (or other) in the lexical scope.
3. `use locale` selects `/l`.
4. `use feature 'unicode_strings'` (and `use v5.12+`) selects
   `/u`.
5. Otherwise, `/d` (the dual-mode fallback; avoid).

Programs should use `use v5.12;` (or later) at the top of each
file; this gives you `/u`, which is what you want.
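One way to see which modifier a pattern actually received is to stringify a `qr//` object: since Perl 5.14 the charset flag appears after the caret unless it is `/d`. The sketch below relies on that stringification format; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($fallback, $via_feature, $via_re);
{
    no feature 'unicode_strings';    # make sure no enclosing scope enabled it
    $fallback = qr/\w/;              # nothing selects a charset: /d
}
{
    use feature 'unicode_strings';
    $via_feature = qr/\w/;           # rule 4: /u
}
{
    use feature 'unicode_strings';
    use re '/a';
    $via_re = qr/\w/;                # rule 2 outranks rule 4: /a
}

print "$_\n" for $fallback, $via_feature, $via_re;
# prints:
#   (?^:\w)
#   (?^u:\w)
#   (?^a:\w)
```

Printing a suspicious pattern this way is a cheap first step when two files disagree about what `\w` matches.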

## Pragma interactions

```perl
use v5.12;                       # implicit feature 'unicode_strings'
use feature 'unicode_strings';   # explicit
use re '/u';                     # set default modifier for this scope
```

`use feature 'unicode_strings'` (included in `use v5.12`) makes
all strings Unicode for character-class purposes, regardless of
whether they are marked UTF-8 internally. Turn it on at the top
of every file.

## Case folding

Under `/iu` (the default), case-insensitive matching uses the
full Unicode case-folding tables:

```perl
"Grüße" =~ /grüsse/iu;        # matches: ß folds to 'ss'
"Σίγμα" =~ /σίγμα/iu;          # matches: upper Σ ↔ lower σ
"İstanbul" =~ /istanbul/iu;   # does NOT match: İ (U+0130) folds to 'i' + U+0307
```

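Under `/ia`, case folding still follows the Unicode rules, which
is why `k` matches the Kelvin sign there. Under `/iaa`, an ASCII
character never matches a non-ASCII one case-insensitively (the
Kelvin/K rule above).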

The folding happens at compile time — pperl does not lowercase
the input string at match time.
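The folding rules above can be checked side by side. This is a small sketch with illustrative variable names; every string is written with escapes and the charset modifier is explicit, so the results do not depend on the surrounding pragmas.

```perl
use strict;
use warnings;

# Multi-character fold works in both directions under /iu.
my $eszett_in_string  = "Stra\x{DF}e" =~ /\Astrasse\z/iu     ? 1 : 0;   # 1
my $eszett_in_pattern = "STRASSE"     =~ /\Astra\x{DF}e\z/iu ? 1 : 0;   # 1

# KELVIN SIGN against ASCII k: allowed under /ia, refused under /iaa.
my $kelvin_ia  = "\x{212A}" =~ /\Ak\z/ia  ? 1 : 0;   # 1
my $kelvin_iaa = "\x{212A}" =~ /\Ak\z/iaa ? 1 : 0;   # 0

# U+0130 folds to "i" followed by U+0307, so a bare "i" is not enough.
my $dotted_i = "\x{130}" =~ /\Ai\z/iu ? 1 : 0;       # 0

print "eszett: $eszett_in_string,$eszett_in_pattern ",
      "kelvin: $kelvin_ia,$kelvin_iaa dotted_i: $dotted_i\n";
```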

## The `\b{…}` word-boundary variants

Beyond plain `\b`, Perl offers semantic variants that handle
real-world text better:

| Boundary   | Meaning                                                      |
|------------|--------------------------------------------------------------|
| `\b{wb}`   | Unicode word boundary (UAX #29): keeps `don't` as one word   |
| `\b{sb}`   | sentence boundary                                            |
| `\b{lb}`   | line break opportunity (suitable for line wrapping)          |
| `\b{gcb}`  | grapheme cluster boundary (the boundaries that delimit `\X`) |

```perl
"don't" =~ /.+?\b{wb}/x;      # matches whole word — apostrophe is inside
"don't" =~ /.+?\b/x;          # stops at the apostrophe — plain \b splits
```

For natural-language processing, `\b{wb}` and `\b{sb}` are
almost always what you want. Plain `\b` is for ASCII identifier
contexts.

The [anchors and assertions](anchors-and-assertions.md) chapter
has the broader boundary discussion; this chapter is the home
for the Unicode-aware variants because their definitions are
Unicode-driven.
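A typical use of `\b{wb}` is splitting text into words. The sketch below compares it with plain `\b` on a sentence containing apostrophes; it assumes Perl 5.22 or later, where `\b{wb}` was introduced, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "Don't panic, it's fine.";

# Split at every boundary, then keep only the pieces that contain a word character.
my @wb_words    = grep { /\w/ } split /\b{wb}/, $text;
my @plain_words = grep { /\w/ } split /\b/,     $text;

print join('|', @wb_words),    "\n";   # prints "Don't|panic|it's|fine"
print join('|', @plain_words), "\n";   # prints "Don|t|panic|it|s|fine"
```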

## Script runs

Different writing systems contain visually similar characters.
The Latin `a` and the Cyrillic `а` (U+0430) are indistinguishable
in most fonts, and the Greek `α` (U+03B1) is close. Mixed in one
string, typically a URL or identifier, they enable a
*homograph attack*:

> paypal.com

This could be all Latin (legitimate), or it could contain a
Cyrillic `а` embedded in otherwise-Latin text (a phishing attack
pointing to a different domain). Both versions render
identically in the browser's address bar; their code points
differ.

A *script run* is a sequence of characters all from the same
Unicode script (per `Script_Extensions` and Unicode UTS 39).
Most legitimate words are script runs; mixed-script strings are
rare outside specific multilingual contexts and almost always
suspicious.

Perl provides two constructs, each with a long and a short
spelling, to require that a matched sub-pattern be a script run:

| Form                                      | Meaning                                      |
|-------------------------------------------|----------------------------------------------|
| `(*script_run:PAT)` / `(*sr:PAT)`         | match `PAT`, then check it is a script run   |
| `(*atomic_script_run:PAT)` / `(*asr:PAT)` | same, but atomic (faster, less backtracking) |

```perl
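# Match a domain label only if the whole label is a script run.
# The \A...\z anchors matter: unanchored, any single-script
# substring (even one character) would satisfy the pattern.
$label =~ /\A(*sr: \w+ )\z/x or warn "mixed-script label";

# Atomic version: faster on adversarial input.
$label =~ /\A(*asr: \w+ )\z/x or warn "mixed-script label";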
```

The atomic form is what you want in most production code: once
`PAT` has matched, the engine will not backtrack into it to try
shorter candidates when the script-run check fails, which keeps
matching time predictable on malicious input. `(*sr:…)` is the
non-atomic form; `(*sr:(?>PAT))` is the same as `(*asr:PAT)`.
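Wrapped in a function, a label check might look like the sketch below. `is_single_script_label` is a hypothetical helper, not part of any library, and the code assumes Perl 5.28 or later (5.28 and 5.30 additionally emit an "experimental" warning for script runs).

```perl
use strict;
use warnings;

# True when the entire label consists of word characters from a single script.
sub is_single_script_label {
    my ($label) = @_;
    return $label =~ /\A(*asr: \w+ )\z/x ? 1 : 0;
}

my $latin    = is_single_script_label("paypal");                             # 1
my $spoofed  = is_single_script_label("p\x{430}ypal");                       # 0: Cyrillic U+0430 among Latin
my $cyrillic = is_single_script_label("\x{43F}\x{43E}\x{447}\x{442}\x{430}"); # 1: all Cyrillic

print "latin=$latin spoofed=$spoofed cyrillic=$cyrillic\n";
# prints "latin=1 spoofed=0 cyrillic=1"
```

Note that an all-Cyrillic label passes: a script run rejects mixtures, not non-Latin text.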

### Script run rules

A sequence is a script run if and only if all of the following
hold:

1. **No code point in the sequence has the `Script_Extensions`
   property `Unknown`.** This excludes private-use and
   surrogate code points; you cannot smuggle them through.
2. **All characters come from the Common script, the
   Inherited script, and at most one other script.**
3. **All decimal digits come from the same set of ten.** Many
   scripts have their own digits (Arabic-Indic, Devanagari,
   etc.); a string mixing `1` (ASCII) with `١` (Arabic-Indic
   one) is *not* a script run, even if other characters are
   compatible.

Three pseudo-scripts get special handling:

- **Common** — punctuation, ASCII digits 0–9, mathematical
  symbols, emoji, full-width digits. These can appear in any
  script run *except* that all decimal digits in the sequence
  must still come from the same set of ten.
- **Inherited** — combining marks (accents, diacritics). These
  attach to the previous character and inherit its script.
- **Unknown** — unassigned code points and similar. A
  single-character string of unknown is allowed; a longer string
  containing one is not.

The relaxation for Common and Inherited is what allows
real-world text to be a script run despite punctuation and
combining marks. The strict-equality rule for digits is what
defeats homograph attacks that mix digits from look-alike
scripts.
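The digit rule is easy to demonstrate in isolation. The following sketch compares a plain `\d+` check with a script-run check on three inputs; the pattern and variable names are illustrative, and Perl 5.28 or later is assumed.

```perl
use strict;
use warnings;

my $any_digits      = qr/\A\d+\z/u;          # any mixture of decimal digits
my $one_set_of_ten  = qr/\A(*sr:\d+)\z/u;    # digits from a single set of ten

my $ascii  = "2024";
my $arabic = "\x{662}\x{660}\x{662}\x{664}";   # Arabic-Indic 2 0 2 4
my $mixed  = "20\x{662}\x{664}";               # ASCII 2 0, then Arabic-Indic 2 4

my %accepted = (
    plain_ascii  => ($ascii  =~ $any_digits     ? 1 : 0),   # 1
    plain_mixed  => ($mixed  =~ $any_digits     ? 1 : 0),   # 1
    run_ascii    => ($ascii  =~ $one_set_of_ten ? 1 : 0),   # 1
    run_arabic   => ($arabic =~ $one_set_of_ten ? 1 : 0),   # 1
    run_mixed    => ($mixed  =~ $one_set_of_ten ? 1 : 0),   # 0
);

print "$_ => $accepted{$_}\n" for sort keys %accepted;
```

Only the script-run pattern rejects the mixed string, even though every character in it is a decimal digit.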

### When script runs help

- **URL parsing**: refuse domain labels that are not script
  runs.
- **Identifier validation**: programming-language identifiers
  should normally be script runs.
- **Form input sanitisation**: real names are script runs;
  mixed-script names are usually attacks or test data.

Script runs arrived in Perl 5.28 (2018). They let a pattern
express something the rest of the regex language cannot state
directly: a property of the *whole* match, not of any single
character.

## Cross-engine: Unicode coverage

| Feature                    | Perl 5.42   | PCRE2                 | Emacs (limited)   | POSIX   | RE2 / Go                       |
|----------------------------|-------------|-----------------------|-------------------|---------|--------------------------------|
| `\p{General_Category}`     | yes         | yes                   | partial           | no      | yes                            |
| `\p{Script}`               | yes         | yes                   | partial           | no      | yes                            |
| `\p{Property=Value}`       | yes         | yes                   | partial           | no      | no                             |
| `\X` grapheme cluster      | yes         | yes                   | no                | no      | no                             |
| `\b{wb}`, `\b{sb}`         | yes         | no                    | no                | no      | no                             |
| Script Runs `(*sr:…)`      | yes         | yes (recent releases) | no                | no      | no                             |
| Default to Unicode `\d/\w` | under `/u`  | no (UCP option)       | no                | no      | no (`\d`, `\w` are ASCII-only) |

Perl leads here: the `\b{wb}` family is essentially unique to
Perl within this comparison set, and script runs are otherwise
available only in recent PCRE2 releases. PCRE2 and RE2 / Go both
default to ASCII for `\d`, `\w`, `\s`; PCRE2 can switch them to
Unicode with its UCP option, while RE2 / Go keep them ASCII and
rely on `\p{…}` for Unicode classes. PCRE2 has property support
and `\X` but lacks the boundary variants. Emacs has property
support that varies by build.

The full table is in the [cross-engine](cross-engine.md) chapter.

## Encoding vs. character

A Perl string is always a sequence of code points. The bytes on
disk are the I/O layer's concern: put `use utf8` at the top of
files whose source-code literals are UTF-8, open filehandles
with `:encoding(UTF-8)`, and call `decode`/`encode` from
`Encode` when crossing the boundary by hand. Once input has been
decoded, patterns operate on code points, never on the encoded
bytes.

```perl
use utf8;            # source code is UTF-8
use feature 'unicode_strings';
use open ':std', ':encoding(UTF-8)';  # STDIN/STDOUT/STDERR

while (my $line = <>) {
    $line =~ /\p{Lu}/ and print "has uppercase letter\n";
}
```

Get the I/O right once, and patterns just work on characters
for the rest of the program.
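When the bytes arrive without an encoding layer (a raw socket, a binary file), the boundary has to be crossed by hand. The sketch below shows the effect of decoding on length and matching; the sample bytes and variable names are illustrative, and `Encode` is the core module named above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "caf\xC3\xA9";               # "café" as UTF-8 bytes, not yet decoded
my $chars = decode('UTF-8', $bytes);     # now four code points: c a f é

my $byte_length = length $bytes;                           # 5
my $char_length = length $chars;                           # 4
my $bytes_match = $bytes =~ /\A\p{L}{4}\z/ ? 1 : 0;        # 0: five "characters", one a symbol
my $chars_match = $chars =~ /\A\p{L}{4}\z/ ? 1 : 0;        # 1
my $round_trip  = encode('UTF-8', $chars) eq $bytes ? 1 : 0;   # 1

print "bytes=$byte_length chars=$char_length ",
      "match: $bytes_match,$chars_match round_trip=$round_trip\n";
# prints "bytes=5 chars=4 match: 0,1 round_trip=1"
```

The undecoded string still matches patterns, just not the way you intended: each byte is treated as its own code point.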

## See also

- The [character classes](character-classes.md) chapter —
  non-Unicode class basics, `(?[…])`.
- The [anchors and assertions](anchors-and-assertions.md) chapter —
  `\b`, `\B`, plus the `\b{…}` family in their boundary role.
- The [modifiers](modifiers.md) chapter — `/i`, `/x`, `/m`, `/s`,
  and how they combine with charset modifiers.
- The [cross-engine](cross-engine.md) chapter — Unicode coverage
  across regex engines.
- [`quotemeta`](../../p5/core/perlfunc/quotemeta.md) —
  Unicode-aware metacharacter escaping.