---
name: Regex and Unicode properties
---
# Regex and properties

Once your strings are decoded, regex works on characters — and
Perl's regex engine knows about Unicode. `\w` matches every Unicode
word character, case-insensitive matching folds across scripts, and
`\p{…}` gives you access to the full Unicode character database.
This chapter covers the habits you need to make that work reliably.

## Characters, not bytes

On a text string, every regex metacharacter speaks in characters:

```perl
use utf8;
my $s = "café";                          # 4 characters
$s =~ /^.{4}$/;                          # matches
$s =~ /^.{5}$/;                          # does not
```

On a binary string, the same metacharacters count bytes. That is
usually not what you want — if you find yourself writing a regex
against bytes that hold text, decode first.
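
As a rough illustration of the difference, the sketch below writes the UTF-8 bytes for `café` by hand and decodes them with the core `Encode` module. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café" as raw UTF-8 bytes: é is the two bytes 0xC3 0xA9.
my $bytes = "caf\xC3\xA9";
my $text  = decode('UTF-8', $bytes);              # 4-character text string

my $bytes_len     = length $bytes;                # 5, counted in bytes
my $text_len      = length $text;                 # 4, counted in characters
my $bytes_match_5 = $bytes =~ /^.{5}$/ ? 1 : 0;   # 1: five bytes
my $text_match_4  = $text  =~ /^.{4}$/ ? 1 : 0;   # 1: four characters
my $text_match_5  = $text  =~ /^.{5}$/ ? 1 : 0;   # 0
```

The same pattern gives different answers before and after decoding, which is why the decode step belongs at the input boundary.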

## `\w`, `\d`, `\s` and friends

In default (Unicode) mode, these shorthands match their Unicode
equivalents:

- `\w` — letters, digits, and underscore from every script.
- `\d` — decimal digits from every script (not just `0`–`9`).
- `\s` — whitespace of every kind, including U+00A0 no-break space.

```perl
use utf8;
"café"  =~ /^\w+$/;                      # matches
"café"  =~ /^[a-z]+$/;                   # does not — é is outside
"\x{0660}\x{0661}\x{0662}" =~ /^\d+$/;   # matches Arabic-Indic digits
```

When you want the old ASCII-only meaning, add the `/a` modifier —
see *Modifiers* below.
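
One practical consequence, sketched here with illustrative variable names: a `\d+` check accepts digits from any script, while an explicit `[0-9]` class does not. The characters are written as hex escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

my $arabic   = "\x{0660}\x{0661}\x{0662}";   # ARABIC-INDIC DIGIT ZERO, ONE, TWO
my $em_space = "\x{2003}";                   # EM SPACE

my $matches_d     = $arabic   =~ /^\d+$/    ? 1 : 0;   # 1: \d covers every script
my $matches_range = $arabic   =~ /^[0-9]+$/ ? 1 : 0;   # 0: explicit ASCII range
my $matches_s     = $em_space =~ /^\s$/     ? 1 : 0;   # 1: \s covers Unicode whitespace
```

If the value is later used in arithmetic, the explicit range (or `/a`) is usually the safer check, since Perl's numeric conversion only understands ASCII digits.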

## The `\p{…}` property classes

`\p{PROPERTY}` matches any character with a named Unicode property.
The property names are drawn from the Unicode character database;
the commonly useful ones are:

- `\p{Letter}`, shorthand `\p{L}` — any letter.
- `\p{Lu}`, `\p{Ll}`, `\p{Lt}` — upper, lower, title case letters.
- `\p{Number}`, `\p{N}`, `\p{Nd}` — all numbers, decimal digits.
- `\p{Punct}`, `\p{P}` — punctuation.
- `\p{Space}`, `\p{Zs}` — whitespace, space separators.
- `\p{ASCII}` — the 128 ASCII code points.
- `\p{Script=Greek}` — every character that belongs to the Greek
  script. Any script name from the Unicode data works here.
- `\p{Block=Cyrillic}` — every character in the Cyrillic block.
  (Script and block are different: *script* is the writing system
  a character belongs to; *block* is the code-point range it lives
  in.)

`\P{…}` is the negation — any character without that property.

```perl
use utf8;
my $s = "Γειά σου, κόσμε";
my @greek_letters = $s =~ /(\p{Script=Greek})/g;
scalar @greek_letters;                   # 12
```

The full catalogue of properties ships with Perl; `perldoc
perluniprops` enumerates every name the engine accepts.
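
The script/block distinction and the `\P{…}` negation can be sketched as follows. The sample character U+0500 is chosen because it belongs to the Cyrillic script but lives in the Cyrillic Supplement block; the sample string and variable names are invented for the example.

```perl
use strict;
use warnings;
use utf8;

# CYRILLIC CAPITAL LETTER KOMI DE: Cyrillic script, Cyrillic Supplement block.
my $komi_de = "\x{0500}";

my $in_script = $komi_de =~ /\p{Script=Cyrillic}/ ? 1 : 0;   # 1
my $in_block  = $komi_de =~ /\p{Block=Cyrillic}/  ? 1 : 0;   # 0: different block

my $mixed     = "naïve Ωmega 42";
my @non_ascii = $mixed =~ /(\P{ASCII})/g;   # ("ï", "Ω"): everything outside ASCII
my @upper     = $mixed =~ /(\p{Lu})/g;      # ("Ω"): the only uppercase letter
```

For most text-processing tasks the script property is the one that matches intent; block membership is mainly useful when you care about code-point ranges.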

## Modifiers: `/a`, `/u`, `/l`, `/aa`

Four modifiers change how character classes and case folding behave:

- `/u` — **Unicode mode.** `\w` matches Unicode word characters,
  `\d` matches Unicode decimal digits, case folding uses Unicode
  rules. This is the default on text strings.
- `/a` — **ASCII mode.** `\w` becomes `[A-Za-z0-9_]`, `\d` becomes
  `[0-9]`, `\s` becomes `[ \t\n\r\f\x0B]` (the vertical tab is
  included from Perl 5.18 on). Case folding under `/i` is still
  Unicode: `/k/ai` matches U+212A KELVIN SIGN, because that character
  folds to `k`. That is almost never what you want when you reached
  for `/a`.
- `/aa` — **strict ASCII.** `/a` plus a guarantee that an ASCII
  character never case-folds to a non-ASCII one, or the reverse.
  `/k/aai` matches `K` but not U+212A KELVIN SIGN. Use this when you
  want an identifier match against a known-ASCII specification
  (HTTP header names, command keywords).
- `/l` — **locale mode.** Defers to the current POSIX locale.
  Almost never what you want in a Unicode-aware program; mentioned
  only so you can recognise it when another codebase uses it.

```perl
use utf8;
"café" =~ /\w+/;                         # matches whole word (Unicode)
"café" =~ /\w+/a;                        # matches "caf" (ASCII only)
"\x{212A}" =~ /k/ai;                     # matches: KELVIN SIGN folds to k
"\x{212A}" =~ /k/aai;                    # does not: strict ASCII folding
```

Rule of thumb: default mode is right for text; switch to `/aa`
when you are parsing an ASCII-specified protocol and want to reject
smuggled non-ASCII characters.
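
A minimal sketch of that rule of thumb, assuming a hypothetical helper `is_keep_alive` that checks a protocol keyword. U+212A KELVIN SIGN is used as the smuggled character because it case-folds to an ASCII `k` under Unicode rules.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only the keyword "keep-alive",
# case-insensitively, with ASCII letters folding only to ASCII letters.
sub is_keep_alive {
    my ($value) = @_;
    return $value =~ /\Akeep-alive\z/aai ? 1 : 0;
}

my $plain    = is_keep_alive("Keep-Alive");             # 1
my $smuggled = is_keep_alive("\x{212A}eep-alive");      # 0: KELVIN SIGN rejected

# The same input under plain /i, for comparison.
my $loose = "\x{212A}eep-alive" =~ /\Akeep-alive\z/i ? 1 : 0;   # 1: Unicode folding
```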

## Case-insensitive matching

`/i` folds both sides of the match using Unicode case-folding
tables. This handles the obvious European pairs (`é`/`É`,
`ß`/`SS`), and the less obvious ones (`İ`/`i̇`, `ﬁ`/`fi`).

```perl
use utf8;
"Ångström" =~ /ångström/i;               # matches
"STRASSE"  =~ /straße/i;                 # matches: ß case-folds to ss
```

To keep ASCII letters from folding to non-ASCII characters (close to
the old ASCII meaning of `/i`), use `/iaa`.
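
The contrast between the two folding modes can be sketched with the `ß` example. The variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;

my $unicode_fold = "STRASSE" =~ /straße/i    ? 1 : 0;   # 1: ß folds to ss
my $strict_fold  = "STRASSE" =~ /straße/iaa  ? 1 : 0;   # 0: ß may not match ASCII letters
my $ascii_fold   = "STRASSE" =~ /strasse/iaa ? 1 : 0;   # 1: ASCII letters still fold
```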

## Named code points

`\N{…}` names a Unicode character by its official name:

```perl
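use utf8;                                       # the literals below are UTF-8 source
use charnames ':full';                          # explicit import, needed before Perl 5.16
"\N{LATIN SMALL LETTER E WITH ACUTE}" eq "é";   # true
"\N{U+2014}" eq "—";                            # true: em dash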
```

`\N{U+HEX}` works everywhere. The long names load automatically from
Perl 5.16 on; older versions need an explicit
`use charnames qw(:full)`. `use charnames qw(:loose)` additionally
matches names loosely, ignoring case and most spaces, underscores,
and hyphens. In a pattern, `\N{…}` is useful when you want the
source to document which character you mean without relying on the
reader to recognise a hex escape.
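
As a small sketch of `\N{…}` inside a pattern, the example below splits a line on an em dash. The sample string and variable names are invented for the illustration.

```perl
use strict;
use warnings;
use charnames ':full';    # explicit import for Perls older than 5.16

# U+2014 written as an escape so the example does not depend on file encoding.
my $line = "Perl \x{2014} Practical Extraction and Report Language";

# Split on an em dash with optional surrounding whitespace.
my ($term, $expansion) = split /\s*\N{EM DASH}\s*/, $line, 2;
```

Compared with `\x{2014}` in the pattern, the named form tells the reader which dash is meant without a code chart lookup.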

## Splitting and substrings

[`split`](../../p5/core/perlfunc/split) and
[`substr`](../../p5/core/perlfunc/substr) both count in the units
of the string they operate on. On a text string, the count is
characters:

```perl
use utf8;
my $s = "a,é,b,Ω";
my @parts = split /,/, $s;               # ("a", "é", "b", "Ω")
substr($s, 2, 1);                        # "é"
```

On a binary string, the same operations count bytes. Mixing the two
is a common source of subtle bugs: in an older codebase, one function
decodes the input while another still indexes the result by byte
offset, because the indexing code was written before the decode step
was introduced.
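
The mismatch can be sketched by applying the same `substr` call to the raw bytes and to the decoded text. The byte string is written out by hand and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "a,é,b,Ω" as UTF-8 bytes: é is C3 A9, Ω is CE A9.
my $bytes = "a,\xC3\xA9,b,\xCE\xA9";
my $text  = decode('UTF-8', $bytes);

my $char_at_2 = substr($text,  2, 1);   # "é": one whole character
my $byte_at_2 = substr($bytes, 2, 1);   # "\xC3": only the first byte of é

my @parts = split /,/, $text;           # four fields, each a whole character
```

The byte-level `substr` returns half of a character, which typically surfaces later as mojibake or a decoding error far from the line that caused it.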