Character classes#

A character class matches exactly one character, chosen from a set you define. Where a literal a matches only a, the class [abc] matches any one of a, b, or c.

/cat/;               # matches 'cat'
/[bcr]at/;           # matches 'bat', 'cat', or 'rat'
/item[0123456789]/;  # matches 'item0' through 'item9'

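The class still consumes exactly one character of the string. [bcr]at does not match at (no letter precedes it), and it cannot match brat as a whole, because the class accounts for only one of the two leading letters; unanchored, the pattern finds the rat inside brat instead.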
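
The one-character rule is easiest to check with the pattern anchored to the whole string. The sketch below is illustrative only: the word list is made up, and the ^ and $ anchors are an addition (anchors are covered in their own chapter).

```perl
use strict;
use warnings;

# Illustrative inputs; ^ and $ force the pattern to cover the whole string.
my @words = qw(bat cat rat at brat);

# Keep only words that are exactly one class character followed by 'at'.
my @whole = grep { /^[bcr]at$/ } @words;

print "@whole\n";   # prints "bat cat rat"
```

Both at and brat are rejected: the first has no character for the class to consume, the second has one character too many.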

Ranges#

Inside […], a dash between two characters denotes a contiguous range in the underlying character set:

/[0-9]/;         # any ASCII digit
/[a-z]/;         # any ASCII lowercase letter
/[a-zA-Z]/;      # any ASCII letter
/[0-9a-fA-F]/;   # any hex digit

Ranges can be combined with individual characters:

/[0-9bx-z]aa/;   # matches '0aa'..'9aa', 'baa', 'xaa', 'yaa', 'zaa'

A dash that is first or last inside the class is literal:

/[-ab]/;         # matches '-', 'a', or 'b'
/[ab-]/;         # same
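
A quick way to confirm the range and dash rules is to run them against a handful of strings. This is a minimal sketch; the sample words and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Illustrative inputs; the anchors force a whole-string match.
my @words = qw(0aa 9aa baa caa xaa zaa waa);
my @hits  = grep { /^[0-9bx-z]aa$/ } @words;

# A dash at either edge of the class is a literal character.
my $dash_first = ('-' =~ /^[-ab]$/) ? 1 : 0;
my $dash_last  = ('-' =~ /^[ab-]$/) ? 1 : 0;

# Between two characters the dash forms a range, so '-' itself is not in [a-c].
my $dash_in_range = ('-' =~ /^[a-c]$/) ? 1 : 0;

print "hits: @hits\n";                                              # hits: 0aa 9aa baa xaa zaa
print "dash first/last/range: $dash_first $dash_last $dash_in_range\n";  # 1 1 0
```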

Negation#

A caret ^ as the first character inside […] inverts the class:

/[^a]/;          # any character except 'a'
/[^0-9]/;        # any non-digit

A caret elsewhere is literal:

/[a^]/;          # matches 'a' or '^'

A negated class still matches one character - [^a] does not match the empty string; it requires one non-a.
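
The difference between "not a" and "nothing" can be checked directly. A small sketch with made-up inputs; each variable holds 1 if the pattern matched and 0 otherwise.

```perl
use strict;
use warnings;

my $empty   = (''    =~ /[^a]/) ? 1 : 0;   # nothing for the class to consume
my $only_a  = ('aaa' =~ /[^a]/) ? 1 : 0;   # every character is excluded
my $has_b   = ('aab' =~ /[^a]/) ? 1 : 0;   # 'b' satisfies the class
my $literal = ('^'   =~ /[a^]/) ? 1 : 0;   # caret not in first position is literal

print "$empty $only_a $has_b $literal\n";  # prints "0 0 1 1"
```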

Special characters inside a class#

Inside […] the special set shrinks to - ] \ ^ $ (and the pattern delimiter). The others - ., *, +, ?, (, ), {, }, | - are literals in a class:

/[.+*]/;         # matches a literal '.', '+', or '*'
/[()]/;          # matches '(' or ')'

To match ] inside the class, either escape it or put it first (after any leading ^):

/[\]]/;          # matches ']'
/[]ab]/;         # matches ']', 'a', or 'b'

$ and \ are slightly awkward because they interact with interpolation and escaping:

my $x = 'bcr';
/[$x]at/;        # matches 'bat', 'cat', or 'rat' - interpolated
/[\$x]at/;       # matches '$at' or 'xat' - '$' is literal
/[\\$x]at/;      # matches '\at', 'bat', 'cat', or 'rat' - literal '\' plus interpolated $x

\b inside a character class means backspace (\x08), not “word boundary”. Outside a class it is the word boundary assertion. This is the single most common dual-meaning trap in the regex syntax - see the anchors and assertions chapter for the boundary form.
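
The two meanings of \b can be told apart with a string that contains a backspace and one that does not. A minimal sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $backspace = "\x08";

# Inside a class, \b is the backspace character.
my $class_on_backspace = ($backspace =~ /[\b]/) ? 1 : 0;   # 1
my $class_on_text      = ('a b'      =~ /[\b]/) ? 1 : 0;   # 0: no backspace present

# Outside a class, \b is the zero-width word boundary.
my $boundary_on_text      = ('a b'      =~ /a\b/) ? 1 : 0; # 1: boundary after 'a'
my $boundary_on_backspace = ($backspace =~ /\b/)  ? 1 : 0; # 0: no word character at all

print "$class_on_backspace $class_on_text $boundary_on_text $boundary_on_backspace\n";
```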

Shorthand classes#

Several common classes have shorthand names. Most are usable both inside and outside […]; \R and bare \N work only outside:

Shorthand	Matches
`\d`	a digit
`\D`	a non-digit
`\w`	a word character (alphanumeric or `_`)
`\W`	a non-word character
`\s`	whitespace (space, tab, `\r`, `\n`, `\f`, more)
`\S`	non-whitespace
`\h`	horizontal whitespace (space, tab, Unicode)
`\H`	non-horizontal-whitespace
`\v`	vertical whitespace (`\n`, `\r`, `\f`, …)
`\V`	non-vertical-whitespace
`\R`	linebreak: `\r\n`, `\n`, `\v`, `\f`, `\x{85}`, …
`\N`	any character except `\n` (regardless of `/s`)

Under Unicode (the default), \d, \w, \s match more than just ASCII. \d matches any Unicode decimal digit (Devanagari digits, Arabic-Indic digits, and many more), \w matches letters and digits in any script plus marks and connector punctuation, and \s adds Unicode space characters such as non-breaking space.

To restrict these to ASCII, add the /a modifier or use explicit ranges like [0-9] and [A-Za-z_0-9].

"item0" =~ /\w\w\w\w\d/;     # matches
"abc\x{0660}" =~ /\w\w\w\d/; # matches: U+0660 is an Arabic-Indic zero
"abc\x{0660}" =~ /\w\w\w\d/a;# does not match under /a

\R is the linebreak shorthand - it matches any of the recognised line-break sequences as one token. Useful for parsing text that may have CRLF, LF, or rarer line terminators interchangeably. Unlike a class, \R may match two characters (the CRLF case) and so cannot appear inside […].

\N (uppercase) means “any character except \n”, and is not affected by the /s modifier. This is the dual meaning to be careful of: \N{NAME} (with brace) is a Unicode named-character escape (see the unicode chapter); bare \N is the non-newline class.
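
To see why \R is handy, compare it with a hand-rolled class on text that mixes line terminators. This is a sketch with invented sample text, not a recommendation for any particular parser.

```perl
use strict;
use warnings;

# Illustrative text mixing CRLF, LF, and a bare CR as line terminators.
my $text = "one\r\ntwo\nthree\rfour";

# \R treats CRLF as a single separator.
my @lines = split /\R/, $text;

# A plain class sees CR and LF separately, leaving an empty field between them.
my @naive = split /[\r\n]/, $text;

# Bare \N never matches a newline, even under /s.
my $n_plain   = ("a-b"  =~ /a\Nb/)  ? 1 : 0;   # 1
my $n_under_s = ("a\nb" =~ /a\Nb/s) ? 1 : 0;   # 0

print scalar(@lines), " lines via \\R, ", scalar(@naive), " fields via [\\r\\n]\n";
```

With this input, split on \R yields four lines, while the class-based split yields five fields because of the empty string inside the CRLF pair.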

The period#

. matches any single character except newline. Under the /s modifier (covered in the modifiers chapter), . also matches newline:

"a\nb" =~ /a.b/;        # does not match
"a\nb" =~ /a.b/s;       # matches

When you want “any character including newline” without /s, the classic idiom is [\s\S] (or [\d\D]):

"a\nb" =~ /a[\s\S]b/;   # matches without /s

The trick is that any character is either a whitespace or a non-whitespace; the class covers both.

Composing classes#

You can mix shorthands, ranges, and individual characters inside one class:

/[\d\s]/;        # a digit or whitespace
/[A-Z\d_]/;      # uppercase letter, digit, or underscore
/[a-zA-Z\d]/;    # letter or digit (ASCII)

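De Morgan’s law matters: [^\d\w] is not [\D\W]. The first requires the character to be both non-digit and non-word. But every digit is a word character, so [^\d\w] simplifies to [^\w], i.e. \W. The second accepts a character that is non-digit or non-word; since every non-word character is already a non-digit, [\D\W] is just \D, which still matches letters and _. Be careful when combining negated shorthands.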
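
The two classes can be compared on a few sample characters. A minimal sketch: the character list is arbitrary, and each result is joined with | so that the space stays visible.

```perl
use strict;
use warnings;

# One digit, one letter, the underscore, a hyphen, and a space.
my @chars = ('7', 'a', '_', '-', ' ');

my $neg_of_union  = join '|', grep { /^[^\d\w]$/ } @chars;  # neither digit nor word
my $union_of_negs = join '|', grep { /^[\D\W]$/ }  @chars;  # non-digit or non-word
my $non_word      = join '|', grep { /^\W$/ }      @chars;
my $non_digit     = join '|', grep { /^\D$/ }      @chars;

print "[^\\d\\w] -> [$neg_of_union]\n";    # [-| ]
print "[\\D\\W]  -> [$union_of_negs]\n";   # [a|_|-| ]
```

On these inputs [^\d\w] agrees with \W and [\D\W] agrees with \D.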

POSIX classes#

POSIX character classes use the form [:name:] and only work inside […]:

POSIX	Equivalent
`[:alpha:]`	alphabetic
`[:alnum:]`	alphanumeric
`[:digit:]`	digit (like `\d`)
`[:word:]`	word char (Perl extension)
`[:space:]`	whitespace (like `\s`)
`[:upper:]`	uppercase
`[:lower:]`	lowercase
`[:xdigit:]`	hex digit
`[:ascii:]`	0x00–0x7F
`[:cntrl:]`	control character
`[:graph:]`	printable, not space
`[:print:]`	printable, including space
`[:punct:]`	punctuation
`[:blank:]`	space or tab

Negate a POSIX class with ^ immediately after the opening colon; several POSIX classes can share one bracketed class:

/[[:^digit:]]/;   # same as \D
/[[:alpha:][:digit:]]/;  # letter or digit - roughly \w minus '_'

POSIX classes follow the same Unicode-vs-ASCII rules as the shorthands: without /a, [:alpha:] is the Unicode alphabetic set.
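
A few POSIX classes applied to one ASCII string show how they read in practice. This is a sketch; the sample string and variable names are made up.

```perl
use strict;
use warnings;

# Illustrative ASCII input containing a tab, a space, digits, and punctuation.
my $s = "Tab\there, 42!";

my @punct     = $s =~ /[[:punct:]]/g;        # ',' and '!'
my ($digits)  = $s =~ /([[:digit:]]+)/;      # '42'
my ($word)    = $s =~ /^([[:alpha:]]+)/;     # 'Tab'
my $blanks    = () = $s =~ /[[:blank:]]/g;   # counts the tab and the space
my @non_alnum = $s =~ /[[:^alnum:]]/g;       # tab, comma, space, '!'

print "punct=@punct digits=$digits word=$word blanks=$blanks\n";
```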

POSIX also defines two related constructs that are rarely implemented:

Collating elements [[.span-ll.]] - match a multi-character collation element as a single unit (e.g. Spanish ll historically).
Equivalence classes [[=n=]] - match any character that is equivalent under the locale’s collation rules (e.g. accented variants of n).

Perl recognises the syntax but does not support either form; using one in a pattern is a compile-time error. In practice no portable script relies on these; they are documented for completeness.

Unicode properties#

Unicode defines thousands of properties. The notation is \p{Name} for “has this property” and \P{Name} for “does not have this property”:

/\p{Lu}/;              # any uppercase letter, any script
/\p{Greek}/;           # any character in the Greek script
/\p{Number}/;          # any numeric character
/\P{ASCII}/;           # any non-ASCII character

Short single-letter aliases exist for common properties and drop the braces: \pL is a letter, \pN a number, \pP punctuation. \p{L} is the same as \pL.

The unicode chapter covers properties in detail, including the compound form \p{Name=Value}, the \X grapheme cluster, and the charset modifiers /a, /u, /l, /d.
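
Property matches can be counted on a short mixed-script string. A minimal sketch; the string is built from \x{…} escapes so the source file stays ASCII, and the particular code points are chosen only for illustration.

```perl
use strict;
use warnings;

# Greek alpha, Greek beta, Latin 'C', Cyrillic small de, ASCII digit 9.
my $text = "\x{3b1}\x{3b2}C\x{434}9";

my @greek     = $text =~ /\p{Greek}/g;   # the two Greek letters
my @upper     = $text =~ /\p{Lu}/g;      # only 'C' is uppercase
my @letters   = $text =~ /\pL/g;         # everything except the digit
my @non_ascii = $text =~ /\P{ASCII}/g;   # the Greek and Cyrillic characters

print scalar(@greek), " Greek, ", scalar(@upper), " uppercase, ",
      scalar(@letters), " letters, ", scalar(@non_ascii), " non-ASCII\n";
```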

Extended bracketed classes - `(?[ ])`#

The standard […] syntax handles unions (“any of these characters”) well but does not have set operations on classes. The extended form (?[ … ]) does:

Operator	Meaning
`+` or `|`	union (in either operand)
`&`	intersection (in both)
`-`	difference (in left, not in right)
`^`	symmetric difference (in one but not both)
`!`	complement (everything except)

Whitespace inside (?[…]) is ignored, so the operators read as arithmetic.

# Greek letters only:
/(?[ \p{Greek} & \p{Letter} ])/;

# Letters that are not Latin:
/(?[ \p{Letter} - \p{Latin} ])/;

# Hex digit, but not 'a' through 'f':
/(?[ [0-9A-Fa-f] - [a-f] ])/;

The construct is most useful when combining Unicode properties that overlap. Without it, the same expressions would require verbose lookaround or out-of-pattern logic.

(?[…]) is itself a character class - it consumes one character and can be quantified:

/(?[ \p{Letter} & \p{ASCII} ])+ /x;   # ASCII letters

Caveat: (?[…]) is its own little parser inside the regex parser. Inside it, only specific operators and operands are recognised. Mistakes produce specific compile-time errors, which in turn means strict mode (use re 'strict') catches more bad class expressions when you use the extended form.
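
The set operations can be exercised with two precompiled patterns. This is a sketch under assumptions: the `no warnings` line is there because older perls flag the construct as experimental (it is believed to be harmless on newer ones), and the variable names and inputs are invented.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';   # assumption: silences the warning on older perls

# Difference: hex digits minus lowercase a-f.
my $upper_hex    = qr/^(?[ [0-9A-Fa-f] - [a-f] ])+$/;
my @upper_hex_ok = grep { $_ =~ $upper_hex } qw(1F 2b FF 09 G1);

# Intersection: letters that are also ASCII.
my $ascii_letters = qr/^(?[ \p{Letter} & \p{ASCII} ])+$/;
my $plain  = ('cafe'       =~ $ascii_letters) ? 1 : 0;   # 1
my $accent = ("caf\x{e9}"  =~ $ascii_letters) ? 1 : 0;   # 0: e-acute is not ASCII

print "upper hex: @upper_hex_ok\n";        # upper hex: 1F FF 09
print "plain=$plain accent=$accent\n";
```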

Negated class beats `.*?`#

A common newcomer pattern: use .*? (a non-greedy .*) to match everything up to a delimiter, as in <.*?>. The pattern works on the inputs you tested it on; on adversarial input it does not.

"<a> </a>"      =~ /<.*?>/;        # matches '<a>' - fine
"<a> <b>foo"    =~ /<.+?>foo/;     # matches '<a> <b>foo' - bad

In the second case the engine first matched <a>, then needed foo but found a space. Under that backtracking pressure the non-greedy .+? was forced to expand, gladly consuming the > of <a> and the space until foo lined up. The negated character class cannot give ground that way:

"<a> <b>foo" =~ /<[^>]+>foo/;      # matches '<b>foo' - [^>]+ refuses to cross '>'

Two reasons to prefer [^>]+ over .+? whenever the delimiter is a single character:

Correctness: the negated class is a hard barrier; the non-greedy form is a preference.
Performance: a negated class participates in the simple-repetition optimisation; .+? does not (the engine has to leave the inner loop on every iteration to test what follows). On long inputs this matters.
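
Capturing the matched text makes the correctness difference concrete. A minimal sketch using the same input as above; the capture groups and variable names are additions for illustration.

```perl
use strict;
use warnings;

my $input = '<a> <b>foo';

# The lazy quantifier expands across '>' when that is the only way to reach 'foo'.
my ($lazy) = $input =~ /(<.+?>foo)/;

# The negated class cannot cross '>', so the match starts at the second tag.
my ($negated) = $input =~ /(<[^>]+>foo)/;

print "lazy:    $lazy\n";      # lazy:    <a> <b>foo
print "negated: $negated\n";   # negated: <b>foo
```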

Worked example: an IP-address regex#

The canonical “specificity vs. complexity” exercise. Five iterations, each fixing one category of vagueness:

1. Naive. “Four dot-separated digit groups.”

/[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*/;

Matches the string "and then.....?" happily - every group is optional, so the pattern is satisfied by three dots and nothing else.

2. Require digits. Anchor the pattern.

/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/;

Matches 1234.5678.9101112.131415. Each group has digits but no upper bound on count.

3. Bound the digit count, badly.

/^\d{3}\.\d{3}\.\d{3}\.\d{3}$/;

Matches 192.168.001.001 but rejects 1.2.3.4 - leading zeros are not always written, so a fixed width of three is too strict.

4. Allow 1 to 3 digits.

/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;

Now matches 1.2.3.4 and rejects 1234.5.6.7. But it also matches 999.999.999.999 - beyond the 0–255 range.

5. Range-correct.

/^(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.
   (?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.
   (?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.
   (?:25[0-5]|2[0-4]\d|[01]?\d\d?)$/x;

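Each group is one of: 25[0-5] (250–255), 2[0-4]\d (200–249), or [01]?\d\d? (0–199, with or without leading zeros). The pattern now accepts only dotted quads whose four groups each lie in 0–255. Two loose ends remain: \d also matches non-ASCII digits unless you add /a or write [0-9], and $ tolerates one trailing newline (use \z to forbid it).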

The lesson is not “memorise this regex”. It is the progression: each iteration tightened a specific kind of vagueness. The right regex for a problem is the one that admits exactly the right inputs and rejects everything else, and you only get there by asking what the previous regex actually allowed.
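
Asking what a regex actually allows is easier with a small table of inputs and expected verdicts. The sketch below factors the octet into a hypothetical $octet variable and checks the final pattern against strings taken from the iterations above; the %expect table is an assumption made for illustration.

```perl
use strict;
use warnings;

# One octet, 0-255, as in iteration 5; $octet is a name introduced here.
my $octet = qr/(?:25[0-5]|2[0-4]\d|[01]?\d\d?)/;
my $ipv4  = qr/^$octet\.$octet\.$octet\.$octet$/;

# Input => expected verdict (1 = should match, 0 = should not).
my %expect = (
    '1.2.3.4'         => 1,
    '192.168.001.001' => 1,
    '255.255.255.255' => 1,
    '256.1.1.1'       => 0,
    '999.999.999.999' => 0,
    '1234.5.6.7'      => 0,
    '1.2.3'           => 0,
);

# Collect every input whose actual verdict differs from the expected one.
my @wrong = grep { (($_ =~ $ipv4) ? 1 : 0) != $expect{$_} } sort keys %expect;

print @wrong ? "unexpected: @wrong\n" : "all verdicts as expected\n";
```

Growing the table whenever a new kind of bad input comes to mind is a cheap way to repeat the progression above on your own patterns.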

Cross-engine: shorthand classes#

The \d, \w, \s shorthands are not portable. The cross-engine chapter has the full table; the relevant rows extracted:

Shorthand	Perl 5.44 (default)	PCRE2	Emacs	POSIX BRE / ERE	RE2 / Go (default)
`\d`	Unicode (ASCII under `/a`)	ASCII	NO (use `[0-9]` or `[[:digit:]]`)	NO	ASCII only
`\w`	ASCII or Unicode	ASCII	yes (syntax-table-driven)	NO	ASCII only
`\s`	ASCII or Unicode	ASCII	yes	NO	ASCII only
`\b` word boundary	yes	yes	yes (`\<`/`\>` traditional)	NO	yes
`\h`, `\v`	yes	yes	NO	NO	NO

Two things to internalise:

POSIX BRE and ERE lack \d, \w, \s entirely. Portable shell scripts use [0-9], [[:alnum:]_], [[:space:]].
Emacs has \w and \s but no \d. Emacs’s \s syntax is followed by a syntax-class character (\s- for whitespace, \sw for word) - unique to Emacs.

POSIX bracket classes ([[:digit:]], [[:alpha:]], …) are the universal portable spelling: every engine in the comparison recognises them.

A useful habit#

Named and shorthand classes are almost always clearer than explicit ranges. \d{4}-\d{2}-\d{2} reads at a glance; [0-9]{4}-[0-9]{2}-[0-9]{2} needs a moment. Use the ranges only when you have a concrete reason - usually performance in a hot loop, or deliberately restricting to ASCII.