Character classes
A character class matches exactly one character, chosen from a set you define. Where a literal a matches only a, the class [abc] matches any one of a, b, or c.
/cat/; # matches 'cat'
/[bcr]at/; # matches 'bat', 'cat', or 'rat'
/item[0123456789]/; # matches 'item0' through 'item9'
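The class still consumes exactly one character of the string. [bcr]at never matches at (no letter before the a). Against brat, an unanchored match succeeds only by skipping the b and matching rat; the anchored form ^[bcr]at$ rejects brat outright, because two letters stand where the class expects one.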
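A minimal sketch of the one-character rule, comparing an unanchored match with an anchored one. The word list and the hash names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical word list chosen to exercise both failure cases.
my @words = qw(bat cat rat at brat);

my (%unanchored, %anchored);
for my $word (@words) {
    # Unanchored: the match may begin at any position in $word.
    $unanchored{$word} = $word =~ /[bcr]at/ ? 1 : 0;
    # Anchored: the whole string must be one class character plus 'at'.
    $anchored{$word} = $word =~ /^[bcr]at\z/ ? 1 : 0;
    printf "%-5s unanchored=%d anchored=%d\n",
        $word, $unanchored{$word}, $anchored{$word};
}
```

Only brat differs between the two columns: unanchored, the engine finds rat inside it; anchored, the extra letter makes the match fail.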
Ranges
Inside […], a dash between two characters denotes a contiguous range in the underlying character set:
/[0-9]/; # any ASCII digit
/[a-z]/; # any ASCII lowercase letter
/[a-zA-Z]/; # any ASCII letter
/[0-9a-fA-F]/; # any hex digit
Ranges can be combined with individual characters:
/[0-9bx-z]aa/; # matches '0aa'..'9aa', 'baa', 'xaa', 'yaa', 'zaa'
A dash that is first or last inside the class is literal:
/[-ab]/; # matches '-', 'a', or 'b'
/[ab-]/; # same
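To see which characters a class actually accepts, one option is to filter a sample string through it. The helper accepted() and the sample string below are hypothetical, introduced only for this sketch.

```perl
use strict;
use warnings;

# Return the characters of $sample that the one-character class accepts.
sub accepted {
    my ($class_re, $sample) = @_;
    return join '', grep { /$class_re/ } split //, $sample;
}

my $sample = 'a-b09xyzG';

my $hex        = accepted(qr/[0-9a-fA-F]/, $sample);  # hex digits only
my $mixed      = accepted(qr/[0-9bx-z]/,   $sample);  # range + single + range
my $dash_first = accepted(qr/[-ab]/,       $sample);  # leading dash is literal
my $dash_last  = accepted(qr/[ab-]/,       $sample);  # trailing dash is literal

print "hex=$hex mixed=$mixed dash_first=$dash_first dash_last=$dash_last\n";
```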
Negation
A caret ^ as the first character inside […] inverts the class:
/[^a]/; # any character except 'a'
/[^0-9]/; # any non-digit
A caret elsewhere is literal:
/[a^]/; # matches 'a' or '^'
A negated class still matches one character — [^a] does not match the empty string; it requires one non-a.
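The following sketch checks the negation rules one by one. The variable names are illustrative; the last check, that a negated class matches a newline where the period would not, is standard Perl behaviour rather than something stated above.

```perl
use strict;
use warnings;

my $empty_matches   = ''    =~ /[^a]/ ? 1 : 0;  # no character to consume
my $all_a_matches   = 'aaa' =~ /[^a]/ ? 1 : 0;  # every character is 'a'
my $mixed_matches   = 'aab' =~ /[^a]/ ? 1 : 0;  # 'b' is a non-'a'
my $literal_caret   = '^'   =~ /[a^]/ ? 1 : 0;  # caret not first: literal
my $newline_matches = "\n"  =~ /[^a]/ ? 1 : 0;  # newline is a non-'a' too

print "empty=$empty_matches all_a=$all_a_matches mixed=$mixed_matches ",
      "caret=$literal_caret newline=$newline_matches\n";
```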
Special characters inside a class
Inside […] the special set shrinks to - ] \ ^ $ (and the pattern delimiter). The others — ., *, +, ?, (, ), {, }, | — are literals in a class:
/[.+*]/; # matches a literal '.', '+', or '*'
/[()]/; # matches '(' or ')'
To match ] inside the class, either escape it or put it first (after any leading ^):
/[\]]/; # matches ']'
/[]ab]/; # matches ']', 'a', or 'b'
$ and \ are slightly awkward because they interact with interpolation and escaping:
my $x = 'bcr';
/[$x]at/; # matches 'bat', 'cat', or 'rat' — interpolated
/[\$x]at/; # matches '$at' or 'xat' — '$' is literal
/[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat': a literal '\' plus the interpolated $x
\b inside a character class means backspace (\x08), not “word boundary”. Outside a class it is the word boundary assertion. This is the single most common dual-meaning trap in the regex syntax — see the anchors and assertions chapter for the boundary form.
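The interpolation and escaping cases can be checked directly. This is a small sketch with anchors added so that each test string must match in full; the variable names are illustrative.

```perl
use strict;
use warnings;

my $x = 'bcr';

my $interpolated = 'cat'  =~ /^[$x]at\z/   ? 1 : 0;  # class becomes [bcr]
my $dollar       = '$at'  =~ /^[\$x]at\z/  ? 1 : 0;  # class is '$' or 'x'
my $not_b        = 'bat'  =~ /^[\$x]at\z/  ? 1 : 0;  # 'b' is not in [\$x]
my $backslash    = '\\at' =~ /^[\\$x]at\z/ ? 1 : 0;  # '\' plus b, c, r
my $backspace    = "\x08" =~ /^[\b]\z/     ? 1 : 0;  # \b in a class: backspace

print "interpolated=$interpolated dollar=$dollar not_b=$not_b ",
      "backslash=$backslash backspace=$backspace\n";
```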
Shorthand classes
Several common classes have shorthand names. Most are usable both inside and outside […]; the exceptions are \R and bare \N, which work only outside:
| Shorthand | Matches |
|---|---|
| \d | a digit |
| \D | a non-digit |
| \w | a word character (alphanumeric or _) |
| \W | a non-word character |
| \s | whitespace (space, tab, \n, \r, \f, \v) |
| \S | non-whitespace |
| \h | horizontal whitespace (space, tab, Unicode) |
| \H | non-horizontal-whitespace |
| \v | vertical whitespace (\n, \x0B, \f, \r, \x85, \x{2028}, \x{2029}) |
| \V | non-vertical-whitespace |
| \R | linebreak: \r\n or any single \v character |
| \N | any character except \n |
Under Unicode (the default), \d, \w, \s match more than just ASCII. \d matches any Unicode digit (Devanagari digits, Arabic-Indic digits, and many more), \w matches any letter in any script plus marks and connector punctuation, and \s adds Unicode space characters such as non-breaking space.
To restrict these to ASCII, add the /a modifier or use explicit ranges like [0-9] and [A-Za-z_0-9].
"item0" =~ /\w\w\w\w\d/; # matches
"abc\x{0660}" =~ /\w\w\w\d/; # matches: U+0660 is an Arabic-Indic zero
"abc\x{0660}" =~ /\w\w\w\d/a; # does not match under /a
\R is the linebreak shorthand — it matches any of the recognised line-break sequences as one token. Useful for parsing text that may have CRLF, LF, or rarer line terminators interchangeably. Unlike a class, \R may match two characters (the CRLF case) and so cannot appear inside […].
\N (uppercase) means “any character except \n”, and is not affected by the /s modifier. This is the dual meaning to be careful of: \N{NAME} (with brace) is a Unicode named-character escape (see the unicode chapter); bare \N is the non-newline class.
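A short sketch of both points: \R used to split text with mixed line endings, and \N staying newline-averse under /s. The sample text is invented for the example.

```perl
use strict;
use warnings;

# \R treats CRLF as a single line break, so mixed endings split cleanly.
my $text  = "one\r\ntwo\nthree\rfour";
my @lines = split /\R/, $text;
print scalar(@lines), " lines: @lines\n";

# Under /s the period matches newline, but \N still refuses to.
my $dot_with_s = "a\nb" =~ /a.b/s  ? 1 : 0;
my $N_with_s   = "a\nb" =~ /a\Nb/s ? 1 : 0;
print "dot=$dot_with_s N=$N_with_s\n";
```

Splitting on \r?\n instead would leave the bare \r in the sample text unhandled, which is the case \R is meant to cover.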
The period
. matches any single character except newline. Under the /s modifier (covered in the modifiers chapter), . also matches newline:
"a\nb" =~ /a.b/; # does not match
"a\nb" =~ /a.b/s; # matches
When you want “any character including newline” without /s, the classic idiom is [\s\S] (or [\d\D]):
"a\nb" =~ /a[\s\S]b/; # matches without /s
The trick is that any character is either a whitespace or a non-whitespace; the class covers both.
Composing classes
You can mix shorthands, ranges, and individual characters inside one class:
/[\d\s]/; # a digit or whitespace
/[A-Z\d_]/; # uppercase letter, digit, or underscore
/[a-zA-Z\d]/; # ASCII letter or digit (\d is ASCII-only under /a)
De Morgan’s law matters: [^\d\w] is not [\D\W]. The first requires the character to be both non-digit and non-word; since every digit is a word character, [^\d\w] simplifies to [^\w], i.e. \W. The second accepts a character that is non-digit or non-word, which simplifies to \D, so it also matches letters and _. Be careful when combining negated shorthands.
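The difference between the two classes shows up on letters and the underscore. The sketch below tabulates a handful of sample characters (chosen arbitrarily) against both classes and against \W and \D.

```perl
use strict;
use warnings;

my @samples = ('7', 'a', '_', '!', ' ');
my %result;
for my $ch (@samples) {
    $result{$ch} = {
        negated_union    => ($ch =~ /[^\d\w]/ ? 1 : 0),  # not digit AND not word
        union_of_negated => ($ch =~ /[\D\W]/  ? 1 : 0),  # non-digit OR non-word
        non_word         => ($ch =~ /\W/      ? 1 : 0),
        non_digit        => ($ch =~ /\D/      ? 1 : 0),
    };
    printf "'%s'  [^\\d\\w]=%d  [\\D\\W]=%d\n",
        $ch, @{ $result{$ch} }{qw(negated_union union_of_negated)};
}
```

For every sample, the first column agrees with \W and the second with \D.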
POSIX classes
POSIX character classes use the form [:name:] and only work inside […]:
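| POSIX | Equivalent |
|---|---|
| [:alpha:] | alphabetic |
| [:alnum:] | alphanumeric |
| [:digit:] | digit (like \d) |
| [:word:] | word char (Perl extension) |
| [:space:] | whitespace (like \s) |
| [:upper:] | uppercase |
| [:lower:] | lowercase |
| [:xdigit:] | hex digit |
| [:ascii:] | 0x00–0x7F |
| [:cntrl:] | control character |
| [:graph:] | printable, not space |
| [:print:] | printable, including space |
| [:punct:] | punctuation |
| [:blank:] | space or tab |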
Negate a POSIX class with ^ inside the colons:
/[[:^digit:]]/; # same as \D
/[[:alpha:][:digit:]]/; # letter or digit: roughly \w minus '_'
POSIX classes follow the same Unicode-vs-ASCII rules as the shorthands: without /a, [:alpha:] is the Unicode alphabetic set.
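As a quick check of the POSIX forms, the sketch below filters an ASCII sample string through several of them. The helper accepted() and the sample are hypothetical, used only for this illustration.

```perl
use strict;
use warnings;

# Return the characters of $sample that the one-character class accepts.
sub accepted {
    my ($class_re, $sample) = @_;
    return join '', grep { /$class_re/ } split //, $sample;
}

my $sample = "aZ5_ -\t";

my $alpha     = accepted(qr/[[:alpha:]]/,          $sample);  # letters
my $alnum     = accepted(qr/[[:alpha:][:digit:]]/, $sample);  # two classes in one bracket
my $word      = accepted(qr/[[:word:]]/,           $sample);  # adds '_'
my $non_digit = accepted(qr/[[:^digit:]]/,         $sample);  # negated POSIX class
my $blank     = accepted(qr/[[:blank:]]/,          $sample);  # space and tab

print join('|', $alpha, $alnum, $word), "\n";
```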
POSIX also defines two related constructs that are rarely implemented:
- Collating elements, [.span-ll.]: match a multi-character collation element as a single unit (e.g. Spanish ll, historically).
- Equivalence classes, [[=n=]]: match any character that is equivalent under the locale’s collation rules (e.g. accented variants of n).
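Perl recognises the syntax but does not support either form: inside a bracketed class both are reserved, and using one raises a compile-time error rather than matching anything. In practice no portable script relies on these; they are documented for completeness.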
Unicode properties
Unicode defines thousands of properties. The notation is \p{Name} for “has this property” and \P{Name} for “does not have this property”:
/\p{Lu}/; # any uppercase letter, any script
/\p{Greek}/; # any character in the Greek script
/\p{Number}/; # any numeric character
/\P{ASCII}/; # any non-ASCII character
Short single-letter aliases exist for common properties and drop the braces: \pL is a letter, \pN a number, \pP punctuation. \p{L} is the same as \pL.
The unicode chapter covers properties in detail, including the compound form \p{Name=Value}, the \X grapheme cluster, and the charset modifiers /a, /u, /l, /d.
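A brief sketch of \p and \P against two non-ASCII characters, written with \x{…} escapes so the source stays ASCII. The characters and hash keys are picked for illustration.

```perl
use strict;
use warnings;

my $sigma = "\x{03A3}";   # GREEK CAPITAL LETTER SIGMA
my $zero  = "\x{0660}";   # ARABIC-INDIC DIGIT ZERO

my %check = (
    sigma_is_upper     => ($sigma =~ /\p{Lu}/     ? 1 : 0),
    sigma_is_greek     => ($sigma =~ /\p{Greek}/  ? 1 : 0),
    sigma_is_non_ascii => ($sigma =~ /\P{ASCII}/  ? 1 : 0),
    zero_is_number     => ($zero  =~ /\p{Number}/ ? 1 : 0),
    zero_is_letter     => ($zero  =~ /\pL/        ? 1 : 0),
    latin_a_is_greek   => ('A'    =~ /\p{Greek}/  ? 1 : 0),
    # The brace-less and braced spellings should agree.
    short_equals_long  => (('A' =~ /\pL/ ? 1 : 0) == ('A' =~ /\p{L}/ ? 1 : 0) ? 1 : 0),
);

print "$_=$check{$_}\n" for sort keys %check;
```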
Extended bracketed classes: (?[ ])
The standard […] syntax handles unions (“any of these characters”) well but does not have set operations on classes. The extended form (?[ … ]) does:
| Operator | Meaning |
|---|---|
| + | union (characters in either operand) |
| & | intersection (in both) |
| - | difference (in left, not in right) |
| ^ | symmetric difference (in one but not both) |
| ! | complement (everything except) |
Whitespace inside (?[…]) is ignored, so the operators read as arithmetic.
# Greek letters only:
/(?[ \p{Greek} & \p{Letter} ])/;
# Letters that are not Latin:
/(?[ \p{Letter} - \p{Latin} ])/;
# Hex digit, but not 'a' through 'f':
/(?[ [0-9A-Fa-f] - [a-f] ])/;
The construct is most useful when combining Unicode properties that overlap. Without it, the same expressions would require verbose lookaround or out-of-pattern logic.
(?[…]) is itself a character class — it consumes one character and can be quantified:
/(?[ \p{Letter} & \p{ASCII} ])+ /x; # ASCII letters
Caveat: (?[…]) is its own little parser inside the regex parser. Inside it, only specific operators and operands are recognised. Mistakes produce specific compile-time errors, which in turn means strict mode (use re 'strict') catches more bad class expressions when you use the extended form.
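The three set expressions shown earlier can be exercised as follows. This sketch assumes a perl where (?[ ]) is available; the construct was flagged experimental before Perl 5.36, so the warning category is switched off here. Variable names are illustrative.

```perl
use strict;
use warnings;
# (?[ ]) was experimental before Perl 5.36; silence that warning on older perls.
no warnings 'experimental::regex_sets';

my $greek_letter     = qr/(?[ \p{Greek} & \p{Letter} ])/;
my $non_latin_letter = qr/(?[ \p{Letter} - \p{Latin} ])/;
my $hex_not_lower_af = qr/(?[ [0-9A-Fa-f] - [a-f] ])/;

my $alpha = "\x{03B1}";   # GREEK SMALL LETTER ALPHA

my %check = (
    alpha_greek_letter => ($alpha =~ $greek_letter     ? 1 : 0),
    a_greek_letter     => ('a'    =~ $greek_letter     ? 1 : 0),
    alpha_non_latin    => ($alpha =~ $non_latin_letter ? 1 : 0),
    a_non_latin        => ('a'    =~ $non_latin_letter ? 1 : 0),
    digit_non_latin    => ('1'    =~ $non_latin_letter ? 1 : 0),
    upper_A_hex        => ('A'    =~ $hex_not_lower_af ? 1 : 0),
    nine_hex           => ('9'    =~ $hex_not_lower_af ? 1 : 0),
    lower_a_hex        => ('a'    =~ $hex_not_lower_af ? 1 : 0),
);

print "$_=$check{$_}\n" for sort keys %check;
```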
Negated class beats .*?
A common newcomer pattern: use .*? (non-greedy .) to match everything up to a delimiter, like <.*?>. The pattern works on the inputs you tested it on; on adversarial input it does not.
"<a> </a>" =~ /<.*?>/; # matches '<a>' — fine
"<a> <b>foo" =~ /<.+?>foo/; # matches '<a> <b>foo' — bad
In the second case the engine first matched <a>, then needed foo but found a space. Under that backtracking pressure the non-greedy .+? was forced to expand, gladly consuming the > of <a> and the space until foo lined up. The negated character class cannot give ground that way:
"<a> <b>foo" =~ /<[^>]+>foo/; # matches '<b>foo' — [^>]+ refuses to cross '>'
Two reasons to prefer [^>]+ over .+? whenever the delimiter is a single character:
- Correctness: the negated class is a hard barrier; the non-greedy form is a preference.
- Performance: a negated class participates in the simple-repetition optimisation; .+? does not (the engine has to leave the inner loop on every iteration to test what follows). On long inputs this matters.
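Capturing the matched text makes the difference between the two forms visible. The second half of this sketch applies the same barrier to a made-up snippet of markup.

```perl
use strict;
use warnings;

my $input = '<a> <b>foo';

# Lazy dot: backtracking lets .+? cross the first '>' to reach 'foo'.
my ($lazy) = $input =~ /(<.+?>foo)/;
# Negated class: [^>]+ can never consume '>', so the match starts at '<b>'.
my ($negated) = $input =~ /(<[^>]+>foo)/;

print "lazy:    $lazy\n";
print "negated: $negated\n";

# The same barrier keeps each capture inside a single pair of angle brackets.
my @tags = '<p class="x">text</p>' =~ /<([^>]+)>/g;
print "tags: ", join(' | ', @tags), "\n";
```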
Worked example: an IP-address regex
The canonical “specificity vs. complexity” exercise. Five iterations, each fixing one category of vagueness:
1. Naive. “Four dot-separated digit groups.”
/[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*/;
Matches 'and then.....?' happily: every group is optional, so the pattern is satisfied by three dots and nothing else.
2. Require digits. Anchor the pattern.
/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/;
Matches 1234.5678.9101112.131415. Each group has digits but no upper bound on count.
3. Bound the digit count, badly.
/^\d{3}\.\d{3}\.\d{3}\.\d{3}$/;
Matches 192.168.001.001 but rejects 1.2.3.4: addresses are rarely written zero-padded.
4. Allow 1 to 3 digits.
/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;
Now matches 1.2.3.4 and rejects 1234.5.6.7. But it also matches 999.999.999.999 — beyond the 0–255 range.
5. Range-correct.
/^(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.
(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$/x;
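Each group is one of: 25[0-5] (250–255), 2[0-4]\d (200–249), or [01]?\d\d? (0–199 in various forms). The pattern now matches only strings that name a syntactically valid IPv4 address, with two caveats: without /a, \d also accepts non-ASCII digits, and $ tolerates one trailing newline (use \z to forbid it).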
The lesson is not “memorise this regex”. It is the progression: each iteration tightened a specific kind of vagueness. The right regex for a problem is the one that admits exactly the right inputs and rejects everything else, and you only get there by asking what the previous regex actually allowed.
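One way to keep the final pattern readable is to compile the octet once with qr and reuse it. The sketch below does that and runs the strings discussed in the iterations through it. The /a modifier and the \A and \z anchors are hardening added for this example, not part of the pattern above, and the test strings beyond those in the text are invented.

```perl
use strict;
use warnings;

# One octet, compiled once. /a keeps \d ASCII-only (an addition here).
my $octet = qr/(?:25[0-5]|2[0-4]\d|[01]?\d\d?)/a;

# \A and \z anchor the whole string; unlike $, \z rejects a trailing newline.
my $ipv4 = qr/\A$octet\.$octet\.$octet\.$octet\z/;

# Each case pairs a string with the verdict we expect (1 = valid).
my @cases = (
    [ '1.2.3.4',         1 ],
    [ '192.168.001.001', 1 ],
    [ '255.255.255.255', 1 ],
    [ '256.1.1.1',       0 ],
    [ '999.999.999.999', 0 ],
    [ '1234.5.6.7',      0 ],
    [ '1.2.3',           0 ],
    [ "1.2.3.4\n",       0 ],
);

my $failures = 0;
for my $case (@cases) {
    my ($string, $want) = @$case;
    my $got = $string =~ $ipv4 ? 1 : 0;
    $failures++ if $got != $want;
    (my $shown = $string) =~ s/\n/\\n/g;   # make the newline visible
    printf "%-18s %s\n", $shown, $got ? 'valid' : 'rejected';
}
print "unexpected verdicts: $failures\n";
```

Keeping the cases in a table like this makes it easy to add the next counterexample when a later iteration of the pattern is proposed.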
Cross-engine: shorthand classes
The \d, \w, \s shorthands are not portable. The cross-engine chapter has the full table; the relevant rows extracted:
| Shorthand | Perl 5.42 (default) | PCRE2 | Emacs | POSIX BRE / ERE | RE2 / Go (default) |
|---|---|---|---|---|---|
| \d | ASCII (or Unicode under …) | ASCII | NO (use …) | NO | ASCII; Unicode under … |
| \w | ASCII or Unicode | ASCII | yes (syntax-table-driven) | NO | ASCII; Unicode under … |
| \s | ASCII or Unicode | ASCII | yes | NO | ASCII; Unicode under … |
| … | yes | yes | yes (…) | NO | yes |
| … | yes | yes | NO | NO | NO |
Two things to internalise:
- POSIX BRE and ERE lack \d, \w, \s entirely. Portable shell scripts use [0-9], [[:alnum:]_], [[:space:]].
- Emacs has \w and \s but no \d. Emacs’s \s syntax is followed by a syntax-class character (\s- for whitespace, \sw for word), which is unique to Emacs.
POSIX bracket classes ([[:digit:]], [[:alpha:]], …) are the universal portable spelling: every engine in the comparison recognises them.
A useful habit
Named and shorthand classes are almost always clearer than explicit ranges. \d{4}-\d{2}-\d{2} reads; [0-9]{4}-[0-9]{2}-[0-9]{2} needs a moment. Use the ranges only when you have a concrete reason — usually performance in a hot loop, or deliberately restricting to ASCII.
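If ASCII-only matching is the concern, /a lets the shorthand keep its readability. The sketch below compares the shorthand with and without /a against the explicit-range spelling. The sample strings are invented, and the patterns check only the shape of a date, not its validity.

```perl
use strict;
use warnings;

my $shorthand_ascii = qr/\A\d{4}-\d{2}-\d{2}\z/a;             # \d limited to 0-9
my $shorthand_plain = qr/\A\d{4}-\d{2}-\d{2}\z/;              # \d follows Unicode
my $explicit_ranges = qr/\A[0-9]{4}-[0-9]{2}-[0-9]{2}\z/;

my $ascii_date   = '2024-02-29';
my $short_month  = '2024-2-29';
my $arabic_digit = "2024-02-2\x{0669}";   # ends in ARABIC-INDIC DIGIT NINE

my %verdict;
for my $name (qw(ascii_date short_month arabic_digit)) {
    my $s = { ascii_date   => $ascii_date,
              short_month  => $short_month,
              arabic_digit => $arabic_digit }->{$name};
    $verdict{$name} = {
        ascii  => ($s =~ $shorthand_ascii ? 1 : 0),
        plain  => ($s =~ $shorthand_plain ? 1 : 0),
        ranges => ($s =~ $explicit_ranges ? 1 : 0),
    };
    printf "%-12s /a=%d plain=%d ranges=%d\n",
        $name, @{ $verdict{$name} }{qw(ascii plain ranges)};
}
```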
See also
- The unicode chapter: \p{…}, \P{…}, \X, the charset modifiers /a, /u, /l, /d.
- The anchors and assertions chapter: \b outside a class.
- The cross-engine chapter: full table of shorthand support across engines.
- m: the match operator.
- qr: compile a pattern for reuse.