Character classes#
A character class matches exactly one character, chosen from a set you
define. Where a literal a matches only a, the class [abc]
matches any one of a, b, or c.
/cat/; # matches 'cat'
/[bcr]at/; # matches 'bat', 'cat', or 'rat'
/item[0123456789]/; # matches 'item0' through 'item9'
The class still consumes exactly one character of the string. [bcr]at never
matches at (no letter precedes it). In brat, the class cannot absorb both b
and r: an unanchored match succeeds only on the rat substring, and the
anchored /^[bcr]at$/ fails outright.
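A minimal sketch of the one-character rule, assuming a small word list chosen for illustration. It records which substring each word matched, and shows that brat only gets through on its rat substring unless the pattern is anchored:

```perl
use strict;
use warnings;

# One class, one character: record what each word actually matched.
my %matched;
for my $word (qw(bat cat rat at brat)) {
    # The capture group keeps the substring the pattern consumed.
    $matched{$word} = $word =~ /([bcr]at)/ ? $1 : undef;
}

# Anchors force the whole string to fit, so 'brat' is rejected.
my $brat_anchored = 'brat' =~ /^[bcr]at$/ ? 1 : 0;

for my $word (sort keys %matched) {
    printf "%-5s => %s\n", $word, $matched{$word} // '(no match)';
}
print "anchored brat: $brat_anchored\n";
```

The unanchored pattern reports rat for brat, which is why anchors matter when the whole string has to fit the pattern.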
Ranges#
Inside […], a dash between two characters denotes a contiguous
range in the underlying character set:
/[0-9]/; # any ASCII digit
/[a-z]/; # any ASCII lowercase letter
/[a-zA-Z]/; # any ASCII letter
/[0-9a-fA-F]/; # any hex digit
Ranges can be combined with individual characters:
/[0-9bx-z]aa/; # matches '0aa'..'9aa', 'baa', 'xaa', 'yaa', 'zaa'
A dash that is first or last inside the class is literal:
/[-ab]/; # matches '-', 'a', or 'b'
/[ab-]/; # same
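The range rules can be checked by counting matching characters. This is a sketch only: count_matches is a hypothetical helper introduced for illustration, and the sample strings are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: count the characters of $text that fall in a class.
sub count_matches {
    my ($class_re, $text) = @_;
    # The empty-list assignment forces list context, then counts the matches.
    my $count = () = $text =~ /$class_re/g;
    return $count;
}

my $hex        = count_matches(qr/[0-9a-fA-F]/, 'Cafe-42-xyz'); # C a f e 4 2
my $dash_first = count_matches(qr/[-ab]/,       'a-b-c');       # a - b -
my $dash_last  = count_matches(qr/[ab-]/,       'a-b-c');       # same set
my $mixed      = count_matches(qr/[0-9bx-z]/,   'b7qz');        # b 7 z

print "hex=$hex dash_first=$dash_first dash_last=$dash_last mixed=$mixed\n";
```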
Negation#
A caret ^ as the first character inside […] inverts the class:
/[^a]/; # any character except 'a'
/[^0-9]/; # any non-digit
A caret elsewhere is literal:
/[a^]/; # matches 'a' or '^'
A negated class still matches one character — [^a] does not match
the empty string; it requires one non-a.
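A short sketch of how negation behaves at the edges; the hash keys are illustrative labels, not part of any API. Note that a negated class also matches a newline, since newline is simply another character that is not in the set.

```perl
use strict;
use warnings;

# 1 means the pattern matched, 0 means it did not.
my %negation = (
    empty_string  => ''    =~ /[^a]/   ? 1 : 0,  # nothing to consume
    only_a        => 'aaa' =~ /[^a]/   ? 1 : 0,  # no non-'a' present
    one_other     => 'aab' =~ /[^a]/   ? 1 : 0,  # 'b' satisfies the class
    newline       => "\n"  =~ /[^a]/   ? 1 : 0,  # newline is a non-'a'
    literal_caret => '^'   =~ /[a^]/   ? 1 : 0,  # caret not first: literal
    all_digits    => '42'  =~ /[^0-9]/ ? 1 : 0,  # no non-digit present
);

printf "%-13s => %d\n", $_, $negation{$_} for sort keys %negation;
```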
Special characters inside a class#
Inside […] the special set shrinks to - ] \ ^ $ (and the pattern
delimiter). The others — ., *, +, ?, (, ), {, }, |
— are literals in a class:
/[.+*]/; # matches a literal '.', '+', or '*'
/[()]/; # matches '(' or ')'
To match ] inside the class, either escape it or put it first
(after any leading ^):
/[\]]/; # matches ']'
/[]ab]/; # matches ']', 'a', or 'b'
$ and \ are slightly awkward because they interact with
interpolation and escaping:
my $x = 'bcr';
/[$x]at/; # matches 'bat', 'cat', or 'rat' — interpolated
/[\$x]at/; # matches '$at' or 'xat' — '$' is literal
/[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat': literal '\' plus interpolated $x
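The three forms can be compared side by side. The sketch below assumes an illustrative word list and anchors each pattern so that only whole words count. The last part uses \Q...\E (quotemeta), which is one way to interpolate a variable into a class when its contents might include ^, - or ]; the variable name $set is hypothetical.

```perl
use strict;
use warnings;

my $x     = 'bcr';
my @words = ('bat', 'cat', 'rat', '$at', 'xat', '\\at');

my @interpolated = grep { /^[$x]at$/ }   @words;  # class is [bcr]
my @literal      = grep { /^[\$x]at$/ }  @words;  # class is '$' or 'x'
my @both         = grep { /^[\\$x]at$/ } @words;  # class is '\' plus [bcr]

# quotemeta escapes every class metacharacter in the interpolated value.
my $set  = '^-]';
my @safe = grep { /^[\Q$set\E]$/ } ('^', '-', ']', 'a');

print "interpolated: @interpolated\n";
print "literal:      @literal\n";
print "both:         @both\n";
print "safe:         @safe\n";
```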
Shorthand classes#
Several common classes have shorthand names usable both inside and
outside […]:
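| Shorthand | Matches |
|---|---|
| \d | a digit |
| \D | a non-digit |
| \w | a word character (alphanumeric or _) |
| \W | a non-word character |
| \s | whitespace (space, tab, newline, carriage return, form feed) |
| \S | non-whitespace |
| \h | horizontal whitespace (space, tab, and Unicode space characters) |
| \H | non-horizontal-whitespace |
| \v | vertical whitespace (newline, carriage return, form feed, vertical tab, and Unicode line separators) |
| \V | non-vertical-whitespace |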
Under Unicode rules, \d, \w, and \s match more than just ASCII. Perl
applies these rules whenever the string holds code points above 0xFF
or the pattern carries the /u modifier (available since Perl 5.14).
\d then matches any Unicode digit (Devanagari digits, Arabic-Indic
digits, and many more), \w matches any letter or digit in any script
plus marks and connector punctuation, and \s adds Unicode space
characters such as non-breaking space.
To restrict these to ASCII, add the /a modifier or use explicit
ranges like [0-9] and [A-Za-z_0-9].
"item0" =~ /\w\w\w\w\d/; # matches
"abc\x{0660}" =~ /\w\w\w\d/; # matches: U+0660 is an Arabic-Indic zero
"abc\x{0660}" =~ /\w\w\w\d/a;# does not match under /a
The period#
. matches any single character except newline. Under the /s
modifier (covered in the modifiers chapter), . also matches
newline:
"a\nb" =~ /a.b/; # does not match
"a\nb" =~ /a.b/s; # matches
\N is the opposite of what /s provides: it matches any character
except newline, regardless of /s. To match any character including
newline without /s, use [\s\S] or [\d\D], the classic
“match anything” idiom:
"a\nb" =~ /a[\s\S]b/; # matches without /s
Composing classes#
You can mix shorthands, ranges, and individual characters inside one class:
/[\d\s]/; # a digit or whitespace
/[A-Z\d_]/; # uppercase letter, digit, or underscore
/[a-zA-Z\d]/; # letter or digit (ASCII)
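De Morgan’s law matters: [^\d\w] is not [\D\W]. The first
requires the character to be both non-digit and non-word. But every
digit is a word character, so [^\d\w] simplifies to [^\w], i.e.
\W. The second accepts a character that is either non-digit or
non-word; every non-word character is already a non-digit, so
[\D\W] simplifies to \D and still matches letters and the
underscore. Be careful when combining negated shorthands.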
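The difference can be verified exhaustively over the ASCII range. This is a sketch under that assumption (code points 0 to 127 only); the counter names are illustrative.

```perl
use strict;
use warnings;

my %count = (neg_union => 0, non_word => 0, union_neg => 0, non_digit => 0);
my $identities_hold = 1;

for my $cp (0 .. 127) {
    my $ch = chr $cp;
    my $neg_union = $ch =~ /[^\d\w]/ ? 1 : 0;  # neither digit nor word
    my $non_word  = $ch =~ /\W/      ? 1 : 0;
    my $union_neg = $ch =~ /[\D\W]/  ? 1 : 0;  # non-digit or non-word
    my $non_digit = $ch =~ /\D/      ? 1 : 0;

    # [^\d\w] should agree with \W, and [\D\W] should agree with \D.
    $identities_hold = 0
        if $neg_union != $non_word || $union_neg != $non_digit;

    $count{neg_union} += $neg_union;
    $count{non_word}  += $non_word;
    $count{union_neg} += $union_neg;
    $count{non_digit} += $non_digit;
}

print "identities hold: $identities_hold\n";
print "[^\\d\\w] matches $count{neg_union} ASCII characters\n";
print "[\\D\\W] matches $count{union_neg} ASCII characters\n";
```

The two classes differ by exactly the 53 ASCII word characters that are not digits: 52 letters plus the underscore.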
POSIX classes#
POSIX character classes use the form [:name:] and only work inside
[…]:
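| POSIX | Matches |
|---|---|
| [:alpha:] | alphabetic |
| [:alnum:] | alphanumeric |
| [:digit:] | digit (like \d) |
| [:word:] | word char (Perl extension) |
| [:space:] | whitespace (like \s) |
| [:upper:] | uppercase |
| [:lower:] | lowercase |
| [:xdigit:] | hex digit |
| [:ascii:] | 0x00–0x7F |
| [:cntrl:] | control character |
| [:graph:] | printable, not space |
| [:print:] | printable, including space |
| [:punct:] | punctuation |
| [:blank:] | space or tab |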
Negate a POSIX class with ^ inside the colons:
/[[:^digit:]]/; # same as \D
/[[:alpha:][:digit:]]/; # letter or digit: same as [[:alnum:]]; for ASCII, \w minus '_'
POSIX classes follow the same Unicode-vs-ASCII rules as the
shorthands: without /a, [:alpha:] is the Unicode alphabetic set.
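A brief sketch of POSIX classes in use, including negation and the effect of /a. The hash keys are illustrative labels, and U+03B1 (Greek small letter alpha) is chosen as an arbitrary non-ASCII letter.

```perl
use strict;
use warnings;

my $alpha = "\x{03B1}";  # GREEK SMALL LETTER ALPHA

# 1 means the pattern matched, 0 means it did not.
my %posix = (
    digit              => '7'    =~ /^[[:digit:]]$/          ? 1 : 0,
    negated_digit      => 'x'    =~ /^[[:^digit:]]$/         ? 1 : 0,
    alnum_underscore   => '_'    =~ /^[[:alpha:][:digit:]]$/ ? 1 : 0,
    word_underscore    => '_'    =~ /^[[:word:]]$/           ? 1 : 0,
    blank_tab          => "\t"   =~ /^[[:blank:]]$/          ? 1 : 0,
    punct_bang         => '!'    =~ /^[[:punct:]]$/          ? 1 : 0,
    greek_alpha        => $alpha =~ /^[[:alpha:]]$/          ? 1 : 0,
    greek_alpha_ascii  => $alpha =~ /^[[:alpha:]]$/a         ? 1 : 0,
);

printf "%-18s => %d\n", $_, $posix{$_} for sort keys %posix;
```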
Unicode properties#
Unicode defines thousands of properties. The notation is
\p{Name} for “has this property” and \P{Name} for “does not
have this property”.
/\p{Lu}/; # any uppercase letter, any script
/\p{Greek}/; # any character in the Greek script
/\p{Number}/; # any numeric character
/\P{ASCII}/; # any non-ASCII character
Short single-letter aliases exist for common properties and drop the
braces: \pL is a letter, \pN a number, \pP punctuation, and so
on. \p{L} is the same as \pL.
The Unicode chapter covers properties in detail, including the
compound form \p{Name=Value} and the \X grapheme cluster.
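As a quick illustration of the properties listed above, the sketch below tests a few sample characters. The choice of U+03A9 (Greek capital omega) and U+0663 (Arabic-Indic digit three) is arbitrary, and the hash keys are illustrative labels.

```perl
use strict;
use warnings;

my $omega = "\x{03A9}";  # GREEK CAPITAL LETTER OMEGA
my $three = "\x{0663}";  # ARABIC-INDIC DIGIT THREE

# 1 means the pattern matched, 0 means it did not.
my %prop = (
    omega_uppercase => $omega =~ /^\p{Lu}$/     ? 1 : 0,
    omega_greek     => $omega =~ /^\p{Greek}$/  ? 1 : 0,
    omega_non_ascii => $omega =~ /^\P{ASCII}$/  ? 1 : 0,
    omega_letter    => $omega =~ /^\pL$/        ? 1 : 0,  # short form of \p{L}
    three_number    => $three =~ /^\p{Number}$/ ? 1 : 0,
    seven_number    => '7'    =~ /^\pN$/        ? 1 : 0,
    latin_a_greek   => 'A'    =~ /^\p{Greek}$/  ? 1 : 0,
    lower_a_upper   => 'a'    =~ /^\p{Lu}$/     ? 1 : 0,
);

printf "%-15s => %d\n", $_, $prop{$_} for sort keys %prop;
```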
A useful habit#
Named and shorthand classes are almost always clearer than explicit
ranges. \d{4}-\d{2}-\d{2} reads at a glance; [0-9]{4}-[0-9]{2}-[0-9]{2}
needs a moment. Use the ranges only when you have a concrete reason,
usually performance in a hot loop or a deliberate restriction to
ASCII.
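The ASCII trade-off can be made concrete with the date pattern. This is a sketch only: the sample dates are invented, the second one is written with Arabic-Indic digits, and \A and \z are used in place of ^ and $ so that a trailing newline is not accepted.

```perl
use strict;
use warnings;

my $ascii_date  = '2024-01-31';
# The same shape spelled with Arabic-Indic digits (U+0660 to U+0669).
my $arabic_date = "\x{0662}\x{0660}\x{0662}\x{0664}-"
                . "\x{0660}\x{0661}-\x{0663}\x{0661}";

my $shorthand  = qr/\A\d{4}-\d{2}-\d{2}\z/;           # Unicode-aware \d
my $ascii_only = qr/\A\d{4}-\d{2}-\d{2}\z/a;          # \d limited by /a
my $ranges     = qr/\A[0-9]{4}-[0-9]{2}-[0-9]{2}\z/;  # explicit ASCII ranges

# 1 means the pattern matched, 0 means it did not.
my %accepts = (
    shorthand_ascii   => $ascii_date  =~ $shorthand  ? 1 : 0,
    shorthand_arabic  => $arabic_date =~ $shorthand  ? 1 : 0,
    ascii_only_arabic => $arabic_date =~ $ascii_only ? 1 : 0,
    ranges_ascii      => $ascii_date  =~ $ranges     ? 1 : 0,
    ranges_arabic     => $arabic_date =~ $ranges     ? 1 : 0,
);

printf "%-17s => %d\n", $_, $accepts{$_} for sort keys %accepts;
```

If downstream code expects ASCII digits, the /a form keeps the readable shorthand while giving the same restriction as the explicit ranges.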