Character classes

A character class matches exactly one character, chosen from a set you define. Where a literal a matches only a, the class [abc] matches any one of a, b, or c.

/cat/;               # matches 'cat'
/[bcr]at/;           # matches 'bat', 'cat', or 'rat'
/item[0123456789]/;  # matches 'item0' through 'item9'

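The class still consumes exactly one character of the string. [bcr]at never matches at on its own, because there is no letter for the class to consume. In brat the pattern does match, but only the substring rat: the class takes the r, and the leading b is left outside the match. To reject brat outright, anchor the pattern, as in ^[bcr]at$.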
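One way to see the one-character rule is to run the pattern over a few words and record what was matched. This is a minimal sketch; the word list and the %matched hash are illustrative names, not part of any API.

```perl
use strict;
use warnings;

# For each word, capture the text that /[bcr]at/ matched (undef if no match).
my %matched;
for my $word (qw(bat cat at brat)) {
    my ($hit) = $word =~ /([bcr]at)/;
    $matched{$word} = $hit;
}

for my $word (qw(bat cat at brat)) {
    printf "%-4s => %s\n", $word, $matched{$word} // 'no match';
}
```

Because the pattern is unanchored, brat reports rat: the class consumed only the r.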

Ranges

Inside […], a dash between two characters denotes a contiguous range in the underlying character set:

/[0-9]/;         # any ASCII digit
/[a-z]/;         # any ASCII lowercase letter
/[a-zA-Z]/;      # any ASCII letter
/[0-9a-fA-F]/;   # any hex digit

Ranges can be combined with individual characters:

/[0-9bx-z]aa/;   # matches '0aa'..'9aa', 'baa', 'xaa', 'yaa', 'zaa'

A dash that is first or last inside the class is literal:

/[-ab]/;         # matches '-', 'a', or 'b'
/[ab-]/;         # same
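As an illustration of ranges and a literal trailing dash, the following sketch validates two kinds of input. The helper names is_hex_color and is_phone_chars are hypothetical, and the \A and \z anchors and the {6} quantifier are borrowed from other chapters.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $s is '#' followed by exactly six hex digits.
sub is_hex_color {
    my ($s) = @_;
    return $s =~ /\A#[0-9a-fA-F]{6}\z/ ? 1 : 0;
}

# Hypothetical helper: true if $s contains only digits, spaces, '+' and '-'.
# The dash is last in the class, so it is a literal rather than a range.
sub is_phone_chars {
    my ($s) = @_;
    return $s =~ /\A[0-9 +-]+\z/ ? 1 : 0;
}

print "hex ok\n"   if  is_hex_color('#1a2B3c');
print "hex bad\n"  if !is_hex_color('#1a2B3g');   # 'g' is outside a-f and A-F
print "phone ok\n" if  is_phone_chars('+44 20-7946');
```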

Negation

A caret ^ as the first character inside […] inverts the class:

/[^a]/;          # any character except 'a'
/[^0-9]/;        # any non-digit

A caret elsewhere is literal:

/[a^]/;          # matches 'a' or '^'

A negated class still matches one character — [^a] does not match the empty string; it requires one non-a.
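A common use of a negated class is "everything up to a delimiter". The sketch below, with made-up sample data and variable names, also checks the two edge cases described above: the empty string and a caret that is not first.

```perl
use strict;
use warnings;

my $line = 'alice,42,admin';

# [^,]+ consumes characters up to, but not including, the first comma.
my ($first_field) = $line =~ /\A([^,]+)/;

# A negated class still needs a character to consume, so the empty string fails.
my $empty_matches = ('' =~ /[^a]/) ? 1 : 0;

# A caret that is not first is an ordinary member of the class.
my $caret_matches = ('2^8' =~ /[a^]/) ? 1 : 0;

print "first field: $first_field\n";
print "empty string matches [^a]: $empty_matches\n";
print "'2^8' matches [a^]: $caret_matches\n";
```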

Special characters inside a class

Inside […] the special set shrinks to - ] \ ^ $ (and the pattern delimiter). The others — ., *, +, ?, (, ), {, }, | — are literals in a class:

/[.+*]/;         # matches a literal '.', '+', or '*'
/[()]/;          # matches '(' or ')'

To match ] inside the class, either escape it or put it first (after any leading ^):

/[\]]/;          # matches ']'
/[]ab]/;         # matches ']', 'a', or 'b'
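To see these rules together, the following sketch pulls every ., +, *, parenthesis, and closing bracket out of a sample expression. The sample string and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $expr = 'f(x) = a[0].b + c*2';

# Inside the class, '.', '+', '*', '(' and ')' are literals; ']' is escaped.
my @hits = $expr =~ /[.+*()\]]/g;

print scalar(@hits), " matches: @hits\n";
```

The opening [ in the sample is not in the class, so it is not reported.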

$ and \ are slightly awkward because they interact with interpolation and escaping:

my $x = 'bcr';
/[$x]at/;        # matches 'bat', 'cat', or 'rat' — interpolated
/[\$x]at/;       # matches '$at' or 'xat' — '$' is literal
/[\\$x]at/;      # matches '\at', 'bat', 'cat', or 'rat'
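When the interpolated variable may itself contain ^, -, or ], one possible approach is to escape it with the built-in quotemeta before building the class. This is a sketch under that assumption, not the only way to do it; the variable names are illustrative.

```perl
use strict;
use warnings;

# Characters that would change the meaning of a class if interpolated raw:
# a leading '^' negates, '-' forms a range, ']' closes the class.
my $chars = '^-]';

my $escaped = quotemeta $chars;    # backslash-escapes each non-word character
my $class   = qr/[$escaped]/;

for my $s ('a-b', 'x^2', 'end]', 'plain') {
    printf "%-5s %s\n", $s, $s =~ $class ? 'match' : 'no match';
}
```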

Shorthand classes

Several common classes have shorthand names usable both inside and outside […]:

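| Shorthand | Matches |
| --- | --- |
| \d | a digit |
| \D | a non-digit |
| \w | a word character (alphanumeric or _) |
| \W | a non-word character |
| \s | whitespace (space, tab, \r, \n, \f, and more) |
| \S | non-whitespace |
| \h | horizontal whitespace (space, tab, and other Unicode horizontal spaces) |
| \H | non-horizontal-whitespace |
| \v | vertical whitespace (\n, \r, \f, vertical tab, and Unicode line and paragraph separators) |
| \V | non-vertical-whitespace |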

Under Unicode rules, \d, \w, and \s match more than just ASCII. Unicode rules always apply to code points above 255, and apply throughout under the /u modifier or use feature 'unicode_strings' (which use v5.12 or later turns on). \d matches any Unicode digit (Devanagari digits, Arabic-Indic digits, and many more), \w matches any letter in any script plus marks and connector punctuation, and \s adds Unicode space characters such as non-breaking space.

To restrict these to ASCII, add the /a modifier or use explicit ranges like [0-9] and [A-Za-z_0-9].

"item0" =~ /\w\w\w\w\d/;     # matches
"abc\x{0660}" =~ /\w\w\w\d/; # matches: U+0660 is an Arabic-Indic zero
"abc\x{0660}" =~ /\w\w\w\d/a;# does not match under /a

The period

. matches any single character except newline. Under the /s modifier (covered in the modifiers chapter), . also matches newline:

"a\nb" =~ /a.b/;        # does not match
"a\nb" =~ /a.b/s;       # matches

\N does not help here: it matches any character except newline regardless of /s, which is the opposite of what you want. To match any character including newline without /s, use [\s\S] or [\d\D], the classic “match anything” idiom:

"a\nb" =~ /a[\s\S]b/;   # matches without /s

Composing classes

You can mix shorthands, ranges, and individual characters inside one class:

/[\d\s]/;        # a digit or whitespace
/[A-Z\d_]/;      # uppercase letter, digit, or underscore
/[a-zA-Z\d]/;    # letter or digit (ASCII)

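De Morgan’s law matters: [^\d\w] is not [\D\W]. The first requires the character to be both non-digit and non-word; since every digit is a word character, it simplifies to [^\w], i.e. \W. The second accepts a character that is non-digit or non-word; since every non-word character is also a non-digit, it simplifies to \D. Be careful when combining negated shorthands.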
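If the set algebra is hard to trust on paper, it can be checked by brute force. The sketch below walks the ASCII range and compares [^\d\w] against \W and [\D\W] against \D; the array names are illustrative, and only ASCII is covered.

```perl
use strict;
use warnings;

my (@differs_from_W, @differs_from_D);
for my $code (0 .. 127) {
    my $ch = chr $code;

    # [^\d\w] should agree with \W on every character.
    push @differs_from_W, $code
        if ($ch =~ /[^\d\w]/ ? 1 : 0) != ($ch =~ /\W/ ? 1 : 0);

    # [\D\W] should agree with \D on every character.
    push @differs_from_D, $code
        if ($ch =~ /[\D\W]/ ? 1 : 0) != ($ch =~ /\D/ ? 1 : 0);
}

print "[^\\d\\w] vs \\W mismatches: ", scalar(@differs_from_W), "\n";
print "[\\D\\W] vs \\D mismatches: ",  scalar(@differs_from_D), "\n";
print "'a' matches [\\D\\W] but not [^\\d\\w]\n"
    if 'a' =~ /[\D\W]/ && 'a' !~ /[^\d\w]/;
```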

POSIX classes

POSIX character classes use the form [:name:] and only work inside […]:

| POSIX class | Matches |
| --- | --- |
| [:alpha:] | alphabetic |
| [:alnum:] | alphanumeric |
| [:digit:] | digit (like \d) |
| [:word:] | word char (Perl extension) |
| [:space:] | whitespace (like \s) |
| [:upper:] | uppercase |
| [:lower:] | lowercase |
| [:xdigit:] | hex digit |
| [:ascii:] | 0x00–0x7F |
| [:cntrl:] | control character |
| [:graph:] | printable, not space |
| [:print:] | printable, including space |
| [:punct:] | punctuation |
| [:blank:] | space or tab |

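Negate a POSIX class with ^ just after the opening colon. Several POSIX classes can also be combined inside one bracketed class:

/[[:^digit:]]/;          # same as \D
/[[:alpha:][:digit:]]/;  # letter or digit, i.e. [[:alnum:]]; \w minus '_' under ASCII rules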

POSIX classes follow the same Unicode-vs-ASCII rules as the shorthands: without /a, [:alpha:] is the Unicode alphabetic set.
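A short sketch of POSIX classes at work, using an invented sample string: it counts punctuation and blanks, then strips every non-digit with the negated form.

```perl
use strict;
use warnings;

my $s = "Tab\there, 42!";

my $punct = () = $s =~ /[[:punct:]]/g;   # ',' and '!'
my $blank = () = $s =~ /[[:blank:]]/g;   # the tab and the space

# Strip everything that is not a digit using the negated form.
(my $digits = $s) =~ s/[[:^digit:]]//g;

print "punct: $punct, blank: $blank, digits: $digits\n";
```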

Unicode properties

Unicode defines thousands of properties. The notation is \p{Name} for “has this property” and \P{Name} for “does not have this property”.

/\p{Lu}/;              # any uppercase letter, any script
/\p{Greek}/;           # any character in the Greek script
/\p{Number}/;          # any numeric character
/\P{ASCII}/;           # any non-ASCII character

Single-letter property names can drop the braces: \pL is a letter, \pN a number, \pP punctuation, and so on. \p{L} is the same as \pL. Longer names such as Lu or Greek always need the braces.
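The following sketch counts characters by property in a small mixed-script string. The string is made up for the example and is written with \x{...} escapes so the source file stays ASCII.

```perl
use strict;
use warnings;

# GREEK CAPITAL LETTER ALPHA (U+0391), GREEK SMALL LETTER BETA (U+03B2), 'c', '1'
my $s = "\x{0391}\x{03B2}c1";

my $upper     = () = $s =~ /\p{Lu}/g;     # uppercase letters
my $greek     = () = $s =~ /\p{Greek}/g;  # characters in the Greek script
my $letters   = () = $s =~ /\pL/g;        # letters in any script
my $non_ascii = () = $s =~ /\P{ASCII}/g;  # characters outside ASCII

print "Lu: $upper, Greek: $greek, L: $letters, non-ASCII: $non_ascii\n";
```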

The Unicode chapter covers properties in detail, including the compound form \p{Name=Value} and the \X grapheme cluster.

A useful habit

Named and shorthand classes are almost always clearer than explicit ranges: \d{4}-\d{2}-\d{2} reads at a glance, while [0-9]{4}-[0-9]{2}-[0-9]{2} needs a moment. Use explicit ranges only when you have a concrete reason, usually performance in a hot loop or a deliberate restriction to ASCII.
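If ASCII-only matching is the concern, one option is to keep the shorthand and add /a instead of spelling out ranges. A minimal sketch, assuming Perl 5.14 or later; the name $iso_date_shape is hypothetical, and the pattern checks only the shape of the string, not whether the date is valid.

```perl
use strict;
use warnings;

# Hypothetical pattern name; /a keeps \d limited to 0-9.
my $iso_date_shape = qr/\A\d{4}-\d{2}-\d{2}\z/a;

my $ascii_date = '2024-01-31';
my $mixed_date = "2024-01-3\x{0661}";   # last digit is ARABIC-INDIC DIGIT ONE

print "ascii date ok\n"       if $ascii_date =~ $iso_date_shape;
print "mixed date rejected\n" if $mixed_date !~ $iso_date_shape;
print "mixed date accepted without /a\n"
    if $mixed_date =~ /\A\d{4}-\d{2}-\d{2}\z/;
```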

See also

  • perlre — full class syntax, including extended bracketed character classes (?[ ... ]) with set operations such as intersection (&), union (+), and subtraction (-).

  • m — the match operator.

  • qr — compile a pattern for reuse.