---
name: regex character classes
---

# Character classes

A character class matches exactly one character, chosen from a set you define. Where a literal `a` matches only `a`, the class `[abc]` matches any one of `a`, `b`, or `c`.

```perl
/cat/;                # matches 'cat'
/[bcr]at/;            # matches 'bat', 'cat', or 'rat'
/item[0123456789]/;   # matches 'item0' through 'item9'
```

The class still consumes exactly one character of the string. `[bcr]at` never matches `at` on its own (no letter precedes it), and in `brat` it matches only the `rat` part: the class accounts for one letter, not two.

## Ranges

Inside `[…]`, a dash between two characters denotes a contiguous range in the underlying character set:

```perl
/[0-9]/;          # any ASCII digit
/[a-z]/;          # any ASCII lowercase letter
/[a-zA-Z]/;       # any ASCII letter
/[0-9a-fA-F]/;    # any hex digit
```

Ranges can be combined with individual characters:

```perl
/[0-9bx-z]aa/;    # matches '0aa'..'9aa', 'baa', 'xaa', 'yaa', 'zaa'
```

A dash that is first or last inside the class is literal:

```perl
/[-ab]/;          # matches '-', 'a', or 'b'
/[ab-]/;          # same
```

## Negation

A caret `^` as the first character inside `[…]` inverts the class:

```perl
/[^a]/;           # any character except 'a'
/[^0-9]/;         # any non-digit
```

A caret elsewhere is literal:

```perl
/[a^]/;           # matches 'a' or '^'
```

A negated class still matches *one* character: `[^a]` does not match the empty string; it requires one character that is not `a`.

## Special characters inside a class

Inside `[…]` the special set shrinks to `- ] \ ^ $` (and the pattern delimiter).
The others (`.`, `*`, `+`, `?`, `(`, `)`, `{`, `}`, `|`) are literals in a class:

```perl
/[.+*]/;      # matches a literal '.', '+', or '*'
/[()]/;       # matches '(' or ')'
```

To match `]` inside the class, either escape it or put it first (after any leading `^`):

```perl
/[\]]/;       # matches ']'
/[]ab]/;      # matches ']', 'a', or 'b'
```

`$` and `\` are slightly awkward because they interact with interpolation and escaping:

```perl
my $x = 'bcr';
/[$x]at/;     # matches 'bat', 'cat', or 'rat' (interpolated)
/[\$x]at/;    # matches '$at' or 'xat' ('$' is literal)
/[\\$x]at/;   # matches '\at', 'bat', 'cat', or 'rat' (literal '\' plus interpolated $x)
```

## Shorthand classes

Several common classes have shorthand names usable both inside and outside `[…]`:

| Shorthand | Matches                                                  |
|-----------|----------------------------------------------------------|
| `\d`      | a digit                                                  |
| `\D`      | a non-digit                                              |
| `\w`      | a word character (alphanumeric or `_`)                   |
| `\W`      | a non-word character                                     |
| `\s`      | whitespace (space, tab, `\r`, `\n`, `\f`, more)          |
| `\S`      | non-whitespace                                           |
| `\h`      | horizontal whitespace (space, tab, Unicode spaces)       |
| `\H`      | non-horizontal-whitespace                                |
| `\v`      | vertical whitespace (`\n`, `\r`, `\f`, `\x0B`, more)     |
| `\V`      | non-vertical-whitespace                                  |

Under Unicode (the default since Perl 5.14), `\d`, `\w`, and `\s` match more than just ASCII. `\d` matches any Unicode digit (Devanagari digits, Arabic-Indic digits, and many more), `\w` matches any letter in any script plus marks and connector punctuation, and `\s` adds Unicode space characters such as non-breaking space. To restrict these to ASCII, add the `/a` modifier or use explicit ranges like `[0-9]` and `[A-Za-z_0-9]`.

```perl
"item0" =~ /\w\w\w\w\d/;          # matches
"abc\x{0660}" =~ /\w\w\w\d/;      # matches: U+0660 is an Arabic-Indic zero
"abc\x{0660}" =~ /\w\w\w\d/a;     # does not match under /a
```

## The period

`.` matches any single character except newline.
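
The default behaviour of `.` can be checked with a small sketch. The helper name `dot_hits` is hypothetical (it is not part of the text or of Perl); it collects every character that a bare `.` accepts, so a newline in the input shows up as a missing entry.

```perl
use strict;
use warnings;

# Hypothetical helper: return every character that a bare '.' matches.
# In list context, a /g match with no capture groups returns all matches.
sub dot_hits {
    my ($text) = @_;
    my @hits = $text =~ /./g;
    return @hits;
}

my @hits = dot_hits("a\nb");
print scalar(@hits), " of 3 characters matched\n";   # prints "2 of 3 characters matched"
```

The newline is the only character skipped; a tab or a space would be returned like any other character.
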
Under the `/s` modifier (covered in the modifiers chapter), `.` also matches newline:

```perl
"a\nb" =~ /a.b/;     # does not match
"a\nb" =~ /a.b/s;    # matches
```

`\N` is the reverse case: it matches any character except newline, regardless of `/s`. To match any character including newline without `/s`, use `[\s\S]` or `[\d\D]`, the classic "match anything" idiom:

```perl
"a\nb" =~ /a[\s\S]b/;   # matches without /s
```

## Composing classes

You can mix shorthands, ranges, and individual characters inside one class:

```perl
/[\d\s]/;        # a digit or whitespace
/[A-Z\d_]/;      # uppercase letter, digit, or underscore
/[a-zA-Z\d]/;    # letter or digit (ASCII)
```

De Morgan's law matters: `[^\d\w]` is *not* `[\D\W]`. The first requires the character to be *both* non-digit and non-word; since every digit is a word character, it simplifies to `[^\w]`, i.e. `\W`. The second accepts a character that is non-digit *or* non-word, which simplifies to `\D`. Be careful when combining negated shorthands.

## POSIX classes

POSIX character classes use the form `[:name:]` and only work inside `[…]`, so a complete class reads `[[:name:]]`:

| POSIX         | Matches                        |
|---------------|--------------------------------|
| `[:alpha:]`   | alphabetic                     |
| `[:alnum:]`   | alphanumeric                   |
| `[:digit:]`   | digit (like `\d`)              |
| `[:word:]`    | word char (Perl extension)     |
| `[:space:]`   | whitespace (like `\s`)         |
| `[:upper:]`   | uppercase                      |
| `[:lower:]`   | lowercase                      |
| `[:xdigit:]`  | hex digit                      |
| `[:ascii:]`   | 0x00–0x7F                      |
| `[:cntrl:]`   | control character              |
| `[:graph:]`   | printable, not space           |
| `[:print:]`   | printable, including space     |
| `[:punct:]`   | punctuation                    |
| `[:blank:]`   | space or tab                   |

Negate a POSIX class with `^` *inside* the colons:

```perl
/[[:^digit:]]/;           # same as \D
/[[:alpha:][:digit:]]/;   # letter or digit; for ASCII input, \w minus '_'
```

POSIX classes follow the same Unicode-vs-ASCII rules as the shorthands: without `/a`, `[:alpha:]` is the Unicode alphabetic set.

## Unicode properties

Unicode defines thousands of properties.
The notation is `\p{Name}` for "has this property" and `\P{Name}` for "does not have this property".

```perl
/\p{Lu}/;        # any uppercase letter, any script
/\p{Greek}/;     # any character in the Greek script
/\p{Number}/;    # any numeric character
/\P{ASCII}/;     # any non-ASCII character
```

Single-letter aliases exist for the common general categories, and these may drop the braces: `\pL` is a letter, `\pN` a number, `\pP` punctuation, and so on. `\p{L}` is the same as `\pL`.

The Unicode chapter covers properties in detail, including the compound form `\p{Name=Value}` and the `\X` grapheme cluster.

## A useful habit

Named and shorthand classes are almost always clearer than explicit ranges. `\d{4}-\d{2}-\d{2}` reads at a glance; `[0-9]{4}-[0-9]{2}-[0-9]{2}` takes a moment to parse. Use the ranges only when you have a concrete reason, usually performance in a hot loop or a deliberate restriction to ASCII.

## See also

- [`perlre`](../../p5/core/perlre): full class syntax, including extended bracketed character classes `(?[ … ])`, which support set operations such as `&`, `+`, and `-`.
- [`m`](../../p5/core/perlfunc/m): the match operator.
- [`qr`](../../p5/core/perlfunc/qr): compile a pattern for reuse.