--- name: regex quantifiers --- # Quantifiers A quantifier says "this previous thing, N times." It attaches to a single character, a character class, or a group, and turns a single-character match into a match of many. ```perl /a*/; # zero or more 'a' /\d+/; # one or more digits /colou?r/; # 'color' or 'colour' ``` ## The four basic quantifiers | Quantifier | Meaning | |------------|------------------------------------------| | `a?` | 0 or 1 times | | `a*` | 0 or more times | | `a+` | 1 or more times | | `a{n}` | exactly n times | | `a{n,m}` | at least n, at most m | | `a{n,}` | at least n | | `a{,m}` | at most m | Whitespace inside `{…}` is tolerated but not required. `a{2, 4}` and `a{2,4}` are identical. ```perl /[a-z]+\s+\d*/; # word, spaces, any number of digits /y(es)?/i; # 'y', 'Y', 'yes', or 'YES' $year =~ /^\d{2,4}$/; # 2, 3, or 4 digits $year =~ /^\d{4}$|^\d{2}$/;# better: exactly 2 or exactly 4 ``` The quantifier applies to the atom immediately before it: - `ab?` — `a` then optional `b`. - `(ab)?` — optional `ab`. - `[ab]?` — optional `a` or `b`. Group or class, then quantify. ## Greedy by default Basic quantifiers are *greedy*: they grab as much as the string allows and only give back if the rest of the pattern cannot match otherwise. Consider: ```perl my $x = "the cat in the hat"; $x =~ /^(.*)(cat)(.*)$/; # $1 = 'the ' # $2 = 'cat' # $3 = ' in the hat' ``` Here `.*` in the middle works as you'd expect — it stops at `cat` because `cat` is the only place where the rest of the pattern can match. Now compare: ```perl $x =~ /^(.*)(at)(.*)$/; # $1 = 'the cat in the h' # $2 = 'at' # $3 = '' ``` There are *two* places where `at` occurs — at the end of `cat` and at the end of `hat`. The first `.*` grabs as much as possible while still leaving room for `at` somewhere, so it grabs up to the last `at`. The leftmost quantifier wins the contest. ## Principles of the match With greedy quantifiers, the engine follows four principles in order: 1. **Earliest position wins.** The whole pattern is tried starting at position 0, then 1, then 2, until a match is found. The earliest starting position always wins. 2. **Leftmost alternative wins** (in `a|b|c`). 3. **Greedy quantifiers grab as much as possible** while still letting the rest of the pattern match. 4. **Leftmost greedy quantifier gets priority.** If two `.*` compete for the same characters, the first one takes them. Examples: ```perl my $x = "The programming republic of Perl"; $x =~ /^(.+)(e|r)(.*)$/; # $1 = 'The programming republic of Pe' # $2 = 'r' # $3 = 'l' # .+ is leftmost greedy; grabs everything it can while leaving # room for (e|r) somewhere. $x =~ /(m{1,2})(.*)$/; # $1 = 'mm' -- m{1,2} matches at first 'm' in 'programming', # $2 = 'ing republic of Perl' # takes the maximum 2. $x =~ /.*(m{1,2})(.*)$/; # $1 = 'm' # $2 = 'ing republic of Perl' # .* grabs all the way to the last 'm', leaving only one 'm' for # m{1,2}. ``` ## Non-greedy quantifiers Appending `?` to any quantifier flips it from *grab as much as possible* to *grab as little as possible*. | Non-greedy | Meaning | |------------|----------------------------------------------| | `a??` | 0 or 1, prefer 0 | | `a*?` | 0 or more, prefer fewer | | `a+?` | 1 or more, prefer fewer | | `a{n,m}?` | n to m, prefer n | The overall match still has to succeed, so the engine will *expand* the non-greedy quantifier one step at a time until it does. ```perl my $x = "The programming republic of Perl"; $x =~ /^(.+?)(e|r)(.*)$/; # $1 = 'Th' # $2 = 'e' # $3 = ' programming republic of Perl' # .+? grabs as little as possible while allowing the match. ``` Non-greedy is usually what you want when scanning between delimiters: ```perl my $html = 'bold and italic'; while ($html =~ /<(\w+)>(.*?)<\/\1>/g) { print "$1: $2\n"; } # b: bold # i: italic ``` With greedy `.*` the match would swallow the first `` and stop at the final ``, mangling the capture. ## Possessive quantifiers Adding `+` after a quantifier (not `?`) makes it *possessive*: it is greedy, but it refuses to give anything back on backtracking. Once it matches, those characters are out of play for the rest of the pattern. | Possessive | Meaning | |------------|----------------------------------------------| | `a?+` | 0 or 1, do not give back | | `a*+` | 0 or more, do not give back | | `a++` | 1 or more, do not give back | | `a{n,m}+` | n to m, do not give back | The payoff is speed — on patterns that *should* fail, possessive quantifiers fail *immediately* instead of exhaustively backtracking. ```perl # Ordinary greedy: backtracks once per character on 'abc ' when # the second \w+ cannot match. /^\w+\s+\w+$/; # Possessive: once \w+ claims the word characters, it keeps them. # No backtracking, no quadratic blowup on pathological input. /^\w++\s+\w+$/; ``` A common idiom for matching quoted strings without backtracking catastrophe: ```perl /"(?:[^"\\]++|\\.)*+"/; ``` Each iteration of the inner group either gobbles an unlimited run of non-quote, non-backslash characters (possessively), or a single escape. Neither alternative is willing to give back, so failure is fast. Possessive quantifiers are covered further in the performance chapter. They are not always equivalent to their greedy cousins — if the greedy form *does* need to give back, the possessive form will fail the match. Use them when you can prove nothing later in the pattern depends on those characters. ## Zero-width quantification A quantifier on a zero-width assertion is almost always a mistake. `\b*` is syntactically legal but matches the empty string at any word boundary, which is not useful. The engine rejects some of these explicitly; for the rest, results are usually not what the author meant. ## Choosing the right quantifier - **Greedy** — the default, and right most of the time. - **Non-greedy** — when scanning between a start and end marker that can both appear legitimately multiple times. - **Possessive** — when the match either works in exactly one way or fails, and speed on failure matters. When in doubt, write it greedy and add a test that fails if you got it wrong. ## See also - [`perlre`](../../p5/core/perlre) — complete syntax, including the `{n,m}+` corner cases. - The [performance](performance) chapter — nested quantifiers and catastrophic backtracking.