Quantifiers#
A quantifier says “this previous thing, N times.” It attaches to a single character, a character class, or a group, and turns a single-character match into a match of many.
/a*/; # zero or more 'a'
/\d+/; # one or more digits
/colou?r/; # 'color' or 'colour'
The basic quantifiers#
Quantifier | Meaning |
|---|---|
| 0 or 1 times |
| 0 or more times |
| 1 or more times |
| exactly n times |
| at least n, at most m |
| at least n |
| at most m |
*, +, and ? are exact synonyms for {0,}, {1,}, and {0,1}. The brace forms are mostly used when n and m are specific small numbers; the bare forms read more naturally for “any number of these”.
The upper bound n and m cannot exceed an engine-built constant (typically 65534 — the actual limit appears in the error if you exceed it). For “any sane number” this is plenty; patterns hitting the limit are usually generated, not handwritten.
Whitespace inside {…} is tolerated but not required. a{2, 4} and a{2,4} are identical.
/[a-z]+\s+\d*/; # word, spaces, any number of digits
/y(es)?/i; # 'y', 'Y', 'yes', or 'YES'
$year =~ /^\d{2,4}$/; # 2, 3, or 4 digits
$year =~ /^\d{4}$|^\d{2}$/;# better: exactly 2 or exactly 4
The quantifier applies to the atom immediately before it:
ab?—athen optionalb.(ab)?— optionalab.[ab]?— optionalaorb.
Group or class, then quantify.
Greedy by default#
Basic quantifiers are greedy: they grab as much as the string allows and only give back if the rest of the pattern cannot match otherwise. Consider:
my $x = "the cat in the hat";
$x =~ /^(.*)(cat)(.*)$/;
# $1 = 'the '
# $2 = 'cat'
# $3 = ' in the hat'
Here .* in the middle works as you’d expect — it stops at cat because cat is the only place where the rest of the pattern can match. Now compare:
$x =~ /^(.*)(at)(.*)$/;
# $1 = 'the cat in the h'
# $2 = 'at'
# $3 = ''
There are two places where at occurs — at the end of cat and at the end of hat. The first .* grabs as much as possible while still leaving room for at somewhere, so it grabs up to the last at.
Greedy gives back only what it must#
"about 24 characters long" =~ /^.*([0-9]+)/;
# $1 = '4'
The greedy .* first matched the whole string. To make [0-9]+ match, it had to give back enough characters to expose at least one digit. It gave back exactly one, leaving 4 to capture. The intuition that .* “stops at the digits” is wrong — it grabs everything, then the engine asks how little it can return.
Greediness vs. matching#
The engine does not always match what the greedy interpretation suggests:
"27.625" =~ /(\d\d)([1-9]?)(\d+)/;
# $1 = '62'
# $2 = ''
# $3 = '5'
The first try fails: \d\d can match 27, but the next character is ., so neither branch of [1-9]? lets \d+ proceed. Earliest-position-wins keeps moving the start until \d\d lands on 62. There, [1-9]? greedily tried 5, but then \d+ had no remaining digit — so it gave the 5 back and went empty, leaving 5 for \d+. Optional quantifiers prefer the longer alternative (taking the character) but will give it up if necessary; the ordering rule “earliest position wins, then longest at that position” only constrains the whole match, not its individual captures.
Principles of the match#
With greedy quantifiers, the engine follows four principles in order:
Earliest position wins. The whole pattern is tried starting at position 0, then 1, then 2, until a match is found. The earliest starting position always wins.
Leftmost alternative wins (in
a|b|c).Greedy quantifiers grab as much as possible while still letting the rest of the pattern match.
Leftmost greedy quantifier gets priority. If two
.*compete for the same characters, the first one takes them.
Examples:
my $x = "The programming republic of Perl";
$x =~ /^(.+)(e|r)(.*)$/;
# $1 = 'The programming republic of Pe'
# $2 = 'r'
# $3 = 'l'
# .+ is leftmost greedy; grabs everything it can while leaving
# room for (e|r) somewhere.
$x =~ /(m{1,2})(.*)$/;
# $1 = 'mm' -- m{1,2} matches at first 'm' in 'programming',
# $2 = 'ing republic of Perl'
# takes the maximum 2.
$x =~ /.*(m{1,2})(.*)$/;
# $1 = 'm'
# $2 = 'ing republic of Perl'
# .* grabs all the way to the last 'm', leaving only one 'm' for
# m{1,2}.
Non-greedy quantifiers#
Appending ? to any quantifier flips it from grab as much as possible to grab as little as possible.
Non-greedy | Meaning |
|---|---|
| 0 or 1, prefer 0 |
| 0 or more, prefer fewer |
| 1 or more, prefer fewer |
| n to m, prefer n |
The overall match still has to succeed, so the engine will expand the non-greedy quantifier one step at a time until it does.
my $x = "The programming republic of Perl";
$x =~ /^(.+?)(e|r)(.*)$/;
# $1 = 'Th'
# $2 = 'e'
# $3 = ' programming republic of Perl'
# .+? grabs as little as possible while allowing the match.
Non-greedy is usually what you want when scanning between delimiters:
my $html = '<b>bold</b> and <i>italic</i>';
while ($html =~ /<(\w+)>(.*?)<\/\1>/g) {
print "$1: $2\n";
}
# b: bold
# i: italic
With greedy .* the match would swallow the first </b> and stop at the final </i>, mangling the capture.
When non-greedy loses to a negated class#
A common idiom: <.*?> to match a tag. A common bug: the same pattern, applied to broken input, expands the non-greedy quantifier across content it shouldn’t. The cleaner form uses a negated character class:
"<a> body </a>" =~ /<.+?>/; # matches '<a>' — fine
"<a> <b>foo" =~ /<.+?>foo/; # matches '<a> <b>foo' — wrong
"<a> <b>foo" =~ /<[^>]+>foo/; # matches '<b>foo' — locked to one tag
When a token after > is required, the non-greedy form expands under backtracking pressure: the engine first tried <a>, found no foo, and forced .+? to grow across the > and beyond. The negated class cannot give ground that way — [^>] is a hard barrier, so the engine bumps the start position to <b> instead of swallowing intervening >s.
A negated class is also faster: it participates in the engine’s simple-repetition fast path, which the non-greedy form disables.
Use .+? only when no negated class can express the same requirement. Use [^X]+ whenever the delimiter is a single character.
Possessive quantifiers#
Adding + after a quantifier (not ?) makes it possessive: greedy, but it refuses to give anything back on backtracking. Once it matches, those characters are out of play for the rest of the pattern.
Possessive | Meaning |
|---|---|
| 0 or 1, do not give back |
| 0 or more, do not give back |
| 1 or more, do not give back |
| n to m, do not give back |
The payoff is speed — on patterns that should fail, possessive quantifiers fail immediately instead of exhaustively backtracking.
# Ordinary greedy: backtracks once per character on 'abc ' when
# the second \w+ cannot match.
/^\w+\s+\w+$/;
# Possessive: once \w+ claims the word characters, it keeps them.
# No backtracking, no quadratic blowup on pathological input.
/^\w++\s+\w+$/;
The classic 'aaaa' =~ /a++a/:
'aaaa' =~ /a++a/; # never matches
a++ gobbles all four a’s. The trailing a then has nothing to match. Because a++ refuses to give back, the engine fails immediately. (The same pattern with greedy a+a matches 'aaaa' after backtracking.)
A common idiom for matching quoted strings without backtracking catastrophe:
/"(?:[^"\\]++|\\.)*+"/;
Each iteration of the inner group either gobbles an unlimited run of non-quote, non-backslash characters (possessively), or a single escape. Neither alternative is willing to give back, so failure is fast. The full treatment of this pattern lives in the performance chapter.
Possessive equals atomic-group sugar#
Possessive quantifiers are exact syntactic sugar for an atomic group around the quantified atom. The following are equivalent:
Possessive form | Atomic-group form |
|---|---|
|
|
|
|
|
|
|
|
The possessive notation is shorter; the atomic-group notation generalises further (you can wrap arbitrary alternations, not just quantified atoms). The groups and captures chapter covers atomic groups in detail.
Illegal possessive-and-non-greedy combinations#
Possessive and non-greedy are contradictory; Perl rejects the combinations. There is an equivalent legal form for each:
Illegal | Legal equivalent |
|---|---|
|
|
|
|
|
|
If you find yourself wanting “non-greedy and possessive”, what you actually want is a fixed count.
Choosing the right quantifier#
Greedy — the default, and right most of the time.
Non-greedy — when scanning between a start and end marker that can both appear legitimately multiple times.
Possessive — when the match either works in exactly one way or fails, and speed on failure matters.
When in doubt, write it greedy and add a test that fails if you got it wrong.
Zero-width quantification#
A quantifier on a zero-width assertion is almost always a mistake. \b* is syntactically legal but matches the empty string at any word boundary, which is not useful. The engine rejects some of these explicitly with a “matches null string many times” warning under use warnings; for the rest, the results are usually not what the author meant.
The same caveat applies to quantified groups whose inner content can match nothing. If (\w*)* reaches an iteration where \w* matched zero characters, the outer * would loop forever. Perl breaks this loop automatically — see the performance chapter on zero-length-match termination.
Worked examples from perlre’s backtracking section#
Eight variants of “find the digits at the end of a string”. Each fails or succeeds for a different reason; reading them as a group is a fast way to internalise greedy and non-greedy behaviour.
$_ = "I have 2 numbers: 53147";
# Pattern $1 captured $2 captured
/(.*)(\d*)/ => "I have 2 numbers: 53147" ""
/(.*)(\d+)/ => "I have 2 numbers: 5314" "7"
/(.*?)(\d*)/ => "" ""
/(.*?)(\d+)/ => "I have " "2"
/(.*)(\d+)$/ => "I have 2 numbers: 5314" "7"
/(.*?)(\d+)$/ => "I have 2 numbers: " "53147"
/(.*)\b(\d+)$/ => "I have 2 numbers: " "53147"
/(.*\D)(\d+)$/ => "I have 2 numbers: " "53147"
Reading each:
(.*)(\d*)— greedy.*eats everything;\d*is satisfied by the empty string. Matches but neither capture is what the author wanted.(.*)(\d+)—.*gives back enough to expose at least one digit, but only the last digit of the run.(.*?)(\d*)— both non-greedy: prefer the empty match; both captures are empty.(.*?)(\d+)— first non-greedy, second greedy..*?expands until\d+can grab2. Stops at the first digit run.(.*)(\d+)$— anchored:\d+must be at end. Greedy.*gives back digits until the last one is at end-of-string; result is the same “last-digit only” oddity.(.*?)(\d+)$— non-greedy and anchored:.*?expands until\d+$can grab the whole trailing digit run. Correct.(.*)\b(\d+)$— anchored, with\bto force the boundary before the digit run. Equivalent.(.*\D)(\d+)$— explicit non-digit before the digit run. Equivalent.
The lessons:
The quantifier (
.*vs.*?) controls greediness, but anchoring ($,\b,\D) is what controls which run is matched.Greedy is rarely “wrong” by itself; the wrongness is a missing anchor.
Cross-engine notes#
Quantifier syntax is one of the most-divergent surfaces between regex engines. The full table is in the cross-engine chapter; the headline:
POSIX BRE treats bare
+and?as literal characters. Use\+and\?for the metacharacter forms.POSIX ERE has
+,?,{m,n}as metacharacters (Perl-shaped) but no possessive quantifiers and no atomic groups.RE2 / Go has
+,?,*,*?, but no possessive quantifiers, no atomic groups, and no backreferences. Linear- time guarantee replaces the need for them.
See also#
The performance chapter — catastrophic backtracking, unrolling the loop, and where possessive quantifiers earn their keep.
The groups and captures chapter — atomic groups
(?>…), the general form of “no backtracking inside this construct”.The character classes chapter —
[^X]+as a faster, safer alternative to.*?.The cross-engine chapter — quantifier availability across families.