Quantifiers#

A quantifier says “this previous thing, N times.” It attaches to a single character, a character class, or a group, and turns a single-character match into a match of many.

/a*/;       # zero or more 'a'
/\d+/;      # one or more digits
/colou?r/;  # 'color' or 'colour'

The basic quantifiers#

Quantifier	Meaning
`a?`	0 or 1 times
`a*`	0 or more times
`a+`	1 or more times
`a{n}`	exactly n times
`a{n,m}`	at least n, at most m
`a{n,}`	at least n
`a{,m}`	at most m

*, +, and ? are exact synonyms for {0,}, {1,}, and {0,1}. The brace forms are mostly used when n and m are specific small numbers; the bare forms read more naturally for “any number of these”.

The upper bound n and m cannot exceed an engine-built constant (typically 65534 - the actual limit appears in the error if you exceed it). For “any sane number” this is plenty; patterns hitting the limit are usually generated, not handwritten.

Whitespace inside {…} is tolerated but not required. a{2, 4} and a{2,4} are identical.

/[a-z]+\s+\d*/;            # word, spaces, any number of digits
/y(es)?/i;                 # 'y', 'Y', 'yes', or 'YES'
$year =~ /^\d{2,4}$/;      # 2, 3, or 4 digits
$year =~ /^\d{4}$|^\d{2}$/;# better: exactly 2 or exactly 4

The quantifier applies to the atom immediately before it:

ab? - a then optional b.
(ab)? - optional ab.
[ab]? - optional a or b.

Group or class, then quantify.

Greedy by default#

Basic quantifiers are greedy: they grab as much as the string allows and only give back if the rest of the pattern cannot match otherwise. Consider:

my $x = "the cat in the hat";
$x =~ /^(.*)(cat)(.*)$/;
# $1 = 'the '
# $2 = 'cat'
# $3 = ' in the hat'

Here .* in the middle works as you’d expect - it stops at cat because cat is the only place where the rest of the pattern can match. Now compare:

$x =~ /^(.*)(at)(.*)$/;
# $1 = 'the cat in the h'
# $2 = 'at'
# $3 = ''

There are two places where at occurs - at the end of cat and at the end of hat. The first .* grabs as much as possible while still leaving room for at somewhere, so it grabs up to the last at.

Greedy gives back only what it must#

"about 24 characters long" =~ /^.*([0-9]+)/;
# $1 = '4'

The greedy .* first matched the whole string. To make [0-9]+ match, it had to give back enough characters to expose at least one digit. It gave back exactly one, leaving 4 to capture. The intuition that .* “stops at the digits” is wrong - it grabs everything, then the engine asks how little it can return.

Greediness vs. matching#

The engine does not always match what the greedy interpretation suggests:

"27.625" =~ /(\d\d)([1-9]?)(\d+)/;
# $1 = '62'
# $2 = ''
# $3 = '5'

The first try fails: \d\d can match 27, but the next character is ., so neither branch of [1-9]? lets \d+ proceed. Earliest-position-wins keeps moving the start until \d\d lands on 62. There, [1-9]? greedily tried 5, but then \d+ had no remaining digit - so it gave the 5 back and went empty, leaving 5 for \d+. Optional quantifiers prefer the longer alternative (taking the character) but will give it up if necessary; the ordering rule “earliest position wins, then longest at that position” only constrains the whole match, not its individual captures.

Principles of the match#

With greedy quantifiers, the engine follows four principles in order:

Earliest position wins. The whole pattern is tried starting at position 0, then 1, then 2, until a match is found. The earliest starting position always wins.
Leftmost alternative wins (in a|b|c).
Greedy quantifiers grab as much as possible while still letting the rest of the pattern match.
Leftmost greedy quantifier gets priority. If two .* compete for the same characters, the first one takes them.

Examples:

my $x = "The programming republic of Perl";

$x =~ /^(.+)(e|r)(.*)$/;
# $1 = 'The programming republic of Pe'
# $2 = 'r'
# $3 = 'l'
# .+ is leftmost greedy; grabs everything it can while leaving
# room for (e|r) somewhere.

$x =~ /(m{1,2})(.*)$/;
# $1 = 'mm'    -- m{1,2} matches at first 'm' in 'programming',
# $2 = 'ing republic of Perl'
#                takes the maximum 2.

$x =~ /.*(m{1,2})(.*)$/;
# $1 = 'm'
# $2 = 'ing republic of Perl'
# .* grabs all the way to the last 'm', leaving only one 'm' for
# m{1,2}.

Non-greedy quantifiers#

Appending ? to any quantifier flips it from grab as much as possible to grab as little as possible.

Non-greedy	Meaning
`a??`	0 or 1, prefer 0
`a*?`	0 or more, prefer fewer
`a+?`	1 or more, prefer fewer
`a{n,m}?`	n to m, prefer n

The overall match still has to succeed, so the engine will expand the non-greedy quantifier one step at a time until it does.

my $x = "The programming republic of Perl";

$x =~ /^(.+?)(e|r)(.*)$/;
# $1 = 'Th'
# $2 = 'e'
# $3 = ' programming republic of Perl'
# .+? grabs as little as possible while allowing the match.

Non-greedy is usually what you want when scanning between delimiters:

my $html = '<b>bold</b> and <i>italic</i>';
while ($html =~ /<(\w+)>(.*?)<\/\1>/g) {
    print "$1: $2\n";
}
# b: bold
# i: italic

With greedy .* the match would swallow the first </b> and stop at the final </i>, mangling the capture.

When non-greedy loses to a negated class#

A common idiom: <.*?> to match a tag. A common bug: the same pattern, applied to broken input, expands the non-greedy quantifier across content it shouldn’t. The cleaner form uses a negated character class:

"<a> body </a>"  =~ /<.+?>/;        # matches '<a>'  - fine
"<a> <b>foo"     =~ /<.+?>foo/;     # matches '<a> <b>foo' - wrong
"<a> <b>foo"     =~ /<[^>]+>foo/;   # matches '<b>foo' - locked to one tag

When a token after > is required, the non-greedy form expands under backtracking pressure: the engine first tried <a>, found no foo, and forced .+? to grow across the > and beyond. The negated class cannot give ground that way - [^>] is a hard barrier, so the engine bumps the start position to <b> instead of swallowing intervening >s.

A negated class is also faster: it participates in the engine’s simple-repetition fast path, which the non-greedy form disables.

Use .+? only when no negated class can express the same requirement. Use [^X]+ whenever the delimiter is a single character.

Possessive quantifiers#

Adding + after a quantifier (not ?) makes it possessive: greedy, but it refuses to give anything back on backtracking. Once it matches, those characters are out of play for the rest of the pattern.

Possessive	Meaning
`a?+`	0 or 1, do not give back
`a*+`	0 or more, do not give back
`a++`	1 or more, do not give back
`a{n,m}+`	n to m, do not give back

The payoff is speed - on patterns that should fail, possessive quantifiers fail immediately instead of exhaustively backtracking.

# Ordinary greedy: backtracks once per character on 'abc   ' when
# the second \w+ cannot match.
/^\w+\s+\w+$/;

# Possessive: once \w+ claims the word characters, it keeps them.
# No backtracking, no quadratic blowup on pathological input.
/^\w++\s+\w+$/;

The classic 'aaaa' =~ /a++a/:

'aaaa' =~ /a++a/;   # never matches

a++ gobbles all four a’s. The trailing a then has nothing to match. Because a++ refuses to give back, the engine fails immediately. (The same pattern with greedy a+a matches 'aaaa' after backtracking.)

A common idiom for matching quoted strings without backtracking catastrophe:

/"(?:[^"\\]++|\\.)*+"/;

Each iteration of the inner group either gobbles an unlimited run of non-quote, non-backslash characters (possessively), or a single escape. Neither alternative is willing to give back, so failure is fast. The full treatment of this pattern lives in the performance chapter.

Possessive equals atomic-group sugar#

Possessive quantifiers are exact syntactic sugar for an atomic group around the quantified atom. The following are equivalent:

Possessive form	Atomic-group form
`PAT*+`	`(?>PAT*)`
`PAT++`	`(?>PAT+)`
`PAT?+`	`(?>PAT?)`
`PAT{n,m}+`	`(?>PAT{n,m})`

The possessive notation is shorter; the atomic-group notation generalises further (you can wrap arbitrary alternations, not just quantified atoms). The groups and captures chapter covers atomic groups in detail.

Illegal possessive-and-non-greedy combinations#

Possessive and non-greedy are contradictory; Perl rejects the combinations. There is an equivalent legal form for each:

Illegal	Legal equivalent
`X??+`	`X{0}`
`X+?+`	`X{1}`
`X{n,m}?+`	`X{n}`

If you find yourself wanting “non-greedy and possessive”, what you actually want is a fixed count.

Choosing the right quantifier#

Greedy - the default, and right most of the time.
Non-greedy - when scanning between a start and end marker that can both appear legitimately multiple times.
Possessive - when the match either works in exactly one way or fails, and speed on failure matters.

When in doubt, write it greedy and add a test that fails if you got it wrong.

Zero-width quantification#

A quantifier on a zero-width assertion is almost always a mistake. \b* is syntactically legal but matches the empty string at any word boundary, which is not useful. The engine rejects some of these explicitly with a “matches null string many times” warning under use warnings; for the rest, the results are usually not what the author meant.

The same caveat applies to quantified groups whose inner content can match nothing. If (\w*)* reaches an iteration where \w* matched zero characters, the outer * would loop forever. Perl breaks this loop automatically - see the performance chapter on zero-length-match termination.

Worked examples from `perlre`’s backtracking section#

Eight variants of “find the digits at the end of a string”. Each fails or succeeds for a different reason; reading them as a group is a fast way to internalise greedy and non-greedy behaviour.

$_ = "I have 2 numbers: 53147";

# Pattern              $1 captured              $2 captured
/(.*)(\d*)/         => "I have 2 numbers: 53147"  ""
/(.*)(\d+)/         => "I have 2 numbers: 5314"   "7"
/(.*?)(\d*)/        => ""                          ""
/(.*?)(\d+)/        => "I have "                   "2"
/(.*)(\d+)$/        => "I have 2 numbers: 5314"   "7"
/(.*?)(\d+)$/       => "I have 2 numbers: "       "53147"
/(.*)\b(\d+)$/      => "I have 2 numbers: "       "53147"
/(.*\D)(\d+)$/      => "I have 2 numbers: "       "53147"

Reading each:

(.*)(\d*) - greedy .* eats everything; \d* is satisfied by the empty string. Matches but neither capture is what the author wanted.
(.*)(\d+) - .* gives back enough to expose at least one digit, but only the last digit of the run.
(.*?)(\d*) - both non-greedy: prefer the empty match; both captures are empty.
(.*?)(\d+) - first non-greedy, second greedy. .*? expands until \d+ can grab 2. Stops at the first digit run.
(.*)(\d+)$ - anchored: \d+ must be at end. Greedy .* gives back digits until the last one is at end-of-string; result is the same “last-digit only” oddity.
(.*?)(\d+)$ - non-greedy and anchored: .*? expands until \d+$ can grab the whole trailing digit run. Correct.
(.*)\b(\d+)$ - anchored, with \b to force the boundary before the digit run. Equivalent.
(.*\D)(\d+)$ - explicit non-digit before the digit run. Equivalent.

The lessons:

The quantifier (.* vs .*?) controls greediness, but anchoring ($, \b, \D) is what controls which run is matched.
Greedy is rarely “wrong” by itself; the wrongness is a missing anchor.

Cross-engine notes#

Quantifier syntax is one of the most-divergent surfaces between regex engines. The full table is in the cross-engine chapter; the headline:

POSIX BRE treats bare + and ? as literal characters. Use \+ and \? for the metacharacter forms.
POSIX ERE has +, ?, {m,n} as metacharacters (Perl-shaped) but no possessive quantifiers and no atomic groups.
RE2 / Go has +, ?, *, *?, but no possessive quantifiers, no atomic groups, and no backreferences. Linear- time guarantee replaces the need for them.