Quantifiers#
A quantifier says “this previous thing, N times.” It attaches to a single character, a character class, or a group, and turns a single-character match into a match of many.
/a*/; # zero or more 'a'
/\d+/; # one or more digits
/colou?r/; # 'color' or 'colour'
The four basic quantifiers#
| Quantifier | Meaning |
|---|---|
| ? | 0 or 1 times |
| * | 0 or more times |
| + | 1 or more times |
| {n} | exactly n times |
| {n,m} | at least n, at most m |
| {n,} | at least n |
| {,m} | at most m |
Whitespace inside {…} is tolerated but not required. a{2, 4} and
a{2,4} are identical.
/[a-z]+\s+\d*/; # word, spaces, any number of digits
/y(es)?/i; # 'y', 'Y', 'yes', or 'YES'
$year =~ /^\d{2,4}$/; # 2, 3, or 4 digits
$year =~ /^\d{4}$|^\d{2}$/; # better: exactly 2 or exactly 4
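The difference between the two year patterns is easiest to see on a few inputs. This is a minimal sketch; the candidate strings are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample inputs, not taken from any real data set.
my @candidates = ('7', '42', '123', '2024', '12345');

# Range form: accepts 2, 3, or 4 digits.
my @range_ok = grep { /^\d{2,4}$/ } @candidates;

# Alternation form: accepts exactly 2 or exactly 4 digits.
my @strict_ok = grep { /^\d{4}$|^\d{2}$/ } @candidates;

print "range:  @range_ok\n";    # range:  42 123 2024
print "strict: @strict_ok\n";   # strict: 42 2024
```

Only the three-digit input separates the two patterns, which is the case the alternation form is meant to exclude.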
The quantifier applies to the atom immediately before it:
ab? matches a, then an optional b.
(ab)? matches an optional ab.
[ab]? matches an optional a or b.
Group or class, then quantify.
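To check which atom a quantifier binds to, one option is to run the three patterns against the same small set of inputs. The sketch below anchors each pattern so the whole string must match; the labels and the %accepted hash are illustrative names only.

```perl
use strict;
use warnings;

my @inputs = ('', 'a', 'b', 'ab');

# Label and anchored pattern for each form discussed above.
my @cases = (
    [ 'ab?',   qr/^ab?$/   ],   # 'a', then an optional 'b'
    [ '(ab)?', qr/^(ab)?$/ ],   # the pair 'ab', zero or one time
    [ '[ab]?', qr/^[ab]?$/ ],   # one of 'a' or 'b', zero or one time
);

my %accepted;
for my $case (@cases) {
    my ($label, $re) = @$case;
    $accepted{$label} = [ grep { $_ =~ $re } @inputs ];
    print "$label accepts: ",
        join(', ', map { "'$_'" } @{ $accepted{$label} }), "\n";
}
# ab? accepts: 'a', 'ab'
# (ab)? accepts: '', 'ab'
# [ab]? accepts: '', 'a', 'b'
```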
Greedy by default#
Basic quantifiers are greedy: they grab as much as the string allows and only give back if the rest of the pattern cannot match otherwise. Consider:
my $x = "the cat in the hat";
$x =~ /^(.*)(cat)(.*)$/;
# $1 = 'the '
# $2 = 'cat'
# $3 = ' in the hat'
Here the first .* behaves as you’d expect: it stops just before cat,
because that is the only place in the string where cat, and therefore
the rest of the pattern, can match. Now compare:
$x =~ /^(.*)(at)(.*)$/;
# $1 = 'the cat in the h'
# $2 = 'at'
# $3 = ''
There are two places where at occurs — at the end of cat and
at the end of hat. The first .* grabs as much as possible while
still leaving room for at somewhere, so it grabs up to the last
at. The leftmost quantifier wins the contest.
Principles of the match#
With greedy quantifiers, the engine follows four principles in order:
1. Earliest position wins. The whole pattern is tried starting at position 0, then 1, then 2, until a match is found. The earliest starting position always wins.
2. Leftmost alternative wins (in a|b|c).
3. Greedy quantifiers grab as much as possible while still letting the rest of the pattern match.
4. Leftmost greedy quantifier gets priority. If two .* compete for the same characters, the first one takes them.
Examples:
my $x = "The programming republic of Perl";
$x =~ /^(.+)(e|r)(.*)$/;
# $1 = 'The programming republic of Pe'
# $2 = 'r'
# $3 = 'l'
# .+ is leftmost greedy; grabs everything it can while leaving
# room for (e|r) somewhere.
$x =~ /(m{1,2})(.*)$/;
# $1 = 'mm'
# $2 = 'ing republic of Perl'
# m{1,2} matches at the first 'm' in 'programming' and takes the
# maximum of 2.
$x =~ /.*(m{1,2})(.*)$/;
# $1 = 'm'
# $2 = 'ing republic of Perl'
# .* grabs all the way to the last 'm', leaving only one 'm' for
# m{1,2}.
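The examples above exercise the two quantifier principles. The first two principles, earliest position and leftmost alternative, can be sketched the same way; the input string here is a hypothetical one picked for illustration.

```perl
use strict;
use warnings;

my $text = 'foobar';

# Earliest position: 'bar' is listed first, but 'foo' starts at
# position 0, so the earlier starting position decides.
my ($earliest) = $text =~ /(bar|foo)/;

# Leftmost alternative: both alternatives can match at position 0;
# the one written first is taken, even though the other is longer.
my ($leftmost) = $text =~ /(fo|foo)/;

print "$earliest\n";   # foo
print "$leftmost\n";   # fo
```

The second result shows that alternation does not look for the longest match; ordering the alternatives is the author's job.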
Non-greedy quantifiers#
Appending ? to any quantifier flips it from grab as much as
possible to grab as little as possible.
| Non-greedy | Meaning |
|---|---|
| ?? | 0 or 1, prefer 0 |
| *? | 0 or more, prefer fewer |
| +? | 1 or more, prefer fewer |
| {n,m}? | n to m, prefer n |
The overall match still has to succeed, so the engine will expand the non-greedy quantifier one step at a time until it does.
my $x = "The programming republic of Perl";
$x =~ /^(.+?)(e|r)(.*)$/;
# $1 = 'Th'
# $2 = 'e'
# $3 = ' programming republic of Perl'
# .+? grabs as little as possible while allowing the match.
Non-greedy is usually what you want when scanning between delimiters:
my $html = '<b>bold</b> and <i>italic</i>';
while ($html =~ /<(\w+)>(.*?)<\/\1>/g) {
print "$1: $2\n";
}
# b: bold
# i: italic
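With greedy .* the capture runs as far as it can, to the last closing
tag that satisfies \1. This input happens to be safe because each tag
name occurs only once, but on '<b>one</b> and <b>two</b>' a greedy .*
would capture one</b> and <b>two.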
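A small sketch of the difference between the greedy and non-greedy forms, using a hypothetical input in which the same tag occurs twice:

```perl
use strict;
use warnings;

# Hypothetical input: the same tag name appears twice.
my $html = '<b>one</b> and <b>two</b>';

# Greedy: .* runs to the end, then backs off to the LAST </b>.
my ($greedy) = $html =~ /<b>(.*)<\/b>/;

# Non-greedy: .*? expands one character at a time to the FIRST </b>.
my ($lazy) = $html =~ /<b>(.*?)<\/b>/;

print "greedy: $greedy\n";   # greedy: one</b> and <b>two
print "lazy:   $lazy\n";     # lazy:   one
```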
Possessive quantifiers#
Adding + after a quantifier (not ?) makes it possessive: it is
greedy, but it refuses to give anything back on backtracking. Once
it matches, those characters are out of play for the rest of the
pattern.
| Possessive | Meaning |
|---|---|
| ?+ | 0 or 1, do not give back |
| *+ | 0 or more, do not give back |
| ++ | 1 or more, do not give back |
| {n,m}+ | n to m, do not give back |
The payoff is speed — on patterns that should fail, possessive quantifiers fail immediately instead of exhaustively backtracking.
# Ordinary greedy: backtracks once per character on 'abc ' when
# the second \w+ cannot match.
/^\w+\s+\w+$/;
# Possessive: once \w+ claims the word characters, it keeps them.
# No backtracking, no quadratic blowup on pathological input.
/^\w++\s+\w+$/;
A common idiom for matching quoted strings without backtracking catastrophe:
/"(?:[^"\\]++|\\.)*+"/;
Each iteration of the inner group either gobbles an unlimited run of non-quote, non-backslash characters (possessively), or a single escape. Neither alternative is willing to give back, so failure is fast.
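As a usage sketch, the idiom can be wrapped in a capture and applied with /g to pull quoted strings out of a line. The input below is hypothetical: one well-formed string containing escaped quotes, followed by an unterminated one.

```perl
use strict;
use warnings;

# The quoted-string idiom, compiled once for reuse.
my $quoted = qr/"(?:[^"\\]++|\\.)*+"/;

# Hypothetical input; q{} keeps the backslashes literal.
my $line = q{say "a \"quoted\" word" then "oops};

# List-context /g returns every captured string.
my @strings = $line =~ /($quoted)/g;

print "$_\n" for @strings;   # "a \"quoted\" word"
```

The unterminated "oops is not reported, because the pattern requires a closing quote and the possessive quantifiers do not retry shorter runs to find one.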
Possessive quantifiers are covered further in the performance chapter. They are not always equivalent to their greedy cousins — if the greedy form does need to give back, the possessive form will fail the match. Use them when you can prove nothing later in the pattern depends on those characters.
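A minimal sketch of a case where the two forms disagree. The pattern needs one a left over for the final atom, which the greedy form supplies by backtracking and the possessive form does not:

```perl
use strict;
use warnings;

my $s = 'aaa';

# Greedy: a+ first takes all three, then gives one back for the final 'a'.
my $greedy = $s =~ /^a+a$/ ? 1 : 0;

# Possessive: a++ takes all three and keeps them, so the final 'a' fails.
my $possessive = $s =~ /^a++a$/ ? 1 : 0;

print "greedy matched:     $greedy\n";       # greedy matched:     1
print "possessive matched: $possessive\n";   # possessive matched: 0
```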
Zero-width quantification#
A quantifier on a zero-width assertion is almost always a mistake.
\b* is syntactically legal, but because zero repetitions always
succeed it matches the empty string at every position, word boundary
or not, which is not useful. The engine rejects some of these
explicitly; for the rest, results are usually not what the author
meant.
Choosing the right quantifier#
Greedy — the default, and right most of the time.
Non-greedy — when scanning between a start and end marker that can both appear legitimately multiple times.
Possessive — when the match either works in exactly one way or fails, and speed on failure matters.
When in doubt, write it greedy and add a test that fails if you got it wrong.
See also#
perlre: complete syntax, including the {n,m}+ corner cases.
The performance chapter: nested quantifiers and catastrophic backtracking.