Performance#

Most regexps run in roughly linear time over the string. A well-written pattern that matches is fast; a well-written pattern that fails is also fast. The trouble comes from patterns that could have matched in many different ways - the engine explores those possibilities one after another, and pathological patterns can explore exponentially many.

This chapter is about recognising that trouble, avoiding it, and when you cannot avoid it, damping it with possessive quantifiers, atomic groups, and careful structure. The second half is about the machinery the engine quietly applies on your behalf, and how to write patterns that let those optimisations fire.

Perl uses a Traditional NFA - an engine that walks the pattern under the direction of the input rather than the other way around, saving its place at every choice point so it can resume after a dead end. That property is what makes captures, backreferences, and lookaround possible; it is also what makes the worst case exponential. Everything in this chapter is a corollary of that one fact.

How backtracking works#

A regexp engine walks the string and the pattern in lockstep. When it reaches a choice - an alternation, a quantifier, an optional group - it makes a decision, remembers the choice on a stack, and continues. If a later position in the pattern fails, it pops the most recent choice, tries the next available decision, and resumes. This is backtracking.

Every ?, *, +, {…}, and | is a potential backtrack point. A short string and few choice points: fast. A long string, nested choice points, and no match: exponential.

A worked example from perlre. To pull the word following foo out of Food is on the foo table.:

$_ = "Food is on the foo table.";
if (/\b(foo)\s+(\w+)/i) {
    print "$2 follows $1.\n";
}
# table follows foo.

The engine first matches \b(foo) against Food, loading $1 with Foo. It then looks for \s+ and finds d - failure. It backtracks, advances the start position, and continues searching until it finds the literal foo followed by whitespace, then a word.

A more revealing case. The pattern says «everything between foo and bar»:

$_ = "The food is under the bar in the barn.";
if (/foo(.*)bar/) {
    print "got <$1>\n";
}
# got <d is under the bar in the >

Greedy .* ran all the way to the end of the string, then crawled back left until bar could finally match - at the last bar, not the first. Use .*? to stop at the first.

perlre has a longer version of this with eight variants - most wrong, all instructive - under Backtracking. The lesson is the same in each: a greedy quantifier always overshoots first; the correct shape (which anchors? non-greedy or negated class?) is what fixes the match, not adjusting the quantifier in place.

The state-saving cost#

Friedl’s mechanical model is worth keeping in mind. For an NFA, the saved-state count is roughly the number of characters that quantifier-governed subexpressions matched. [0-9]+ against 1234 saves four states - one per digit, marking each as a «match could end here» point. When the engine eventually fails further along the pattern, it pops those four states one at a time, trying to find a way to make the match succeed.

A subtle detail: the + quantifier requires its first match, so it doesn’t save a state for «match nothing here». * does, because zero matches are allowed. That asymmetry shows up in benchmark microcomparisons of + vs * patterns; on inputs where the quantifier really must match at least one thing, + is slightly faster because it has one fewer state to manage on failure.

Catastrophic backtracking#

Consider this pattern trying to match a long string of a’s:

"aaaaaaaaaaaaaaaaaaaa" =~ /(a+)+b/;

The pattern cannot match - there is no b. But the engine has to prove it. Each a in the string can be assigned to the inner + or the outer + in many ways. For a 20-character string that is millions of combinations; for a 30-character string, billions. The engine tries them all.

The general shape to watch for: nested indeterminate quantifiers on overlapping classes.

/(a+)+/;             # dangerous
/(a|b)*c/;           # dangerous on a+b strings without trailing c
/(\w+)*\s/;          # dangerous on a run of word chars with no space
/([^"]|\\.)*"/;      # dangerous on long unterminated strings

Every one of these has the same shape: an outer quantifier on a group whose inner content is itself indeterminate. Friedl’s canonical scary example, the doublequoted-string regex "(\.|[^"\\]+)*" applied where it cannot match, is documented to take roughly 3.2 × 10²⁹ backtracks before giving up - many universe-lifetimes if the engine really did try them all. In practice the recursion-depth ceiling fires first; the program does not so much hang as fall over in slow motion.

The disease is the same in every case. For each extra character the input grows, the number of ways the engine can apportion characters between the inner and outer quantifier doubles.

The cure is determinism: write the regex so that for any input there is exactly one way the engine can divide the work between inner and outer quantifier. The next sections cover the three techniques that achieve this.

Κτητικοί ποσοδείκτες#

A possessive quantifier (++, *+, ?+, {n,m}+) is greedy and never gives back. It tells the engine: «once I’ve matched these characters, they are mine; do not revisit this decision on failure.»

/(a++)+b/;    # at most one backtrack point
/(\w++)*\s/;  # bounded - \w++ takes all word chars in one shot

The trade-off: if a later part of the pattern needs those characters back, the match simply fails. Possessive is correct when there really is no other way for the match to succeed at that prefix.

Possessive quantifiers are syntactic sugar for an atomic group around the quantified atom. The following are equivalent:

Κτητική μορφή	Μορφή ατομικής ομάδας
`PAT*+`	`(?>PAT*)`
`PAT++`	`(?>PAT+)`
`PAT?+`	`(?>PAT?)`
`PAT{n,m}+`	`(?>PAT{n,m})`

The possessive notation is shorter; the atomic-group notation generalises further (you can wrap arbitrary alternations, not just quantified atoms).

A consequence of «possessive equals atomic»: some combinations make no sense and Perl rejects them.

Illegal form	Equivalent legal form
`X??+`	`X{0}`
`X+?+`	`X{1}`
`X{n,m}?+`	`X{n}`

Non-greedy and possessive are contradictory. Use X{0}, X{1}, or X{n} if you really wanted to match exactly that many.

Atomic groups#

(?>…) is a non-capturing group whose contents, once matched, are committed. Like a possessive quantifier, no backtracking happens inside it. The difference is that atomic groups wrap arbitrary patterns, not just quantified atoms.

"aaab"  =~ /a*ab/;       # matches - the a* gives back one 'a'
"aaab"  =~ /(?>a*)ab/;   # does not match - a* takes all, none to give

Atomic groups are the most general tool for controlling backtracking. Use them around a block of pattern that should either fully match or not participate.

A classic application: balanced parentheses, fast on bad input.

# Unbounded: dangerous on unbalanced input
/ \( ( [^()]+ | \([^()]*\) )+ \) /x;

# With atomic group: fast failure on bad input
/ \( ( (?> [^()]+ ) | \([^()]*\) )+ \) /x;

The inner (?>[^()]+) says: gobble as many non-paren characters as possible and don’t come back. That removes the ambiguity that makes nested + expensive.

(?>…) does not disable backtracking outside it. The engine can still backtrack past the construct as a whole, just not into it. So ((?>a*)|(?>b*))ar against bar still matches: the outer alternation can backtrack between branches.

Nested atomic groups are not no-ops. Each extra layer further constrains what the inner pieces can give up. (?>a[bc]*c) matches abc; (?>a(?>[bc]*)c) does not, because the inner (?>[bc]*) ate the trailing c and refuses to release it.

The long-form spelling (*atomic:…) is also accepted; it reads better in pattern code that already uses verbs like (*positive_lookahead:…).

Matching quoted strings - the canonical case#

The «match a quoted string with backslash escapes» problem is the one everyone gets wrong on the first try.

Naive:

/"(?:[^"\\]|\\.)*"/;

Given a long unterminated string like "aaaaaaaaa (no closing quote), this backtracks quadratically over every partition of the a’s between the two alternatives. The safe form:

/"(?:[^"\\]++|\\.)*+"/;

The inner ++ takes all non-quote, non-backslash characters in one bite. The outer *+ refuses to give back across iterations. A failed match fails in linear time. The atomic-group equivalent is identical in behaviour:

/" (?> (?: (?> [^"\\]+ ) | \\. )* ) "/x;

and demonstrates why the possessive form is preferable when it applies. It’s terser without losing anything.

Unrolling the loop#

The single most useful technique for replacing a non-deterministic pattern with a deterministic one. Friedl named it; engines have internalised it; every working regexp programmer should know it.

The pattern shape:

opening normal* (special normal*)* closing

where normal and special are subexpressions that consume disjoint sets of characters. The construction guarantees there is exactly one way the engine can split the input between them, so no exponential exploration ever begins.

Three rules, all mandatory:

special and normal cannot start a match at the same position. If both can match the same character, the engine has freedom to choose; that freedom is exactly what causes exponential blowup.
special cannot match the empty string. A special that matches nothingness is invisible to the engine and removes the checkpoint that makes the construction deterministic.
One application of special cannot be coverable by multiple applications of special. If special is \s+ and the input has three consecutive spaces, the engine could match them as one special, three specials, or various combinations - exponential blowup is back.

Worked example: doublequoted strings#

Naive form, prone to catastrophic backtracking:

/"(?:\\.|[^"\\])*"/;

Unrolled form, no nested ambiguity:

/"[^"\\]*(\\.[^"\\]*)*"/;

[^"\\]* is the normal* - it eats every non-special character. \\. is the special - it consumes a backslash plus exactly the following character. The two cannot overlap (normal* cannot contain \\; special cannot match a non-\\ character), and special always consumes exactly two characters.

The unrolled form runs in linear time on any input - matching or not. The naive form is exponential on inputs with many backslashes and no closing quote.

The same problem with possessive quantifiers also runs in linear time:

/"[^"\\]*+(?:\\.[^"\\]*+)*+"/;

Pick whichever the rest of your team reads more easily. Both work.

Worked example: C comments#

The «match a C comment, surviving asterisks inside» problem. Naive comment match \/\*.*?\*\/ works but is slow on long comments because .*? has to expand one character at a time.

Unrolled:

/\/\*[^*]*\*+([^/*][^*]*\*+)*\//;

Reading from left to right:

\/\* - opening /*.
[^*]* - normal*, zero or more non-asterisk characters.
\*+ - at least one asterisk.
([^/*][^*]*\*+)* - zero or more iterations of: a non-slash, non-asterisk character, then more non-asterisks, then more asterisks. (The non-slash bit is what stops the run: an asterisk followed by a slash is the terminator we want.)
\/ - closing slash.

The construction makes every iteration of the outer * consume at least one character that no other iteration could have consumed. Result: linear time, no catastrophic-backtracking pattern.

The pedagogical value of working through this is in the thought process, not memorising the final regex. The asterisks-inside complication forces the extra [^/*] term; recognising why is what teaches you to spot the next case.

Worked example: the freeflowing regex#

When you have multiple «skip me» subexpressions (strings, comments, constants in a parser-like context), build a single regex with all of them in alternation and an OTHER+ clause for the bulk of the data:

my $DOUBLE = qr/"[^"\\]*(?:\\.[^"\\]*)*"/;
my $SINGLE = qr/'[^'\\]*(?:\\.[^'\\]*)*'/;
my $COMMENT = qr/\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\//;
my $LINECOM = qr/\/\/[^\n]*/;
my $OTHER  = qr/[^"'\/]+/;

while ($source =~ / $DOUBLE | $SINGLE | $COMMENT | $LINECOM | $OTHER /xg) {
    ...
}

The $OTHER arm consumes the bulk of the input in long runs, which means the engine spends almost all its time inside one linear-time loop, not on the bump-along trying alternatives at every uninteresting character. The technique is decisive on parser-like workloads.

Reducing choice#

If your pattern works but is slow, look for spurious choice:

Anchor where you can. ^pat and pat$ let the engine reject most starting positions immediately. \A is even better than ^ when you mean «start of string» - the engine can refuse to retry at every position rather than only at line starts.
Replace alternation of single characters with a class. /a|b|c/ is slow compared to /[abc]/; the class is a single choice, the alternation is three.
Put the common alternative first. Alternation is tried left-to-right, and Perl stops at the first success. If 90% of your inputs hit alternative 3, reorder.
Expose common prefixes. /this|that|them/ defeats the fixed-string optimisation; /th(?:is|at|em)/ hands the engine the literal th to scan for first.
Prefer character classes to .. [^\n] is cheaper than . because the engine knows it will never match across lines. Same for [^"] instead of .*? between quotes.
Negated character classes beat .*?. The non-greedy form has to leave the inner loop on every iteration to test what follows; the negated class participates in the simple-repetition fast path. <([^>]+)> is several times faster than <(.+?)> and also more correct: the negated class cannot cross a > even under backtracking pressure.

Compile once, match many#

Every time a pattern with interpolated variables is used, Perl checks whether the text of the pattern changed. If it has not, the compiled form is reused. For patterns that really are fixed, the overhead is small but nonzero.

qr compiles a pattern once and produces a reusable object:

my $word = qr/\b[a-z]+\b/;
for my $text (@corpus) {
    while ($text =~ /$word/g) {
        ...
    }
}

Using qr in a hot loop saves the recompile-or-check step and makes the code clearer. It also lets you build complex patterns out of named pieces, which is decisive for any regex over about ten lines.

The /o modifier («compile once») was the Perl 4 mechanism for this. It still works but is rarely the right tool today: qr// is clearer, scopes better, and composes.

Avoiding capture overhead#

Capturing groups cost a little - the engine has to track the start and end positions, and the surrounding optimisations (simple repetition in particular) may be disabled. If you don’t use the capture, don’t ask for it.

# Capturing even though $1 is never read:
/(\d{4})-(\d{2})-(\d{2})/;

# Non-capturing:
/(?:\d{4})-(?:\d{2})-(?:\d{2})/;

The /n modifier turns every (…) into (?:…) without editing the pattern - useful for long patterns. Named captures still capture under /n.

Even non-capturing parens may suppress the simple-repetition optimisation in some engines, so (?:[a-z])+ is occasionally slower than [a-z]+. Drop unnecessary grouping; flatten patterns where you can without hurting readability.

Recursive patterns#

Perl regexps support recursion - a pattern can refer back to one of its own groups. The construct is (?R) (or (?0)) for the whole pattern, (?N) for group N, (?-N) and (?+N) for relative references, and (?&NAME) for named groups.

A recursive pattern for a balanced parenthesis, any depth:

my $bal = qr/
    (?(DEFINE)
        (?<paren>
            \(                  # opening paren
            (?:
                [^()]++         # run of non-paren, possessive
              | (?&paren)       # or a balanced sub-group, recursively
            )*+
            \)                  # closing paren
        )
    )
    (?&paren)
/x;

"((a)(b(c)))" =~ /^$bal$/;   # matches

A named subpattern declared in a (?(DEFINE)...) block recurses into itself, not into the whole enclosing pattern. Using (?R) here would force every level of recursion to re-apply the call site’s anchors (^ and $), which is rarely what you want.

Recursion makes genuinely recursive structures matchable without falling out of the regex DSL. It is also a strong signal that you might be reaching for the wrong tool. A real parser is often clearer; reach for recursive regexps when you want a one-line match-or-fail decision, not a parse tree.

There is a recursion-depth limit compiled into the engine (it can be raised by rebuilding from source). Patterns that recurse without consuming input fail immediately rather than running forever - the engine detects the cycle.

The interaction with capture groups deserves attention. When a group recurses into itself, captures set inside the recursion are not visible to the caller after the recursion returns. This is the practical reason most recursive patterns use possessive quantifiers and atomic groups for the inner content: capturing it serves no purpose because the data won’t be there when the outer match finishes.

Embedded code#

(?{ code }) runs Perl code at the point the engine reaches it. The code’s return value is set as $^R. Useful for instrumentation and for assertions whose condition is more naturally written in Perl than in regex syntax:

my $tries = 0;
"aaaaab" =~ /a*(?{ $tries++ })b/;
print "engine tried $tries times\n";

Two important properties:

The block is backtracking-safe if you assign to local variables. local $cnt = $cnt + 1 undoes itself when the surrounding match backtracks past the block. A bare assignment does not.
The block sees the same lexical scope it was compiled in. A qr// containing (?{ ... }) closes over the lexicals in scope at compile time, not at match time. This is the same rule that governs closures everywhere else in Perl.

$^N is particularly useful inside (?{...}) - it holds the text matched by the most-recently-closed capture group, so you can save matches without counting parentheses:

$_ = "The brown fox jumps over the lazy dog";
/the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
print "color = $color, animal = $animal\n";
# color = brown, animal = fox

The big cost: (?{ ... }) globally disables certain optimisations in the pattern. Patterns containing it can run much slower than patterns without it, even on inputs where the code block runs zero times.

(*{ code }) is the optimistic form, introduced in Perl 5.37.7. Same syntax, same behaviour, except it does not disable optimisations. The trade-off: the engine may execute the block fewer times than the strict semantics of (?{...}) would require. On a failing match it might not run at all. Use (?{...}) when you depend on side effects firing in a documented order; use (*{...}) when you don’t.

(??{ code }) is the postponed regex: the code’s return value is treated as a pattern, compiled if it is a string (or used directly if it is a qr// object), and matched at that position. It enables genuinely runtime-shaped patterns:

my $inner = '(.)\1';
"ABBA" =~ /^(.)(??{ $inner })\1/;
print $1;   # 'A'

The inner pattern has its own capture state - captures inside (??{...}) are not visible to the outer pattern. For balanced parens or other recursive structures, (?R)/(?N) is more efficient and also clearer.

User-supplied input must never reach (?{...}) or (??{...}). Perl forbids it by default; use re 'eval' removes the safety catch and you should not.

Embedded-code execution frequency#

The exact number of times (?{...}) and (??{...}) execute during a match is unspecified. The engine guarantees only:

On a successful match, blocks on the accepting path execute in left-to-right order, the appropriate number of times for the quantifier they sit under.
Blocks may execute zero, one, or many times during failed matches, depending on what optimisations the engine applies.

For example, in "aaabcdeeeee" =~ /a(?{print "a"})b(?{print "b"})cde/, the precise number of a’s and b’s printed is unspecified for failed matches. For a successful match each block runs at least once on the accepting path; if b was printed, an a was necessarily printed at least once before it.

Quantified blocks behave intuitively on success:

"good" =~ /g(?:o(?{ print "o" }))*d/;
# prints o twice

Code blocks intended to count, log, or trace must use the (?{...}) form (deterministic on success); blocks that are purely advisory may use (*{...}) (and accept the lower call count in exchange for not breaking optimisations).

Special backtracking control verbs#

Perl provides verbs that direct the engine more precisely than ordinary atomic groups can. They all use the form (*VERB) or (*VERB:NAME). They are zero-width (they consume no input) and only do something when backtracked into on failure.

The verbs interact with two package variables: $REGERROR and $REGMARK. After a match involving any verb that takes a NAME:

$REGMARK is set to the name of the last (*MARK:NAME) executed, on success.
$REGERROR is set to the name of the verb that caused the failure, on failure.

These are package globals (not magic variables; not localised automatically); use local if you need scoping.

`(PRUNE)` and `(PRUNE:NAME)`#

Prunes the backtracking tree at the current point. After A(*PRUNE)B: if B fails, the engine does not backtrack into A. The whole match fails at the current starting position; it does not advance and retry.

'aaab' =~ /a+b?(?{ print "$&\n"; $count++ })(*FAIL)/;
# Prints aaab, aaa, aa, a, aab, aa, a, ab, a - 9 times.

$count = 0;
'aaab' =~ /a+b?(*PRUNE)(?{ print "$&\n"; $count++ })(*FAIL)/;
# Prints aaab, aab, ab - 3 times.

(*PRUNE) is a more powerful (?>...): it controls backtracking across pattern segments rather than within a fixed sub-pattern.

`(SKIP)` and `(SKIP:NAME)`#

Like (*PRUNE), plus: signals that the text leading up to the verb cannot be part of any match. The engine advances the start position to where the verb fired and retries from there:

'aaabaaab' =~ /a+b?(*SKIP)(?{ print "$&\n"; $count++ })(*FAIL)/;
# Prints aaab, aaab - 2 times.

If (*SKIP:NAME) fires and a (*MARK:NAME) was set earlier, the new start position is the mark’s position, not the verb’s. This is the only sane way to do «if branch X failed, advance past where branch X started» in a single pattern.

`(*MARK:NAME)`#

A named pseudo-checkpoint. Does nothing on its own - but (*SKIP:NAME) uses it as a target, and $REGMARK after a successful match holds the most recently executed mark name. The shorthand (*:NAME) is identical to (*MARK:NAME).

This lets a pattern record which alternative branch matched without using a separate capture group per branch:

/(?:foo(*MARK:F)|bar(*MARK:B)|baz(*MARK:Z))/;
# After a successful match, $REGMARK is "F", "B", or "Z".

The optimiser handles MARK/SKIP more efficiently than /(?:(foo)|(bar)|(baz))/ because it does not have to track three parallel capture states.

`(THEN)` and `(THEN:NAME)`#

Inside an alternation: on failure of the rest of the pattern, backtrack to the next alternative of the innermost enclosing group with alternations, skipping branches above that level. Outside an alternation, equivalent to (*PRUNE). Loosely: «this branch and only this branch; if it fails, try the next sibling branch.»

/(COND1 (*THEN) ACT1 | COND2 (*THEN) ACT2 | COND3 (*THEN) ACT3)/x;

Reads like an if/elsif/else inside the regex.

`(COMMIT)` and `(COMMIT:arg)`#

The strongest brake. Like (*SKIP), but instead of advancing the start position, the engine gives up entirely. No further attempts to find a match anywhere in the string.

'aaabaaab' =~ /a+b?(*COMMIT)(?{ print "$&\n"; $count++ })(*FAIL)/;
# Prints aaab - 1 time.

Use when a partial match has consumed enough to prove that no match exists anywhere later either.

`(FAIL)` / `(F)`#

Always fails. Equivalent to (?!) but more readable. Useful only combined with (?{...}): when you want to count or log every attempted match without ever accepting one. (?!) is internally optimised into (*FAIL).

`(ACCEPT)` and `(ACCEPT:arg)`#

The opposite of (*FAIL). Always succeeds, ending the match immediately. Inside nested patterns or recursion, only the innermost pattern is ended. Capture groups that have opened but not yet closed are marked as ending where (*ACCEPT) fires:

'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
# matches; $1 is "AB", $2 is "B", $3 is undef

(*ACCEPT) is most useful inside (??{...}) or (?(condition)...) constructs where you want a fast-path success without finishing the pattern.

Zero-length-match termination#

A repeated pattern that can match the empty string is a problem. Trivially:

'foo' =~ m{ ( o? )* }x;

Each iteration of the outer * matches the empty string, the position does not advance, the next iteration matches empty again - infinite loop in waiting. Perl breaks this loop in two different ways depending on which kind of repetition is involved.

Lower-level loops - the quantifiers *, +, {n,m} - are interrupted when the engine detects that an iteration matched zero-length. The construct

(?: NON_ZERO_LENGTH | ZERO_LENGTH )*

is silently treated as

(?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )?

The zero-length branch can run, but only once, after the non-zero branch is done.

Higher-level loops - the /g modifier on m// and s///, and the split operator - preserve a flag across iterations recording whether the previous match was zero-length. After a zero-length match the next attempt is forbidden to also be zero-length; the engine takes the second-best match instead. The canonical demonstration:

$_ = 'bar';
s/\w??/<$&>/g;
# Result: <><b><><a><><r><>

The non-greedy \w?? prefers an empty match. After producing one, the next iteration is forbidden from being zero-length too, so the engine takes the second-best match - a single \w character. Then another empty match, then another character, and so on, alternating until the string is consumed.

The state of «previous match was zero-length» is associated with the matched string and is reset by any assignment to pos. Zero-length matches at the end of the previous match are also ignored during split, which is why split // produces individual characters and not a trailing empty fragment.

Combining patterns: better and worse#

perlre’s «Combining RE Pieces» section gives a precise account of which match the engine prefers when a pattern admits more than one. The rules are mechanical, and worth knowing for the cases where intuition fails.

For two pattern pieces concatenated, ST:

If S matches A better than A', then AB is better than A'B' (the leftmost piece’s preference dominates).
If S matches both A and A' equally, T’s preference decides.

For alternation, S|T:

S matching is always better than only T matching (left-to- right choice - Traditional NFA semantics).

For greedy and non-greedy quantifiers:

S{n,m} is equivalent to trying S{m}, then S{m-1}, …, then S{n} (greedy: maximum first).
S{n,m}? is equivalent to trying S{n}, then S{n+1}, …, then S{m} (non-greedy: minimum first).

For atomic groups, (?>S):

Only the best match for S is considered; backtracking is forbidden inside.

For lookaround, (?=S) / (?<=S):

Only the best match for S is considered (the same as atomic). This matters only if S has capturing parens whose contents are used outside.

The above rules describe which match is chosen at a fixed starting position. One additional rule covers position selection: a match at an earlier position is always better than a match at a later position - the bump-along property is more important than any preference within a single position.

Engine internals - what the optimiser does for you#

A regex engine’s compiler runs a small army of analyses on the pattern before matching begins. Knowing them lets you write patterns that enable them, rather than ones that defeat them.

First-character discrimination#

If the pattern must begin with one of a small set of characters (am|pm ⇒ [ap]), the bump-along can scan the input for one of those characters using a fast inner loop, invoking the full engine only at candidate positions.

A pattern starting with ., or with a top-level alternation the engine cannot see through, defeats this. Modern engines (PCRE2, RE2, the Rust regex crate) go far beyond first-character discrimination - Boyer-Moore on prefix literals, byte-level SIMD scanning, Aho-Corasick fallback for many literal alternatives. Perl 5.42 does the simpler form well; the more aggressive techniques are why a benchmark may show RE2 outperforming Perl on inputs dominated by literal scanning.

Fixed-string check#

If the pattern requires some literal substring (e.g. Subject: in ^Subject: (Re: )?(.*)), the engine uses Boyer-Moore to skip to candidate positions cheaply. A naive pattern like /this|that|these|those/ defeats this; rewriting as /th(?:is|at|ese|ose)/ exposes the common prefix th for the optimiser to find.

Simple-repetition fast path#

x*, [a-f]+, .? and other «single atom, simple quantifier» constructs are dispatched to a fast inner loop that bypasses the general engine machinery. Wrapping the atom in capturing parens disables this. Even non-capturing parens may, depending on the exact construct.

Length cognizance#

If the pattern requires at least N characters, the engine can refuse to attempt matches in the last N positions of the string, and refuse to attempt matches at all in strings shorter than N.

Anchor implication#

A pattern starting with ^ is attempted only once (or once per logical line under /m). Less obviously, a pattern starting with .* followed by an anchorable subpattern can be implicitly anchored, because by the time .* has consumed the whole string, no later starting position can possibly succeed. Modern engines recognise this; do not write .* at the start unless you mean it.

Compile caching#

Compiling a regex to its internal form is non-trivial. Perl caches compiled patterns aggressively - patterns embedded in code are compiled once when the surrounding sub is compiled, and patterns assembled from interpolated variables are recompiled only when the resulting text actually changes. qr// makes this explicit. The /o modifier was the older mechanism; qr// has superseded it in idiom.

Matching many strings efficiently#

If you have a fixed list of strings to match and the list is stable, building one big alternation is often the simplest win:

my @words = qw(the quick brown fox);
my $any = join '|', map quotemeta, @words;
my $re = qr/\b(?:$any)\b/;
for my $line (@lines) {
    next unless $line =~ /$re/;
    ...
}

quotemeta protects each word from containing accidental metacharacters. The (?:…) is non-capturing - unnecessary captures would cost more than they pay. If the list is large (thousands of strings), a specialised module that builds an Aho-Corasick or trie matcher will beat a giant alternation; the built-in engine compiles the alternation to an NFA that is fast but not asymptotically optimal for that workload.

Measuring#

Claims about regexp speed are cheap; measurements are not. Benchmark::timethese makes it easy to compare two forms on real data:

use Benchmark qw(timethese);

my $text = "..." x 10000;

timethese(1000, {
    greedy     => sub { $text =~ /"(?:[^"\\]|\\.)*"/ },
    possessive => sub { $text =~ /"(?:[^"\\]++|\\.)*+"/ },
});

Always profile before optimising. «Obviously slow» patterns often aren’t; «obviously fast» ones sometimes are pathological on real input. Friedl’s framing remains the best summary: the fastest program is the one that finishes first.

Rules of thumb#

Write the pattern that reads best; profile.
If matching is slow, look for nested quantifiers on overlapping classes.
Fix them with possessive quantifiers, atomic groups, or unrolling.
Anchor with ^, $, \A, \z, or \G whenever the pattern can only legitimately start or end in those positions.
Prefer classes to alternations of single characters.
Prefer negated classes to non-greedy quantifiers when they describe the same set.
Use qr// for patterns that run in a loop.
Use (?:…) for groups you don’t capture, or /n to suppress captures wholesale.
Use (*{...}) instead of (?{...}) when the code block is advisory.
When a real parser would be clearer than a pattern, write the parser.

Δείτε επίσης#

The quantifiers chapter - greedy, non-greedy, possessive forms.
The groups and captures chapter - atomic groups, recursive subpatterns, conditional patterns.
The alternation chapter - ordering, leftmost alternative wins.
The substitution chapter - /g and zero-length match termination in substitution.
qr - precompile patterns for repeated use.
pos - read or set the /g match position; also resets the zero-length-match flag.