---
name: regex performance and backtracking
---
# Performance

Most regexps run in roughly linear time over the string. A
well-written pattern that matches is fast; a well-written pattern
that fails is also fast. The trouble comes from patterns that *could
have* matched in many different ways — the engine explores those
possibilities one after another, and pathological patterns can
explore exponentially many.

This chapter is about recognising that trouble, avoiding it, and
when you can't avoid it, damping it with possessive quantifiers,
atomic groups, and careful structure.

## How backtracking works

A regexp engine walks the string and the pattern in lockstep. When
it reaches a choice — an alternation, a quantifier, an optional
group — it makes a decision, remembers the choice, and continues.
If a later position in the pattern fails, it goes back to the most
recent choice, makes the next available decision there, and tries
again. This is *backtracking*.

Every `?`, `*`, `+`, `{…}`, and `|` is a potential backtrack point.
A short string and few choice points: fast. A long string, nested
choice points, and no match: exponential.

## Catastrophic backtracking

Consider this pattern trying to match a long string of `a`'s:

```perl
"aaaaaaaaaaaaaaaaaaaa" =~ /(a+)+b/;
```

The pattern cannot match — there is no `b`. But the engine has to
prove it. Each `a` in the string can be assigned to the inner `+`
or the outer `+` in many ways. For a 20-character string that's
millions of combinations; for a 30-character string, billions. The
engine tries them all.

The general shape to watch for: **nested indeterminate quantifiers
on overlapping classes.**

```perl
/(a+)+/;             # dangerous
/(a|b)*c/;           # dangerous on a+b strings without trailing c
/(\w+)*\s/;          # dangerous on a run of word chars with no space
/([^"]|\\.)*"/;      # dangerous on long unterminated strings
```

Every one of these has the same shape: an outer quantifier on a
group whose inner content is itself indeterminate.

## Possessive quantifiers

A possessive quantifier (`++`, `*+`, `?+`) is greedy *and* never
gives back. It tells the engine: "once I've matched these
characters, they are mine; don't revisit this decision on
failure."

```perl
/(a++)+b/;    # at most one backtrack point
/(\w++)*\s/;  # bounded — \w++ takes all word chars in one shot
```

The trade-off: if a later part of the pattern needs those
characters back, the match simply fails. Possessive is correct
when there really is no other way for the match to succeed at
that prefix.

## Atomic groups

`(?>…)` is a non-capturing group whose contents, once matched,
are committed. Like a possessive quantifier, no backtracking
happens inside it. The difference is that atomic groups wrap
*arbitrary* patterns, not just quantified atoms.

```perl
"aaa"   =~ /a*ab/;       # matches — the a* gives back one 'a'
"aaa"   =~ /(?>a*)ab/;   # does not match — a* takes all, none to give
```

Atomic groups are the most general tool for controlling
backtracking. Use them around a block of pattern that should
either fully match or not participate.

A classic application: balanced parenthesis with arbitrary depth
up to some limit, without the pathological blowup:

```perl
# Unbounded: dangerous on unbalanced input
/ \( ( [^()]+ | \([^()]*\) )+ \) /x;

# With atomic group: fast failure on bad input
/ \( ( (?> [^()]+ ) | \([^()]*\) )+ \) /x;
```

The inner `(?>[^()]+)` says: gobble as many non-paren characters
as possible and don't come back. That removes the ambiguity that
makes nested `+` expensive.

## Matching quoted strings

The canonical "match a quoted string with backslash escapes" is
a well-known trap.

Naive:

```perl
/"(?:[^"\\]|\\.)*"/;
```

Given a long unterminated string like `"aaaaaaaaa` (no closing
quote), this backtracks quadratically over every partition of the
`a`'s. The safe form:

```perl
/"(?:[^"\\]++|\\.)*+"/;
```

The inner `++` takes all non-quote, non-backslash characters in
one bite. The outer `*+` refuses to give back across iterations.
A failed match fails in linear time.

## Reducing choice

If your pattern works but is slow, look for spurious choice:

- **Anchor where you can.** `^pat` and `pat$` let the engine
  reject most starting positions immediately.
- **Replace alternation of single characters with a class.**
  `/a|b|c/` is slow compared to `/[abc]/`; the class is a single
  choice, the alternation is three.
- **Put the common alternative first.** Alternation is tried left
  to right; if 90% of your inputs hit alternative 3, reorder.
- **Prefer character classes to `.`.** `[^\n]` is cheaper than `.`
  because the engine knows it will never match across lines. Same
  for `[^"]` instead of `.*?` between quotes.
- **Use `\A` not `^`** when you mean string start; the engine can
  optimise.

## Compile once, match many

Every time a pattern with interpolated variables is used, Perl
checks whether the text of the pattern changed. If it hasn't, the
compiled form is reused. For patterns that really are fixed, the
overhead is small but nonzero.

[`qr`](../../p5/core/perlfunc/qr) compiles a pattern once and
produces a reusable object:

```perl
my $word = qr/\b[a-z]+\b/;
for my $text (@corpus) {
    while ($text =~ /$word/g) {
        ...
    }
}
```

Using `qr` in a hot loop saves the recompile-or-check step and
makes the code clearer.

## Avoiding capture overhead

Capturing groups cost a little — the engine has to track the start
and end positions. If you don't use the capture, don't ask for it.

```perl
# Capturing even though $1 is never read:
/(\d{4})-(\d{2})-(\d{2})/;

# Non-capturing:
/(?:\d{4})-(?:\d{2})-(?:\d{2})/;

# If this pattern is one of many in a long loop and you never
# look at $1, $2, $3, the second is measurably faster.
```

The `/n` modifier turns every `(…)` into `(?:…)` without editing
the pattern — useful for long patterns.

## Recursive patterns

Perl regexps support recursion — a pattern can refer back to one
of its own groups:

```perl
# Match a balanced parenthesis, any depth.
my $bal = qr/
    \(                  # opening paren
    (?:
        [^()]+          # run of non-paren
      | (?R)            # or the whole pattern again, recursively
    )*
    \)                  # closing paren
/x;

"((a)(b(c)))" =~ /^$bal$/;   # matches
```

`(?R)` or `(?0)` re-invokes the entire pattern. `(?1)`, `(?2)`, …
re-invoke a specific group. `(?&name)` re-invokes a named group.

Recursion makes genuinely recursive structures — nested
parentheses, brackets, quotes — matchable. It is also a strong
signal that you might be reaching for the wrong tool. A real
parser is often clearer.

## Executing code in a pattern

`(?{…})` runs Perl code at the point the engine reaches it. Useful
for instrumentation:

```perl
my $tries = 0;
"aaaaab" =~ /a*(?{$tries++})b/;
print "engine tried $tries times\n";
```

`(??{…})` is a *pattern* code expression: the code returns a
pattern string, which is then matched right there. Powerful; also
a gaping security hole if the string is user-supplied. Perl
forbids interpolating user input into a `(?{…})` by default;
`use re 'eval'` disables that check, and you should not.

## Matching many strings efficiently

If you have a fixed list of strings to match and the list is
stable, building one big alternation is often the simplest win:

```perl
my @words = qw(the quick brown fox);
my $any = join '|', map quotemeta, @words;
my $re = qr/\b(?:$any)\b/;
for my $line (@lines) {
    next unless $line =~ /$re/;
    ...
}
```

`quotemeta` protects each word from containing accidental
metacharacters. Note the `(?:…)` — unnecessary captures would cost
more than they pay.

For very large lists, specialised modules are faster than a giant
alternation; the built-in engine trades off generality for
simplicity in that case.

## Measuring

Claims about regexp speed are cheap; measurements are not.
`Benchmark::timethese` makes it easy to compare two forms on real
data:

```perl
use Benchmark qw(timethese);

my $text = "..." x 10000;

timethese(1000, {
    greedy     => sub { $text =~ /"(?:[^"\\]|\\.)*"/ },
    possessive => sub { $text =~ /"(?:[^"\\]++|\\.)*+"/ },
});
```

Always profile before optimising. "Obviously slow" patterns often
aren't; "obviously fast" ones sometimes are pathological on real
input.

## Rules of thumb

- Write the pattern that reads best; profile.
- If matching is slow, look for nested quantifiers on overlapping
  classes.
- Fix them with possessive quantifiers or atomic groups.
- Anchor with `^`, `$`, `\A`, `\z`, or `\G` whenever the pattern
  can only legitimately start or end in those positions.
- Prefer classes to alternations of single characters.
- Use `qr//` for patterns that run in a loop.
- Use `(?:…)` for groups you don't capture; or `/n` to suppress
  captures wholesale.
- When a real parser would be clearer than a pattern, write the
  parser.

## See also

- [`perlre`](../../p5/core/perlre) — complete reference including
  the special backtracking control verbs `(*PRUNE)`, `(*SKIP)`,
  `(*COMMIT)`, `(*FAIL)`.
- [`qr`](../../p5/core/perlfunc/qr) — precompile patterns for
  repeated use.
- The [quantifiers](quantifiers) chapter — greedy, non-greedy,
  possessive.