--- name: regex performance and backtracking --- # Performance Most regexps run in roughly linear time over the string. A well-written pattern that matches is fast; a well-written pattern that fails is also fast. The trouble comes from patterns that *could have* matched in many different ways — the engine explores those possibilities one after another, and pathological patterns can explore exponentially many. This chapter is about recognising that trouble, avoiding it, and when you can't avoid it, damping it with possessive quantifiers, atomic groups, and careful structure. ## How backtracking works A regexp engine walks the string and the pattern in lockstep. When it reaches a choice — an alternation, a quantifier, an optional group — it makes a decision, remembers the choice, and continues. If a later position in the pattern fails, it goes back to the most recent choice, makes the next available decision there, and tries again. This is *backtracking*. Every `?`, `*`, `+`, `{…}`, and `|` is a potential backtrack point. A short string and few choice points: fast. A long string, nested choice points, and no match: exponential. ## Catastrophic backtracking Consider this pattern trying to match a long string of `a`'s: ```perl "aaaaaaaaaaaaaaaaaaaa" =~ /(a+)+b/; ``` The pattern cannot match — there is no `b`. But the engine has to prove it. Each `a` in the string can be assigned to the inner `+` or the outer `+` in many ways. For a 20-character string that's millions of combinations; for a 30-character string, billions. The engine tries them all. The general shape to watch for: **nested indeterminate quantifiers on overlapping classes.** ```perl /(a+)+/; # dangerous /(a|b)*c/; # dangerous on a+b strings without trailing c /(\w+)*\s/; # dangerous on a run of word chars with no space /([^"]|\\.)*"/; # dangerous on long unterminated strings ``` Every one of these has the same shape: an outer quantifier on a group whose inner content is itself indeterminate. ## Possessive quantifiers A possessive quantifier (`++`, `*+`, `?+`) is greedy *and* never gives back. It tells the engine: "once I've matched these characters, they are mine; don't revisit this decision on failure." ```perl /(a++)+b/; # at most one backtrack point /(\w++)*\s/; # bounded — \w++ takes all word chars in one shot ``` The trade-off: if a later part of the pattern needs those characters back, the match simply fails. Possessive is correct when there really is no other way for the match to succeed at that prefix. ## Atomic groups `(?>…)` is a non-capturing group whose contents, once matched, are committed. Like a possessive quantifier, no backtracking happens inside it. The difference is that atomic groups wrap *arbitrary* patterns, not just quantified atoms. ```perl "aaa" =~ /a*ab/; # matches — the a* gives back one 'a' "aaa" =~ /(?>a*)ab/; # does not match — a* takes all, none to give ``` Atomic groups are the most general tool for controlling backtracking. Use them around a block of pattern that should either fully match or not participate. A classic application: balanced parenthesis with arbitrary depth up to some limit, without the pathological blowup: ```perl # Unbounded: dangerous on unbalanced input / \( ( [^()]+ | \([^()]*\) )+ \) /x; # With atomic group: fast failure on bad input / \( ( (?> [^()]+ ) | \([^()]*\) )+ \) /x; ``` The inner `(?>[^()]+)` says: gobble as many non-paren characters as possible and don't come back. That removes the ambiguity that makes nested `+` expensive. ## Matching quoted strings The canonical "match a quoted string with backslash escapes" is a well-known trap. Naive: ```perl /"(?:[^"\\]|\\.)*"/; ``` Given a long unterminated string like `"aaaaaaaaa` (no closing quote), this backtracks quadratically over every partition of the `a`'s. The safe form: ```perl /"(?:[^"\\]++|\\.)*+"/; ``` The inner `++` takes all non-quote, non-backslash characters in one bite. The outer `*+` refuses to give back across iterations. A failed match fails in linear time. ## Reducing choice If your pattern works but is slow, look for spurious choice: - **Anchor where you can.** `^pat` and `pat$` let the engine reject most starting positions immediately. - **Replace alternation of single characters with a class.** `/a|b|c/` is slow compared to `/[abc]/`; the class is a single choice, the alternation is three. - **Put the common alternative first.** Alternation is tried left to right; if 90% of your inputs hit alternative 3, reorder. - **Prefer character classes to `.`.** `[^\n]` is cheaper than `.` because the engine knows it will never match across lines. Same for `[^"]` instead of `.*?` between quotes. - **Use `\A` not `^`** when you mean string start; the engine can optimise. ## Compile once, match many Every time a pattern with interpolated variables is used, Perl checks whether the text of the pattern changed. If it hasn't, the compiled form is reused. For patterns that really are fixed, the overhead is small but nonzero. [`qr`](../../p5/core/perlfunc/qr) compiles a pattern once and produces a reusable object: ```perl my $word = qr/\b[a-z]+\b/; for my $text (@corpus) { while ($text =~ /$word/g) { ... } } ``` Using `qr` in a hot loop saves the recompile-or-check step and makes the code clearer. ## Avoiding capture overhead Capturing groups cost a little — the engine has to track the start and end positions. If you don't use the capture, don't ask for it. ```perl # Capturing even though $1 is never read: /(\d{4})-(\d{2})-(\d{2})/; # Non-capturing: /(?:\d{4})-(?:\d{2})-(?:\d{2})/; # If this pattern is one of many in a long loop and you never # look at $1, $2, $3, the second is measurably faster. ``` The `/n` modifier turns every `(…)` into `(?:…)` without editing the pattern — useful for long patterns. ## Recursive patterns Perl regexps support recursion — a pattern can refer back to one of its own groups: ```perl # Match a balanced parenthesis, any depth. my $bal = qr/ \( # opening paren (?: [^()]+ # run of non-paren | (?R) # or the whole pattern again, recursively )* \) # closing paren /x; "((a)(b(c)))" =~ /^$bal$/; # matches ``` `(?R)` or `(?0)` re-invokes the entire pattern. `(?1)`, `(?2)`, … re-invoke a specific group. `(?&name)` re-invokes a named group. Recursion makes genuinely recursive structures — nested parentheses, brackets, quotes — matchable. It is also a strong signal that you might be reaching for the wrong tool. A real parser is often clearer. ## Executing code in a pattern `(?{…})` runs Perl code at the point the engine reaches it. Useful for instrumentation: ```perl my $tries = 0; "aaaaab" =~ /a*(?{$tries++})b/; print "engine tried $tries times\n"; ``` `(??{…})` is a *pattern* code expression: the code returns a pattern string, which is then matched right there. Powerful; also a gaping security hole if the string is user-supplied. Perl forbids interpolating user input into a `(?{…})` by default; `use re 'eval'` disables that check, and you should not. ## Matching many strings efficiently If you have a fixed list of strings to match and the list is stable, building one big alternation is often the simplest win: ```perl my @words = qw(the quick brown fox); my $any = join '|', map quotemeta, @words; my $re = qr/\b(?:$any)\b/; for my $line (@lines) { next unless $line =~ /$re/; ... } ``` `quotemeta` protects each word from containing accidental metacharacters. Note the `(?:…)` — unnecessary captures would cost more than they pay. For very large lists, specialised modules are faster than a giant alternation; the built-in engine trades off generality for simplicity in that case. ## Measuring Claims about regexp speed are cheap; measurements are not. `Benchmark::timethese` makes it easy to compare two forms on real data: ```perl use Benchmark qw(timethese); my $text = "..." x 10000; timethese(1000, { greedy => sub { $text =~ /"(?:[^"\\]|\\.)*"/ }, possessive => sub { $text =~ /"(?:[^"\\]++|\\.)*+"/ }, }); ``` Always profile before optimising. "Obviously slow" patterns often aren't; "obviously fast" ones sometimes are pathological on real input. ## Rules of thumb - Write the pattern that reads best; profile. - If matching is slow, look for nested quantifiers on overlapping classes. - Fix them with possessive quantifiers or atomic groups. - Anchor with `^`, `$`, `\A`, `\z`, or `\G` whenever the pattern can only legitimately start or end in those positions. - Prefer classes to alternations of single characters. - Use `qr//` for patterns that run in a loop. - Use `(?:…)` for groups you don't capture; or `/n` to suppress captures wholesale. - When a real parser would be clearer than a pattern, write the parser. ## See also - [`perlre`](../../p5/core/perlre) — complete reference including the special backtracking control verbs `(*PRUNE)`, `(*SKIP)`, `(*COMMIT)`, `(*FAIL)`. - [`qr`](../../p5/core/perlfunc/qr) — precompile patterns for repeated use. - The [quantifiers](quantifiers) chapter — greedy, non-greedy, possessive.