Regular expressions#
A regular expression (regexp, regex) is a pattern that decides whether a string has a given shape, or pulls pieces out of a string that does. Perl treats regexps as a first-class sublanguage: they appear wherever you match (m//), substitute (s///), quote a pattern (qr//), or split on a separator (split).
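As a minimal sketch of the four places named above, the following runs each one against an invented log line. The sample data and variable names are assumptions made for illustration, and the /r modifier assumes stock Perl 5.14 or later behaviour.

```perl
use strict;
use warnings;

# Hypothetical sample data, invented for this sketch.
my $line = "error: disk full, error: retry";

# m// tests whether the string has the shape.
my $has_error = $line =~ m/error/ ? 1 : 0;

# s/// substitutes; /r returns the changed copy and leaves $line untouched.
my $fixed = $line =~ s/error/warning/r;

# qr// quotes a pattern into a value that can be reused later.
my $sep = qr/\s*,\s*/;

# split cuts the string wherever the separator pattern matches.
my @parts = split $sep, $line;

print "$has_error\n";          # 1
print "$fixed\n";              # warning: disk full, error: retry
print scalar(@parts), "\n";    # 2
print "$parts[1]\n";           # error: retry
```

Without /g the substitution changes only the first occurrence, which is why the second "error" survives in the output.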
This guide is the definitive pperl reference for regex. Each chapter covers one topic in three layers: an overview that orients you, a precise reference that defines what the construct does, and the gory details — corner cases, performance pathologies, and the cross-engine differences that surprise a Perl-trained reader. The same chapter serves the reader who needs a one-paragraph answer and the reader who wants to know exactly when a pattern will misbehave.
Who this is for#
Readers who know Perl well enough to use scalars, arrays, and hashes, but treat regexps as something to copy from elsewhere and hope works. After reading, you will read an unfamiliar pattern and know what it will match — and when it will refuse, and why.
How this guide is organised#
The chapters are meant to be read in order on a first pass, but each stands on its own for later reference.
Basics: m//, s///, the binding operators =~ and !~, what counts as a metacharacter, how to escape, the four-phase parsing model.
Character classes: bracketed classes […], negated classes, shorthand \d \w \s, POSIX classes, the extended bracketed form (?[…]), the IP-address worked example.
Anchors and assertions: ^, $, \b, \A, \z, \G, \K, lookahead, lookbehind, the \b{…} Unicode boundaries, juxtaposition-as-AND.
Quantifiers: *, +, ?, {n,m}, greedy vs. non-greedy vs. possessive, the principles of the match.
Groups and captures: (...), (?:...), named captures, backreferences, atomic groups, recursive subpatterns, conditional patterns, branch reset.
Alternation: |, precedence, common-prefix factoring, empty alternatives, branch reset.
Modifiers: /i, /m, /s, /x, /xx, /g, /c, /r, /e, /n, /p, /o, /a, /aa, /u, /l, /d, inline forms (?i…), scoped forms (?i:…), the four-phase parsing model, use re 'strict'.
Substitution: s/// in depth (the replacement string, /e, /ee, /r, \K in substitution, zero-length match termination).
Unicode: \p{…}, \P{…}, \X, scripts, the charset modifiers, case folding, script runs, encoding vs. character.
Performance: backtracking, catastrophic patterns, unrolling the loop, atomic groups, possessive quantifiers, recursive patterns, embedded code, special backtracking control verbs, internal optimisations the engine applies on your behalf.
Cross-engine: comparison against PCRE2, Emacs, POSIX BRE, POSIX ERE, and RE2 / Go.
A first round-trip#
The shortest useful regexp program finds a three-letter word repeated back to back, with a single whitespace character between the two copies:
my $text = "I said the the other day";
if ($text =~ /\b(\w{3})\s\1\b/) {
    print "Repeated: $1\n";    # Repeated: the
}
\b is a word boundary: the pattern only fires at word edges.
(\w{3}) captures exactly three word characters into $1.
\s is one whitespace character between the two copies.
\1 is the backreference: match the same three characters again.
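To see what the two \b assertions contribute, the sketch below runs the pattern with and without them against a string in which "the" occurs only inside longer words. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: "the" ends "bathe" and starts "then".
my $text = "bathe then";

# Without \b the pattern finds "the the" straddling the two longer words.
my $unbounded = $text =~ /(\w{3})\s\1/     ? "match" : "no match";

# With \b both copies must be whole words, so the match is refused.
my $bounded   = $text =~ /\b(\w{3})\s\1\b/ ? "match" : "no match";

print "$unbounded\n";    # match
print "$bounded\n";      # no match
```

The leading \b rejects the first copy because it sits in the middle of "bathe", and the trailing \b would reject the second copy because "then" continues past it.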
Every more complicated pattern in the guide is layered on that idea — anchor where you need to, describe what you want, capture what you want back.
Conventions in the examples#
Example output appears as an inline # … comment beside the expression that produces it.
Examples assume default modifiers unless they show /i, /x, etc.
Where a pattern is shown on its own (no =~), treat it as a fragment the surrounding text will combine with a string.
When a Unicode character is needed, examples use \x{263a} or \N{GREEK SMALL LETTER SIGMA} so the rendered text matches the source.
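The conventions above can be seen together in a short sketch. The variable names are hypothetical, and the named-character form is assumed to resolve without an explicit use charnames, as stock Perl does from 5.16 on.

```perl
use strict;
use warnings;

# A pattern shown on its own is a fragment; qr// keeps it as one until needed.
my $sigma = qr/\N{GREEK SMALL LETTER SIGMA}/;

# \x{...} names a character by code point, so the source stays plain ASCII.
my $smiley = "\x{263a}";
my $word   = "\x{3c3}\x{3bf}\x{3c6}";    # three Greek letters, small sigma first

my $found_sigma  = $word   =~ $sigma     ? 1 : 0;
my $found_smiley = $smiley =~ /\x{263a}/ ? 1 : 0;
my $found_upper  = $word   =~ /\x{3a3}/  ? 1 : 0;    # capital sigma, no /i

print "$found_sigma\n";     # 1
print "$found_smiley\n";    # 1
print "$found_upper\n";     # 0
```

The last test fails to match because default modifiers are in force: without /i, capital sigma (U+03A3) does not match the small sigma in the string.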