Groups and captures#

Grouping does two things, which are easy to confuse: it turns a sequence of pattern elements into a single unit for quantification and alternation, and it saves what that unit matched for later use.

/house(cat|keeper)/;       # 'house' followed by 'cat' or 'keeper'
/(ab){3}/;                 # 'ababab'
/(\d{3})-(\d{4})/;         # capture two groups separated by '-'

Beyond the plain (…) capturing form, Perl provides a small zoo of parenthesised constructs: non-capturing groups for clean grouping, named captures for self-documenting patterns, atomic groups for controlling backtracking, recursive subpatterns for nested structures, conditional patterns for branching on prior captures, and the branch-reset construct for parallel alternatives. This chapter covers all of them.

Capturing groups: `$1`, `$2`, …#

Every pair of unescaped parentheses in a pattern opens a capturing group. After a successful match the matched text of the nth group is in $n:

if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
    my ($hours, $minutes, $seconds) = ($1, $2, $3);
}

In list context, a match returns the list of captured strings directly:

my ($h, $m, $s) = $time =~ /(\d\d):(\d\d):(\d\d)/;

If the pattern fails, the list is empty - a useful idiom for “parse or give up”:

my ($h, $m, $s) = $time =~ /(\d\d):(\d\d):(\d\d)/
    or die "not a time: $time";

There is no upper limit on the number of capture groups. Groups are numbered with the leftmost open parenthesis as group 1, the next as group 2, and so on:

/(ab(cd|ef)((gi)|j))/
 1  2      34

$1 captures the outer group, $2 the first inner, $3 the next, $4 the innermost.
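
As a quick check of the numbering rule, the sketch below runs the nested pattern against a sample string. The input `abefj` is an arbitrary choice for illustration; it takes the `j` branch, so the innermost group never participates.

```perl
use strict;
use warnings;

# Sample input chosen for illustration; groups are numbered by opening paren.
my @groups;
if ("abefj" =~ /(ab(cd|ef)((gi)|j))/) {
    @groups = ($1, $2, $3, $4);
    for my $n (1 .. 4) {
        my $text = defined $groups[$n - 1] ? "'$groups[$n - 1]'" : 'undef';
        print "\$$n = $text\n";
    }
}
# $1 = 'abefj'
# $2 = 'ef'
# $3 = 'j'
# $4 = undef
```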

Captured the empty string vs. did not match at all#

A capture group that did not participate in the match has $n undefined. A group that participated and matched an empty string has $n defined and equal to "". The distinction matters:

"aba" =~ / a (x)* b \g1 a /x;   # does NOT match
"aba" =~ / a (x)? b \g1 a /x;   # does NOT match
"aba" =~ / a (x*) b \g1 a /x;   # matches; $1 = ""
"aba" =~ / a (x?) b \g1 a /x;   # matches; $1 = ""

In the first two cases, the quantifier is outside the group and matches zero iterations, so the group is never entered and $1 stays undefined. The backreference \g1 therefore has no value to compare against and fails. In the last two cases, the quantifier is inside the group, so the group runs exactly once and captures an empty string; \g1 matches the same empty string at that position.

The lesson: when a group is optional, put the quantifier inside if you want the empty-match-counts-as-match semantics.

Always check captures with defined, not for truth - an empty capture is defined but false, and so is a capture of "0":

if ("x" =~ /(a)?(x)/) {
    print "1 is $1\n" if defined $1;   # $1 is undef here
    print "2 is $2\n" if defined $2;
}
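
A small sketch of why the truth test misleads: a group that captures the single digit `0` participated in the match, yet its value is false. The two inputs are illustrative only.

```perl
use strict;
use warnings;

# A capture of "0" (or "") is defined, yet false in boolean context.
my @report;
for my $input ("0", "7") {
    if ($input =~ /(\d)/) {
        my $by_truth   = $1 ? "yes" : "no";
        my $by_defined = defined $1 ? "yes" : "no";
        push @report, "input=$input truth=$by_truth defined=$by_defined";
    }
}
print "$_\n" for @report;
# input=0 truth=no defined=yes
# input=7 truth=yes defined=yes
```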

Failed matches do not reset capture variables#

If a match fails, $1, $2, … keep their previous values from the last successful match in the same scope. This is a feature: it lets you write a series of more specific patterns and refer to the best match’s captures afterwards.

"foo" =~ /(\w+)/ and "" =~ /(\d+)/;   # second fails
print $1;   # "foo" - first match's capture survives

It is also a common source of confusion. Always check the match’s return value before reading the capture variables.
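
One way to make that check hard to forget is to wrap the match in a helper that returns the capture only on success. This is a minimal sketch; `first_number` is a hypothetical name, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first run of digits, or undef on failure.
sub first_number {
    my ($text) = @_;
    return $text =~ /(\d+)/ ? $1 : undef;
}

my $hit  = first_number("order 42");    # '42'
my $miss = first_number("no digits");   # undef
print defined $miss ? "got $miss\n" : "no number\n";   # prints "no number"
```

The caller never reads $1 directly, so a stale value from an earlier match cannot leak through.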

Non-capturing groups: `(?:…)`#

If you only need the grouping for quantification or alternation, and don’t want the capture, use (?:…):

/(?:ab){3}/;            # 'ababab', no capture
/(?:\d+\.)*\d+/;        # a dotted decimal, no captures at all

Non-capturing groups are a small speed win and a larger clarity win. They signal “this grouping is for syntax, not data”. They also prevent renumbering of the capturing groups you do care about:

# match a number - $1 = whole, $2 = optional exponent value
/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

Without the (?:…) wrappings, $2, $3, $4 would all be set and the intended $2 (the exponent) would shift to $5.

The scoped form (?flags:…) (e.g. (?i:cat), (?xms:…)) attaches modifiers to the inner pattern only and is also non-capturing - see the modifiers chapter.

split also benefits from (?:…). split /(?:\s+)/ separates on runs of whitespace without inserting the separators into the output; split /(\s+)/ leaves them in alternating positions.
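
The difference is easy to see on a short string. The sample line below is illustrative only:

```perl
use strict;
use warnings;

my $line = "alpha beta  gamma";

# Non-capturing separator: only the fields come back.
my @fields = split /(?:\s+)/, $line;     # ('alpha', 'beta', 'gamma')

# Capturing separator: the matched separators are interleaved with the fields.
my @with_sep = split /(\s+)/, $line;     # ('alpha', ' ', 'beta', '  ', 'gamma')

printf "%d fields, %d with separators\n", scalar @fields, scalar @with_sep;
# 3 fields, 5 with separators
```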

Named captures#

(?<name>…) or (?'name'…) names a group. Its match is accessible through the hash %+:

if ("2026-04-23" =~ /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/) {
    print "year = $+{year}\n";    # 2026
    print "month = $+{month}\n";  # 04
    print "day = $+{day}\n";      # 23
}

Named groups also populate $1, $2, … in the usual left-to-right order, so code that uses both conventions works. Inside the pattern, reference a named group with \k<name> (or any of the brace/quote forms below):

/(?<quote>["'])(.*?)\k<quote>/;   # same quote at start and end

Names follow Perl-identifier rules ([_A-Za-z][_A-Za-z0-9]*); they cannot begin with a digit and cannot contain hyphens.

If two distinct groups share a name, $+{name} refers to the leftmost defined group in the match. This is unusual to want outside a branch reset ((?|…)), where shared names are idiomatic.
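
A minimal sketch of the shared-name rule, using a deliberately tiny pattern chosen for illustration. The two groups occupy different numbered slots, but $+{n} reports whichever one is defined:

```perl
use strict;
use warnings;

# Two groups share the name 'n'; only one of them can take part in a match.
my $re = qr/(?<n>a)|(?<n>b)/;

my %named;
for my $input ("a", "b") {
    if ($input =~ $re) {
        $named{$input} = $+{n};
        my $slot = defined $1 ? 1 : 2;
        print "input=$input  \$+{n}=$+{n}  (numbered slot \$$slot)\n";
    }
}
# input=a  $+{n}=a  (numbered slot $1)
# input=b  $+{n}=b  (numbered slot $2)
```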

`(?P<name>…)` - Python/PCRE compatibility#

For programmers porting from Python’s re module or PCRE, Perl accepts the Python-style spellings:

Python form	Perl equivalent
`(?P<NAME>...)`	`(?<NAME>...)`
`(?P=NAME)`	`\k<NAME>`
`(?P>NAME)`	`(?&NAME)`

The Python forms work but are not idiomatic Perl; prefer the native forms in new code.
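
To illustrate the equivalence, the sketch below writes the same doubled-word pattern in both spellings and runs each against a sample sentence (the sentence is an assumption for the example):

```perl
use strict;
use warnings;

my $python_style = qr/\b(?P<word>\w+)\s+(?P=word)\b/;   # Python-compatible spelling
my $native_style = qr/\b(?<word>\w+)\s+\k<word>\b/;     # native Perl spelling

my @found;
for my $re ($python_style, $native_style) {
    push @found, $+{word} if "it is is fine" =~ $re;
}
print "@found\n";   # is is
```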

Backreferences#

A backreference in a pattern demands that a later position match the same text an earlier group captured - not the same pattern, the same actual characters.

Form	Refers to
`\1` … `\9`	first through ninth capturing group (legacy form)
`\g1`, `\g2`	numbered capture; equivalent to `\1`, `\2`
`\g{1}`	brace form; required when digits would otherwise follow
`\g-1`	most-recently-opened capturing group (relative)
`\g{-2}`	second-most-recently-opened group
`\k<name>`	named capture
`\k'name'`	same, single-quote form
`\k{name}`	same, brace form (allows surrounding spaces)
`\g{name}`	named, alternative spelling

Examples:

# Match a three-letter word followed by a space and the same word.
"the the other day" =~ /\b(\w{3})\s\1\b/;   # $1 eq 'the'

# Match a four-letter, three-letter, two-letter, or one-letter
# run followed by itself.
/^(\w{1,4})\1$/;    # 'beriberi', 'booboo', 'coco', 'mama', 'papa'

`\g{…}` versus `\1` for disambiguation#

Use \g{…} (or \k<…>) when digits or octal-looking characters follow the reference, to avoid ambiguity:

/(\d)abc\g{1}23/;    # the '1' refers to group 1, '23' is literal
/(\d)abc\123/;       # '\123' is the octal escape for 'S' (0x53), not group 1

Perl’s rule for the bare \N form: \1 through \9 always mean backreferences. \10, \11, … mean a backreference only if that many capture groups have opened earlier in the pattern; otherwise they are octal character literals. This is why \g{...} is the safer form when the pattern is built by concatenation.

If you use the brace form, optional surrounding spaces are permitted - \g{ -1 } and \k{ name } are valid.
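
The concatenation hazard can be sketched as follows. `$tail` stands in for any fragment appended at run time; the names are hypothetical.

```perl
use strict;
use warnings;

my $tail = '23';   # hypothetical fragment appended at run time

my $braced = '(\d)abc\g{1}' . $tail;   # (\d)abc\g{1}23
my $bare   = '(\d)abc\1'    . $tail;   # (\d)abc\123 : now an octal escape

my $braced_ok = "7abc723" =~ /$braced/ ? 1 : 0;   # 1: group 1, then literal '23'
my $bare_ok   = "7abc723" =~ /$bare/   ? 1 : 0;   # 0: no backreference left
my $bare_is_S = "7abcS"   =~ /$bare/   ? 1 : 0;   # 1: \123 matched 'S'

print "braced=$braced_ok bare=$bare_ok bare_matches_S=$bare_is_S\n";
# braced=1 bare=0 bare_matches_S=1
```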

Relative backreferences#

\g-1, \g{-1} refer to the immediately previous capturing group; \g{-2} refers to the one before that; and so on. The distance is counted in capture groups that precede the reference, so the reference travels with its fragment. This matters when a pattern fragment is interpolated inside another:

my $pair = '([a-z])(\d)\g{-1}\g{-2}';   # a11a, g22g, x33x, ...

# Embed it: outer group shifts numbering by 1, but relative
# backreferences still work:
"code=e99e" =~ /^(\w+)=$pair$/;   # matches

Without relative backreferences, this would require knowing how many groups precede $pair at every interpolation site.

Named and relative references make long patterns robust against copy-paste.
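
For contrast, here is a sketch of the same fragment written with absolute backreferences. `$absolute` is a hypothetical variant introduced only for this comparison; it works on its own but breaks as soon as another group precedes it.

```perl
use strict;
use warnings;

my $relative = '([a-z])(\d)\g{-1}\g{-2}';
my $absolute = '([a-z])(\d)\2\1';          # hypothetical absolute-numbered variant

my %ok = (
    relative_alone    => ("e99e"      =~ /\A$relative\z/        ? 1 : 0),
    absolute_alone    => ("e99e"      =~ /\A$absolute\z/        ? 1 : 0),
    relative_embedded => ("code=e99e" =~ /\A(\w+)=$relative\z/  ? 1 : 0),
    # Embedded, \2 now means ([a-z]) and \1 means (\w+), so the match fails.
    absolute_embedded => ("code=e99e" =~ /\A(\w+)=$absolute\z/  ? 1 : 0),
);

print "$_=$ok{$_}\n" for sort keys %ok;
# absolute_alone=1
# absolute_embedded=0
# relative_alone=1
# relative_embedded=1
```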

Atomic groups: `(?>…)`#

(?>…) is a non-capturing group whose contents, once matched, are committed. The engine cannot backtrack into the group on failure - only past it as a whole.

"aaab" =~ /a*ab/;       # matches: a* gives back one 'a'
"aaab" =~ /(?>a*)ab/;   # does not match: a* takes all, refuses to give

The full treatment lives in the performance chapter. The construct is documented here because its syntax is a parenthesised group; structurally it belongs in the captures chapter alongside (?:…).

Possessive quantifiers (*+, ++, ?+, {n,m}+) are exact syntactic sugar for (?>…) around the quantified atom. The following are equivalent:

Possessive	Atomic-group form
`PAT*+`	`(?>PAT*)`
`PAT++`	`(?>PAT+)`
`PAT?+`	`(?>PAT?)`
`PAT{n,m}+`	`(?>PAT{n,m})`

The long-form spelling (*atomic:…) is also accepted.
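
A short sketch comparing the three spellings on the string used earlier; the hash keys are just labels for the example:

```perl
use strict;
use warnings;

my %matched = (
    greedy       => ("aaab" =~ /a*ab/     ? 1 : 0),   # a* gives back one 'a'
    atomic       => ("aaab" =~ /(?>a*)ab/ ? 1 : 0),   # a* keeps all three
    possessive   => ("aaab" =~ /a*+ab/    ? 1 : 0),   # same as the atomic form
    atomic_then_b => ("aaab" =~ /(?>a*)b/ ? 1 : 0),   # nothing needs giving back
);

print "$_=$matched{$_}\n" for qw(greedy atomic possessive atomic_then_b);
# greedy=1
# atomic=0
# possessive=0
# atomic_then_b=1
```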

Recursive subpatterns#

Perl regexps can call a capture group as if it were a subroutine. The construct re-runs the group's pattern (not the text it captured) at the current position. This makes genuinely recursive structures matchable - balanced parentheses, S-expressions, nested brackets - without falling out of the regex DSL.

Form	Recurses to
`(?R)`	the entire pattern
`(?0)`	same as `(?R)`
`(?1)`	group 1
`(?2)`	group 2 (and so on)
`(?-1)`	most-recently-opened group (relative)
`(?+1)`	next group to be opened (forward relative)
`(?&NAME)`	named group
`(?P>NAME)`	same as `(?&NAME)` (Python-compatible)

Note: relative recursion counts unclosed groups, unlike relative backreferences. (?-1) always means the most recently opened group whether or not it has closed.
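
A minimal sketch of that rule: in the pattern below, (?-1) sits inside group 1, which is still open at that point, and recurses into it. The input string is an assumption for the example.

```perl
use strict;
use warnings;

# Group 1 is still open where (?-1) appears, and is the target of the call.
my $group = qr/(\((?:[^()]++|(?-1))*+\))/;

my $inner;
if ("x(a(b))y" =~ $group) {
    $inner = $1;
}
print "$inner\n";   # (a(b))
```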

A balanced-paren matcher:

my $bal = qr/
    (?(DEFINE)
        (?<paren>
            \(                  # opening paren
            (?:
                [^()]++         # non-paren run, possessive
              | (?&paren)       # or a balanced sub-group, recursively
            )*+
            \)                  # closing paren
        )
    )
    (?&paren)
/x;

"((a)(b(c)))" =~ /^$bal$/;   # matches

The (?(DEFINE)...) block declares a named subpattern without matching anything itself; the (?&paren) after it invokes that subpattern, and the recursive (?&paren) inside the body is what gives the unbounded depth. (?R) would not work here - it recurses the whole enclosing pattern, including the ^ and $ anchors that get added at the call site, which forces every level of recursion to demand start- and end-of-string.
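
When no anchors are involved, (?R) is a perfectly good spelling. The following sketch finds the first balanced group inside a longer string; the qr object is applied directly rather than interpolated, so "the entire pattern" is just the balanced-paren pattern itself. The sample input is illustrative.

```perl
use strict;
use warnings;

# (?R) recurses into the whole pattern, which here is only the paren matcher.
my $balanced = qr/\((?:[^()]++|(?R))*+\)/;

my $found;
if ("f(a(b)c) tail" =~ $balanced) {
    $found = $&;
}
print "$found\n";   # (a(b)c)
```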

A more nuanced recursive pattern: match a function foo(...) where the argument may itself contain balanced parens.

my $re = qr/(            # group 1: full function call
              foo
              (          # group 2: parens with content
                \(
                  (      # group 3: contents of parens
                    (?:
                       (?> [^()]+ )   # non-paren without backtracking
                     |
                       (?2)           # recurse to group 2
                    )*
                  )
                \)
              )
            )/x;

'foo(bar(baz)+baz(bop))' =~ /$re/ and
    print "1: $1\n2: $2\n3: $3\n";
# 1: foo(bar(baz)+baz(bop))
# 2: (bar(baz)+baz(bop))
# 3: bar(baz)+baz(bop)

Capture state inside recursion#

When a group recurses into itself, captures set inside the recursion are not visible to the caller after the recursion returns. The recursive call has its own capture state, which is discarded on return. This is why most recursive patterns wrap a secondary capture group around the recursive call when the matched text is needed:

/(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
 (?(DEFINE)
   (?<NAME_PAT>....)
   (?<ADDRESS_PAT>....)
 )/x

Here $+{NAME} is the outer capture; $+{NAME_PAT} is undefined because it lived only inside the recursion.
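
The same shape in a runnable sketch, with hypothetical names WORD and WORD_PAT standing in for the elided subpatterns:

```perl
use strict;
use warnings;

my $re = qr/
    (?<WORD> (?&WORD_PAT) )      # outer capture wraps the call
    (?(DEFINE)
        (?<WORD_PAT> [a-z]+ )    # only ever runs as a callee
    )
/x;

my ($outer, $inner);
if ("hello" =~ $re) {
    $outer = $+{WORD};       # 'hello'
    $inner = $+{WORD_PAT};   # undef: set only inside the call
}
printf "outer=%s inner=%s\n", $outer, defined $inner ? $inner : 'undef';
# outer=hello inner=undef
```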

The recursion-depth ceiling is compiled into the engine. Patterns that recurse without consuming input raise a fatal error rather than running forever - the engine detects the cycle.

`(DEFINE)` blocks#

The (?(DEFINE)…) block holds named subpatterns that never run on their own - they are only invoked through (?&NAME). This is how to write a regex that looks like a small grammar:

my $email = qr/
    \A (?&LOCAL) @ (?&DOMAIN) \z
    (?(DEFINE)
        (?<LOCAL>   [\w.+-]+ )
        (?<DOMAIN>  (?&LABEL) (?: \. (?&LABEL) )+ )
        (?<LABEL>   [a-zA-Z0-9] (?: [a-zA-Z0-9-]* [a-zA-Z0-9] )? )
    )
/x;

The DEFINE block at the end keeps the named subpatterns in one place; the top of the pattern reads as a clean specification.

Two cautions:

Subpatterns in a DEFINE block count toward the absolute and relative numbering of capture groups in the surrounding pattern. Name everything in the DEFINE so you don’t have to count.
The optimiser is less effective on DEFINE-style patterns than on the equivalent inlined form. They run, but not always at top speed.
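
A usage sketch for the grammar-style pattern above. The pattern is repeated so the block runs on its own (with the `@` escaped to rule out any array interpolation), and the test addresses are made up for illustration; this is a simplified shape check, not a full address validator.

```perl
use strict;
use warnings;

my $email = qr/
    \A (?&LOCAL) \@ (?&DOMAIN) \z
    (?(DEFINE)
        (?<LOCAL>   [\w.+-]+ )
        (?<DOMAIN>  (?&LABEL) (?: \. (?&LABEL) )+ )
        (?<LABEL>   [a-zA-Z0-9] (?: [a-zA-Z0-9-]* [a-zA-Z0-9] )? )
    )
/x;

my %verdict;
for my $addr ('user@example.com', 'first.last+tag@mail.example.org',
              'user@localhost', 'user@-bad.com') {
    $verdict{$addr} = $addr =~ $email ? 'ok' : 'rejected';
    print "$addr: $verdict{$addr}\n";
}
# user@example.com: ok
# first.last+tag@mail.example.org: ok
# user@localhost: rejected      (DOMAIN needs at least one dot)
# user@-bad.com: rejected       (LABEL cannot start with '-')
```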

Conditional patterns: `(?(cond)yes|no)`#

A conditional pattern picks between two sub-patterns based on a runtime test:

(?(N)YES|NO) - match YES if group N matched something, else NO.
(?(<NAME>)YES|NO) - same, but by name.
(?(?=LOOK)YES|NO) - match YES if the lookahead succeeds.
(?(?{CODE})YES|NO) - match YES if the code block returns true.
(?(R)YES|NO) - match YES if currently inside a recursion.
(?(R1)YES|NO) - match YES if recursing through group 1.
(?(R&NAME)YES|NO) - match YES if recursing through named group.

The NO branch is optional; missing NO is treated as “always match”.

Worked example: optionally parenthesised text. If the input opens with (, require a closing ). Otherwise allow no parens:

m{ ( \( )?            # optional opening paren, group 1
   [^()]+             # body (no parens)
   (?(1) \) )         # if group 1 matched, require closing paren
}x;

Without conditional patterns, this would need an alternation covering both shapes; the conditional version exposes the intent.
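
A sketch of the conditional in use. The \A and \z anchors are an addition for this example, so that a stray paren makes the whole string fail rather than letting the pattern match a substring:

```perl
use strict;
use warnings;

my $maybe_parens = qr{
    \A
    ( \( )?            # optional opening paren, group 1
    [^()]+             # body (no parens)
    (?(1) \) )         # closing paren required only if group 1 matched
    \z
}x;

my %ok;
for my $input ('(abc)', 'abc', '(abc', 'abc)') {
    $ok{$input} = $input =~ $maybe_parens ? 1 : 0;
    print "$input => $ok{$input}\n";
}
# (abc) => 1
# abc => 1
# (abc => 0
# abc) => 0
```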

Worked example: balanced grammar with named recursion. Useful inside a DEFINE block for context-sensitive validation:

qr/
    (?<expr>
        (?<atom>  [a-z]+ | \( (?&expr) \) )
        (?: \s* [+*] \s* (?&atom) )*
    )
/x

Conditionals are the part of regex that most resembles a real programming language; reach for them when an alternation of mutually-exclusive shapes would be unwieldy.

Branch reset: `(?|…)`#

Inside (?|…), every alternative branch starts numbering its captures at the same slot. After the group, numbering resumes one past the maximum across all branches. The full treatment is in the alternation chapter; the capture-numbering implications belong here.

# Before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4

After this pattern matches:

$1 is always the leading a.
$2 is y, pqr, or t depending on branch.
$3 is undef in branch 1, q in branch 2, v in branch 3.
$4 is the trailing z.
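
The slot assignments can be checked with one input per branch. The three input strings are constructed for this example:

```perl
use strict;
use warnings;

my $re = qr/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x;

my %got;
for my $input (qw(axyzz apqrz atuvz)) {
    if ($input =~ $re) {
        $got{$input} = join ",",
            map { defined($_) ? $_ : 'undef' } ($1, $2, $3, $4);
        print "$input: $got{$input}\n";
    }
}
# axyzz: a,y,undef,z
# apqrz: a,pqr,q,z
# atuvz: a,t,v,z
```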

If you use named captures inside (?|…), use the same names in the same order in every branch:

/(?|  (?<a> x ) (?<b> y )
   |  (?<a> z ) (?<b> w )) /x;

Mixing names across branches works (Perl resolves to the leftmost defined name) but produces surprising results: every name references the same slot, so $+{a} and $+{b} may have the same value across different branches.

Position arrays: `@-` and `@+`#

After a successful match, @- and @+ hold the start and end offsets of the whole match and of each capture group:

$-[0], $+[0] - offsets of the whole match.
$-[n], $+[n] - offsets of the nth capture, or undef if the group did not participate.

my $s = "Mmm...donut, thought Homer";
if ($s =~ /^(Mmm|Yech)\.\.\.(donut|peas)/) {
    for my $i (1 .. $#-) {
        printf "Match %d: %s at (%d,%d)\n",
               $i,
               substr($s, $-[$i], $+[$i] - $-[$i]),
               $-[$i], $+[$i];
    }
}
# Match 1: Mmm at (0,3)
# Match 2: donut at (6,11)

Offsets are often easier than substrings when you need to modify the original string at the matched position.
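
For instance, the offsets of group 2 can drive a four-argument substr to replace the captured text in place. A minimal sketch, reusing the string from the example above; the replacement word is arbitrary:

```perl
use strict;
use warnings;

my $s = "Mmm...donut, thought Homer";
if ($s =~ /^(Mmm|Yech)\.\.\.(donut|peas)/) {
    # Copy the offsets first, then overwrite group 2 in the original string.
    my ($start, $end) = ($-[2], $+[2]);
    substr($s, $start, $end - $start, "peas");
}
print "$s\n";   # Mmm...peas, thought Homer
```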

`@{^CAPTURE}` - captures as an array#

Perl exposes all numbered captures as a single array @{^CAPTURE}, indexed from 0 (where index 0 is $1, index 1 is $2, …):

$string =~ /$pattern/ and my @captured = @{^CAPTURE};

This is convenient when the number of captures is variable or unknown - code that takes captures from a user-supplied pattern no longer has to count ( to know how many $1/$2/… to read.

Subscripting requires the demarcated-curly form:

print "${^CAPTURE[0]}";    # equivalent to $1

%{^CAPTURE} (an alias for %+) and %{^CAPTURE_ALL} (an alias for %-) exist for named-capture introspection - see perlvar.
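
A short sketch of the array form, assuming a perl recent enough to provide @{^CAPTURE} (5.26 or later, hence the `use v5.26` line). The pattern stands in for one supplied at run time:

```perl
use v5.26;      # assumption: @{^CAPTURE} needs perl 5.26 or later
use warnings;

# Stand-in for a user-supplied pattern; the group count is not known up front.
my $pattern = qr/(\d+)-(\d+)-(\d+)/;

my @captured;
if ("2026-04-23" =~ $pattern) {
    @captured = @{^CAPTURE};
}
print scalar(@captured), " captures: @captured\n";
# 3 captures: 2026 04 23
```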

Prematch, match, postmatch#

Perl sets three special scalars after each match that expose the surrounding text:

$` - everything before the match (the pre-match).
$& - the match itself.
$' - everything after the match (the post-match).

"the cat caught the mouse" =~ /cat/;
# $`   = 'the '
# $&   = 'cat'
# $'   = ' caught the mouse'

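The named variants ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} hold the same values. Since Perl 5.20 neither form carries a performance penalty, so use whichever reads better; on older perls the punctuation variables slowed every match in the program, and the named variants were only set when the match used the /p modifier.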
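
The named variants in a runnable sketch. The /p modifier is included so the example also behaves on perls older than 5.20; on newer perls it is harmless.

```perl
use strict;
use warnings;

my ($pre, $match, $post);
if ("the cat caught the mouse" =~ /cat/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}
print "[$pre][$match][$post]\n";
# [the ][cat][ caught the mouse]
```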

`$+` and `$^N`#

$+ holds the match of the highest-numbered capture group that succeeded.

$^N holds the match of the most-recently-closed capture group - the rightmost ) that completed. This is exactly what you want inside a (?{…}) code block to access the latest capture without counting parens:

$_ = "The brown fox jumps over the lazy dog";
/the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
print "color = $color, animal = $animal\n";
# color = brown, animal = fox

See the performance chapter for the full embedded-code story.
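
$+ is most useful when an alternation puts the interesting text in a different group on each branch. A minimal sketch, with two made-up input lines:

```perl
use strict;
use warnings;

my @values;
for my $line ("Version: 1.2", "Revision: 42") {
    # Whichever branch matched, $+ holds its capture.
    if ($line =~ /Version: (.*)|Revision: (.*)/) {
        push @values, $+;
    }
}
print "@values\n";   # 1.2 42
```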

POSIX subexpression maximisation - what other engines do#

Some engines (any conformant POSIX NFA, plus engine implementations designed to mimic POSIX, plus most DFA hybrids) follow a different rule for which match wins when multiple are possible: longest-leftmost, with the constraint that each subexpression captures the longest substring it can.

For the pattern (to|top)(o|polo)?(gical|o?logical) against topological:

Perl (and PCRE2, and any Traditional NFA) tries alternatives left-to-right. (to|top) matches to; the engine pushes on. (o|polo) cannot match o (the next character is p) but matches polo; pushes on. (gical|o?logical) matches gical. Done: to polo gical.
POSIX semantics requires the overall match to be longest and, among overall matches of equal length, each group's capture to be as long as possible, with earlier groups taking priority. The result is top o logical: the first group is at its widest, and the later groups take what is left.

Modern engines essentially never use POSIX-NFA semantics, but some DFA-based tools (grep, awk) approximate it. The cross-engine chapter has the full table.
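
Perl's side of the comparison is easy to confirm. The sketch below anchors the pattern (an addition for this example) and prints the three captures produced by left-to-right alternation:

```perl
use strict;
use warnings;

my @parts;
if ("topological" =~ /^(to|top)(o|polo)?(gical|o?logical)$/) {
    @parts = ($1, $2, $3);
}
print "@parts\n";   # to polo gical
```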

Cross-engine: captures and backreferences#

This is the one feature where engines diverge most. The full table is in the cross-engine chapter; the relevant rows extracted:

Feature	Perl 5.44	PCRE2	Emacs	POSIX BRE	POSIX ERE	RE2/Go
Numbered captures	yes	yes	yes	yes	yes	yes (no backrefs)
Backreferences in pattern (`\1`–)	yes	yes	yes	yes (`\1`–`\9`)	strict no, GNU yes	NO
Named captures `(?<name>...)`	yes	yes	NO	NO	NO	yes
Branch reset `(?\|…)`	yes	yes	NO	NO	NO	NO
`@{^CAPTURE}` array	yes	NO	NO	NO	NO	NO (different API)
Recursive subpatterns `(?R)` etc.	yes	yes	NO	NO	NO	NO
Conditional patterns `(?(c)y\|n)`	yes	yes	NO	NO	NO	NO

The single most important takeaway: RE2 / Go regexp does not support backreferences at all. This is not an oversight; it is the price RE2 pays for guaranteed linear-time matching. The languages that backreferences describe are in general not regular, so no DFA can recognise them, let alone in linear time.

If you are porting Perl regexps to a Go service, audit them for backreferences before assuming the conversion is mechanical. Patterns like /(?<a>\w+) and \k<a>/ simply cannot be expressed in RE2.
