---
name: regex groups and captures
---

# Groups and captures

Grouping does two things, which are easy to confuse: it turns a sequence of pattern elements into a single unit for quantification and alternation, and it saves what that unit matched for later use.

```perl
/house(cat|keeper)/;    # 'house' followed by 'cat' or 'keeper'
/(ab){3}/;              # 'ababab'
/(\d{3})-(\d{4})/;      # capture two groups separated by '-'
```

## Capturing groups: $1, $2, …

Every pair of plain, unescaped parentheses in a pattern forms a capturing group; parentheses that begin with `(?` are special constructs, some of which are described below. After a successful match, the text matched by the nth group is in `$n`:

```perl
if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
    my ($hours, $minutes, $seconds) = ($1, $2, $3);
}
```

In list context, a match returns the list of captured strings directly:

```perl
my ($h, $m, $s) = $time =~ /(\d\d):(\d\d):(\d\d)/;
```

If the pattern fails, the list is empty, which gives a useful idiom for "parse or give up":

```perl
my ($h, $m, $s) = $time =~ /(\d\d):(\d\d):(\d\d)/
    or die "not a time: $time";
```

Nested groups are numbered by the position of their opening `(`, in left-to-right order:

```
/(ab(cd|ef)((gi)|j))/
 1  2      34
```

`$1` captures the outer group, `$2` the first inner, `$3` the next, `$4` the innermost.

A capture group that did not participate in the match leaves its `$n` undefined. Check with `defined`, not truthiness:

```perl
if ("x" =~ /(a)?(x)/) {
    print "1 is $1\n" if defined $1;    # $1 is undef here
    print "2 is $2\n" if defined $2;
}
```

## Non-capturing groups

If you only need the grouping for quantification or alternation, and don't want the capture, use `(?:…)`:

```perl
/(?:ab){3}/;          # 'ababab', no capture
/(?:\d+\.)*\d+/;      # a dotted decimal, no captures at all
```

Non-capturing groups are a small speed win and a larger clarity win. They signal "this grouping is for syntax, not data".
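
One way to see the difference is to look at what a match returns in list context. The following is a minimal sketch; the sample string and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $input = 'ababab';    # hypothetical sample string

# Capturing group: list context returns the captures. A quantified
# group keeps only the text of its last repetition.
my @captured = $input =~ /^(ab){3}$/;

# Non-capturing group: the pattern has no capture groups at all,
# so a successful match in list context returns the list (1).
my @plain = $input =~ /^(?:ab){3}$/;

print "capturing:     @captured\n";    # capturing:     ab
print "non-capturing: @plain\n";       # non-capturing: 1
```

Both patterns match the same strings; only the bookkeeping differs.
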
Non-capturing groups also keep the numbering of the capturing groups you do care about stable:

```perl
# match a number: $1 = whole, $2 = optional exponent value
/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
```

If the three `(?:…)` groups were plain parentheses, they would take `$2`, `$3`, and `$4`, and the intended `$2` (the exponent) would shift to `$5`.

`split` also benefits from `(?:…)`. `split /(?:\s+)/` separates on runs of whitespace without inserting the separators into the output; `split /(\s+)/` leaves them in alternating positions.

## Named captures

`(?<name>…)` or `(?'name'…)` names a group. Its match is accessible through the hash `%+`:

```perl
if ("2026-04-23" =~ /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/) {
    print "year  = $+{year}\n";     # 2026
    print "month = $+{month}\n";    # 04
    print "day   = $+{day}\n";      # 23
}
```

Named groups also populate `$1`, `$2`, … in the usual left-to-right order, so code that uses both conventions works.

Inside the pattern itself, reference a named group with `\k<name>` (or `\k'name'`):

```perl
/(?<quote>["'])(.*?)\k<quote>/;    # same quote at start and end
```

## Backreferences

A backreference in a pattern demands that a later position match the *same text* an earlier group captured: not the same pattern, the same actual characters.

| Form       | Refers to                                                  |
|------------|------------------------------------------------------------|
| `\1`       | first capturing group                                      |
| `\g1`      | same as `\1`, never read as an octal escape                |
| `\g{1}`    | same; braces delimit the number, so digits may follow      |
| `\g-1`     | immediately previous capturing group (relative)            |
| `\g{-2}`   | second-previous capturing group                            |
| `\k<name>` | named capturing group                                      |
| `\k{name}` | same, alternate brace form                                 |

Examples:

```perl
# Match a three-letter word followed by a space and the same word.
"the the other day" =~ /\b(\w{3})\s\1\b/;    # $1 eq 'the'

# Match a four-letter, three-letter, two-letter, or one-letter
# run followed by itself.
/^(\w{1,4})\1$/;    # 'beriberi', 'booboo', 'coco', 'mama', 'papa'
```

Use `\g{…}` when digits follow the reference to avoid ambiguity:

```perl
/(\d)abc\g{1}23/;    # the '1' refers to group 1, '23' is literal
/(\d)abc\123/;       # '\123' is the octal escape for 'S' (0x53), not group 1
```

Relative backreferences (`\g-1`, `\g{-2}`) count backwards from the reference and refer to the *nth-most recently opened* group. They survive when the pattern is embedded inside another that adds outer groups in front:

```perl
my $pair = '([a-z])(\d)\g{-1}\g{-2}';    # a11a, g22g, x33x, ...

# Embed it: the outer group shifts numbering by 1, but relative
# backreferences still work:
"code=e99e" =~ /^(\w+)=$pair$/;          # matches
```

Named and relative references make long patterns robust against cut-and-paste.

## The position arrays: @- and @+

After a successful match, `@-` and `@+` hold the start and end offsets of the whole match and of each capture group:

- `$-[0]`, `$+[0]`: offsets of the whole match.
- `$-[n]`, `$+[n]`: offsets of the nth capture, or undef if the group did not participate.

```perl
my $s = "Mmm...donut, thought Homer";
if ($s =~ /^(Mmm|Yech)\.\.\.(donut|peas)/) {
    for my $i (1 .. $#-) {
        printf "Match %d: %s at (%d,%d)\n",
            $i, substr($s, $-[$i], $+[$i] - $-[$i]),
            $-[$i], $+[$i];
    }
}
# Match 1: Mmm at (0,3)
# Match 2: donut at (6,11)
```

Offsets are often easier than substrings when you need to modify the original string at the matched position.

## Prematch, match, postmatch

Perl sets three special scalars after each successful match that expose the surrounding text:

- `` $` ``: everything before the match (the *pre-match*).
- `$&`: the match itself.
- `` $' ``: everything after the match (the *post-match*).

```perl
"the cat caught the mouse" =~ /cat/;
# $` = 'the '
# $& = 'cat'
# $' = ' caught the mouse'
```

On modern Perl (5.20 and later) these carry no performance penalty. Older guidance said to avoid them; that guidance no longer applies.
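
As a small check of how the three variables relate, the sketch below confirms that they partition the string: concatenating them reproduces the original. It reuses the sample sentence from the example above; the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = "the cat caught the mouse";

my ($pre, $match, $post);
if ($text =~ /cat/) {
    # Copy the values right away; the next successful match overwrites them.
    ($pre, $match, $post) = ($`, $&, $');
}

# Pre-match, match, and post-match together cover the whole string.
my $rebuilt = $pre . $match . $post;
print "round trip ok\n" if $rebuilt eq $text;    # round trip ok
```

Copying into lexical variables is a defensive habit: all three specials describe the most recent successful match in the current scope.
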
The named variants `${^PREMATCH}`, `${^MATCH}`, `${^POSTMATCH}` also exist. On Perl 5.20 and later they are set whether or not the `/p` modifier is present, because `/p` itself has been a no-op since that release.

## Alternative numbering across branches: (?|…)

Parallel alternatives sometimes need to capture into the same numbered slots regardless of which branch matches. `(?|…)`, known as branch reset, reuses the same group numbers in each alternative inside it:

```perl
if ($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z]{3})/) {
    # $1 is hour from whichever branch matched
    # $2 is minute from whichever branch matched
    # $3 is zone, numbered after both branches
    print "hour=$1 minute=$2 zone=$3\n";
}
```

Without `(?|…)` you would have to check `$1` against `$3` to see which branch matched, or use named captures. It is covered further in the alternation chapter.

## $+ and $^N

`$+` holds the match of the highest-numbered capture group that participated in the match. `$^N` holds the match of the most recently closed capture group (the rightmost `)` that completed), which is the one you want inside a `(?{…})` code assertion.

## Summary

- `(…)` captures; use `$1`, `$2`, … or `@-`, `@+`.
- `(?:…)` groups without capturing; prefer this when you don't need the match.
- `(?<name>…)` names the capture; access via `%+`, reference with `\k<name>`.
- Backreferences are `\1`, `\g{1}`, `\g{-1}`, `\k<name>`.
- Prematch, match, postmatch in `` $` ``, `$&`, `` $' ``.

## See also

- [`perlre`](../../p5/core/perlre): the full capture semantics including the `\K` (keep) assertion.
- [`perlvar`](../../p5/core/perlvar): the full list of capture-related special variables, including `%+`, `%-`, `$&`, `$^N`.
- The [alternation](alternation) chapter: `(?|…)` and branch reset.