Alternation#

Alternation is the | operator. It picks between two or more sub-patterns at the same position.

"cats and dogs" =~ /cat|dog|bird/;    # matches 'cat'
"cats and dogs" =~ /dog|cat|bird/;    # matches 'cat'

The order of the alternatives does not change where the overall pattern matches. The engine still honours “earliest position wins” — both patterns above match at position 0 because that’s the earliest position where any alternative can match.
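A minimal sketch to check this claim, using the built-in @- and @+ offset arrays to report where each match starts and what text it covers. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "cats and dogs";

# Try the same alternatives in two different orders.
my @results;
for my $re (qr/cat|dog|bird/, qr/dog|cat|bird/) {
    if ($text =~ $re) {
        # $-[0] and $+[0] are the start and end offsets of the whole match.
        my $matched = substr($text, $-[0], $+[0] - $-[0]);
        push @results, [ $-[0], $matched ];
        print "matched '$matched' at offset $-[0]\n";
    }
}
```

Both iterations report 'cat' at offset 0: 'dog' is listed first in the second pattern, but it cannot match until offset 9, so it never gets the chance.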

Leftmost alternative wins at a given position#

Within a single starting position, alternatives are tried left to right and the first one that succeeds is used:

"cats" =~ /c|ca|cat|cats/;     # matches 'c' — first alternative wins
"cats" =~ /cats|cat|ca|c/;     # matches 'cats' — first wins, longer

If one alternative is a prefix of another and you want the longer match, put the longer one first. The engine commits to the first alternative that succeeds at the current position and returns to a later one only if the rest of the pattern fails and it has to backtrack.

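An implication: in a complex pattern, order alternatives for correctness first and speed second. Put specific alternatives before the broad catch-alls that would otherwise shadow them; among alternatives that cannot shadow each other, listing the likelier ones first saves failed attempts.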
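When the alternatives come from data rather than being written by hand, a common idiom is to sort them longest first before joining them with |. The sketch below assumes a hypothetical @keywords list; it also shows that the engine does move on to later alternatives when the rest of the pattern cannot match.

```perl
use strict;
use warnings;

# Hypothetical keyword list; some entries are prefixes of others.
my @keywords = qw(c ca cats cat);

# Naive join keeps the list order, so the short prefix 'c' shadows the rest.
my $naive_alt = join '|', map { quotemeta($_) } @keywords;
my ($naive) = "cats" =~ /($naive_alt)/;

# Longest first: a prefix can no longer shadow a longer keyword.
my $sorted_alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b } @keywords;
my ($longest) = "cats" =~ /($sorted_alt)/;

# If the rest of the pattern fails, the engine backtracks into later alternatives.
my ($anchored) = "cats" =~ /^(c|ca|cat)s$/;

print "naive=$naive longest=$longest anchored=$anchored\n";
```

quotemeta is there so that keywords containing regex metacharacters are matched literally.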

Grouping vs. alternation precedence#

| has the lowest precedence of any regex operator. It splits the whole pattern, or the innermost group that encloses it, into alternatives:

/ab|cd/;       # 'ab' OR 'cd'
/^ab|cd$/;     # '^ab' OR 'cd$' — probably not what you meant!
/^(ab|cd)$/;   # '^' + ('ab' or 'cd') + '$' — what you meant

To constrain alternation to part of a pattern, wrap it in a group. Non-capturing (?:…) is preferred unless you need the capture:

/house(?:cat|keeper)/;      # 'housecat' or 'housekeeper'
/house(cat|keeper)/;        # same, but $1 will be 'cat' or 'keeper'

The group creates a local scope for |. Outside the group | resumes its top-level role:

/^(?:foo|bar|baz)$|^xyz$/;   # ('foo'/'bar'/'baz') or 'xyz'
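To see the precedence trap in action, the following sketch runs a few sample strings (chosen here for illustration) against the ungrouped and the grouped form of the pattern.

```perl
use strict;
use warnings;

my $loose  = qr/^ab|cd$/;        # '^ab' OR 'cd$'
my $strict = qr/^(?:ab|cd)$/;    # the whole string is 'ab' or 'cd'

my %seen;
for my $s (qw(ab cd abxyz xyzcd)) {
    $seen{$s} = [ ($s =~ $loose ? 1 : 0), ($s =~ $strict ? 1 : 0) ];
    printf "%-6s loose=%d strict=%d\n", $s, @{ $seen{$s} };
}
```

The ungrouped pattern accepts 'abxyz' and 'xyzcd' because each anchor binds to only one alternative; the grouped pattern rejects both.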

Empty alternatives#

An empty alternative matches the empty string — a useful trick for “this or nothing”:

/house(cat|)/;          # 'housecat' or 'house'
/(19|20|)\d\d/;         # '19xx', '20xx', or just 'xx'

Modern style prefers (?:…)? over (?:…|); they are equivalent, but the ? form is clearer:

/house(?:cat)?/;        # same as house(cat|), no capture

Watch for the backtracking cost when an empty alternative is combined with a quantifier — the engine can re-explore the same position many times. See the performance chapter.

Alternation versus character classes#

Character classes are almost always what you want when alternating between single characters. /a|b|c/ and /[abc]/ match the same strings, but [abc] is faster, terser, and clearer:

/a|b|c/;      # works, but verbose
/[abc]/;      # use this

Alternation is for alternatives longer than one character (or ones that are themselves patterns). When each alternative is a single character, reach for a class.
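One related pitfall: inside a character class, | is an ordinary character, not an alternation operator. The short sketch below compares a class that accidentally contains | with the equivalent grouped alternation; the test strings are illustrative.

```perl
use strict;
use warnings;

my $class = qr/^[a|b]$/;      # class with three members: 'a', '|', 'b'
my $alt   = qr/^(?:a|b)$/;    # alternation: 'a' or 'b'

my %hits;
for my $s ('a', 'b', '|') {
    $hits{$s} = [ ($s =~ $class ? 1 : 0), ($s =~ $alt ? 1 : 0) ];
    print "'$s' class=$hits{$s}[0] alt=$hits{$s}[1]\n";
}
```

Only the class accepts a literal '|', so when converting /a|b/ to a class, write [ab] and drop the bar.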

Alternation and capturing#

Only one alternative inside a group can match at a time, so the group captures the matching alternative:

if ("bert" =~ /(cat|dog|bert|ernie)/) {
    print "matched $1\n";        # matched bert
}

Sibling groups outside the alternation retain their normal numbering:

/^(\w+):\s*(yes|no|maybe)$/;
# $1 = the key, $2 = the verdict

When each branch of an alternation has its own group, the groups are numbered left to right by opening paren, even across branches:

/(a)|(b)/;
# On match of 'a': $1 = 'a', $2 undef
# On match of 'b': $1 undef,  $2 = 'b'

Check with defined $n, not truth — an empty capture is different from an absent capture.
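The difference matters whenever a capture can legitimately be false, such as '0' or the empty string. A small sketch, with a hypothetical helper branch_of, shows both cases:

```perl
use strict;
use warnings;

# Hypothetical helper: reports which branch of /(0)|(x)/ matched.
# A truth test on $1 would misreport the input '0', because '0' is false.
sub branch_of {
    my ($input) = @_;
    return 'none' unless $input =~ /(0)|(x)/;
    return defined $1 ? 'first' : 'second';
}

# An empty alternative captures '', which is defined but false.
my ($empty)  = "house" =~ /house(cat|)/;

# An optional group that does not participate leaves its capture undef.
my ($absent) = "house" =~ /house(cat)?/;

print "0 -> ", branch_of('0'), "\n";
print "x -> ", branch_of('x'), "\n";
print "empty is ",  (defined $empty  ? "defined" : "undef"), "\n";
print "absent is ", (defined $absent ? "defined" : "undef"), "\n";
```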

Branch reset: (?|…)#

Parallel-capture patterns are the usual reason you pick up (?|…). Inside (?|…), every branch starts numbering its captures at the same slot. After the group, numbering resumes at one past the maximum across all branches.

# Without (?|…): need to know which branch matched.
if ($time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/) {
    my ($h, $m) = ($1, $2);
    ($h, $m) = ($3, $4) unless defined $h;
}

# With (?|…): $1 and $2 come from whichever branch matched.
if ($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))/) {
    my ($h, $m) = ($1, $2);
}

With a trailing fixed piece:

if ($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z]{3})/) {
    # $1 = hours, $2 = minutes, $3 = zone (numbered after the group)
    print "hour=$1 minute=$2 zone=$3\n";
}

Rules inside (?|…):

  • Each branch numbers its capturing groups starting from the same slot: one past the number of groups opened before the (?|…).

  • After the group, the outer numbering continues at one higher than the maximum count reached in any branch.

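  • Named groups keep their names, and a name may be repeated across branches. Names are aliases for numbered slots, though, so use the same names in the same order in every branch; otherwise a name can end up referring to whatever another branch captured in that slot.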

Branch reset is the cleanest way to express “parse X in one of several equivalent formats, then reach for the same variables afterwards.”
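As a sketch of that idea, the time pattern can be wrapped in a small function. The name parse_time, the added ^ and $ anchors, and the numeric normalisation are assumptions made for this illustration; (?|…) itself needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (hours, minutes, zone), or an empty list
# when the input is in neither 'H:MM ZZZ' / 'HH:MM ZZZ' nor 'HHMM ZZZ' form.
sub parse_time {
    my ($time) = @_;
    if ($time =~ /^(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z]{3})$/) {
        # $1 and $2 come from whichever branch matched; $3 follows the group.
        return ($1 + 0, $2 + 0, $3);
    }
    return;
}

for my $input ("9:30 UTC", "0930 CET", "12:05 PST", "930 UTC") {
    my @fields = parse_time($input);
    print "$input => ", (@fields ? "@fields" : "no match"), "\n";
}
```

The caller never needs to know which format was used; "930 UTC" fits neither branch and yields an empty list.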

Alternation in split#

split takes a regexp pattern, so alternation works there too:

my @words = split /\s+|-/, "one-two three  four-five";
# ('one', 'two', 'three', 'four', 'five')

If the separator pattern contains capturing groups, split includes the captured text in the output list — often surprising. Use (?:…) unless you want that:

split /(?:\s+|-)/, "a-b c";   # ('a', 'b', 'c')
split /(\s+|-)/,   "a-b c";   # ('a', '-', 'b', ' ', 'c')
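The capturing form is useful when the separators must be kept, for example to rebuild the original string after processing the fields. A minimal sketch, assuming a single capturing group in the separator and no separator at the very start of the string, so that fields sit at even indices and separators at odd ones:

```perl
use strict;
use warnings;

my $line = "one-two three  four-five";

# One capturing group: the list alternates field, separator, field, ...
my @parts = split /(\s+|-)/, $line;

# Joining every element back together reproduces the input exactly.
my $rebuilt = join '', @parts;

# The fields alone are the elements at even indices.
my @words = @parts[ grep { $_ % 2 == 0 } 0 .. $#parts ];

print "parts:   ", scalar(@parts), "\n";
print "rebuilt: $rebuilt\n";
print "words:   @words\n";
```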

Summary#

  • | separates alternatives; the leftmost alternative that matches at the current position wins.

  • Wrap alternations in (?:…) to localise them.

  • Prefer character classes over single-character alternations.

  • (?|…) resets capture numbering across branches — use it when the branches capture the same conceptual fields.

See also#

  • perlre — complete alternation semantics.

  • split — alternation as separator syntax.

  • The groups and captures chapter — interaction with capture numbering.