Alternation#

Alternation is the | operator. It picks between two or more sub-patterns at the same position.

"cats and dogs" =~ /cat|dog|bird/;    # matches 'cat'
"cats and dogs" =~ /dog|cat|bird/;    # matches 'cat'

The order of the alternatives does not change where the overall pattern matches. The engine still honours «earliest position wins» — both patterns above match at position 0 because that’s the earliest position where any alternative can match.
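
A quick way to check this is to look at the match offset. The sketch below uses Perl's built-in @- array (start offset of the last match) and $& (the matched text); the variable names $text and @results are only illustrative.

```perl
use strict;
use warnings;

my $text = "cats and dogs";
my @results;

for my $re (qr/cat|dog|bird/, qr/dog|cat|bird/) {
    if ($text =~ $re) {
        my $start   = $-[0];   # offset where the overall match begins
        my $matched = $&;      # the text that matched
        push @results, [ $start, $matched ];
        print "start=$start text=$matched\n";   # start=0 text=cat (both patterns)
    }
}
```

Both patterns report offset 0 because 'cat' is available there, whichever position it holds in the list.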

Leftmost alternative wins at a given position#

Within a single starting position, alternatives are tried left to right and the first one that succeeds is used:

"cats" =~ /c|ca|cat|cats/;     # matches 'c' — first alternative wins
"cats" =~ /cats|cat|ca|c/;     # matches 'cats' — first wins, longer

If one alternative is a prefix of another and you want the longer match, put the longer one first. The engine does not look past the first successful alternative at the current position unless a later part of the pattern fails and forces it to backtrack.

This is Traditional NFA behaviour, and it is what Perl, PCRE2, Python, and most modern engines implement. POSIX-conformant engines (notably some awk and grep implementations) follow the leftmost-longest rule instead; they would match cats regardless of alternative order. The cross-engine chapter has the comparison.

Friedl puts it succinctly:

Greedy alternation is non-greedy in a Traditional NFA.

The pattern tour|to|tournament against three tournaments won matches tour, not tournament. The first alternative succeeds and the engine commits to it; longer alternatives further along the list are never tried.
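
The same example as a runnable sketch, with a reordered variant for contrast. The capturing parentheses are added here only so the matched text can be read back in list context.

```perl
use strict;
use warnings;

my $text = "three tournaments won";

# Original order: 'tour' succeeds first at the 't' of 'tournaments'.
my ($first)  = $text =~ /(tour|to|tournament)/;

# Longest alternative listed first: now 'tournament' gets its chance.
my ($longer) = $text =~ /(tournament|tour|to)/;

print "$first\n";    # tour
print "$longer\n";   # tournament
```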

Implications for ordering#

When an alternation is part of a larger pattern, the order of its alternatives affects both correctness and performance:

  • Specificity first. When you want the longest of several prefixes, put the longest first. /web|website|websites/ matches web even on input websites; /websites|website|web/ matches websites.

  • Common case first. Alternation is tried left-to-right; if 90% of your inputs hit alternative 3, the engine wastes time on alternatives 1 and 2 every time. Reorder by frequency, as long as that does not move a prefix ahead of a longer alternative you need to win.

  • Sibling captures. When the alternatives capture, the order affects which $n is set — see Alternation and capturing below.
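
When the alternatives come from a word list, one common way to apply «specificity first» is to sort the list by descending length before joining it. This is a minimal sketch of that idiom; @words, $alternation, and $re are illustrative names, and quotemeta is used on the assumption that the words are meant as literals.

```perl
use strict;
use warnings;

my @words = qw(web website websites);

# Longest first, so a shorter word never shadows a longer one.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @words;

my $re = qr/($alternation)/;

my ($hit) = "websites" =~ $re;
print "$alternation\n";   # websites|website|web
print "$hit\n";           # websites
```

Sorting by length is a blunt tool: it guarantees the prefix ordering but ignores frequency, so it trades the «common case first» benefit for correctness.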

Combining-pieces formal rule#

perlre’s «Combining RE Pieces» gives the precise statement underlying «leftmost wins». For two pattern pieces S and T:

When S can match, it is a better match than when only T can match.

That is the formal version of the rule. «Better» means the engine prefers it. For two S matches, the same internal ordering applies (greediness rules within the alternative); likewise for T matches. Across alternatives, the existence of a successful S match excludes consideration of T.

This is why S|T cannot be reordered by the engine on its own: Perl does not search for the best alternative across the disjunction — it commits to S whenever S succeeds.
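
The preference for S holds only while the rest of the pattern can still succeed. The sketch below contrasts an alternation with nothing after it against the same alternation followed by a piece that S's match makes impossible; in the second case the engine backtracks into the alternation and uses T.

```perl
use strict;
use warnings;

# Nothing follows the group, so the first success ('c') stands.
my ($loose)  = "cats" =~ /(c|cat)/;

# After 'c', the literal 's' fails against 'a'; the engine backtracks
# into the alternation and tries 'cat', which lets 's' and '$' match.
my ($forced) = "cats" =~ /^(c|cat)s$/;

print "$loose\n";    # c
print "$forced\n";   # cat
```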

Grouping vs. alternation precedence#

| has very low precedence. It splits the pattern at the outermost level containing it:

/ab|cd/;       # 'ab' OR 'cd'
/^ab|cd$/;     # '^ab' OR 'cd$' — probably not what you meant!
/^(ab|cd)$/;   # '^' + ('ab' or 'cd') + '$' — what you meant

To constrain alternation to part of a pattern, wrap it in a group. Non-capturing (?:…) is preferred unless you need the capture:

/house(?:cat|keeper)/;      # 'housecat' or 'housekeeper'
/house(cat|keeper)/;        # same, but $1 will be 'cat' or 'keeper'

The group creates a local scope for |. Outside the group | resumes its top-level role:

/^(?:foo|bar|baz)$|^xyz$/;   # ('foo'/'bar'/'baz') or 'xyz'
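
To see the precedence pitfall in practice, the sketch below runs the ungrouped and grouped patterns over the same inputs. The input list is made up for illustration.

```perl
use strict;
use warnings;

my @inputs = ("ab", "abXYZ", "XYZcd", "cd");

# Ungrouped: '^ab' OR 'cd$', so each anchor guards only one branch.
my @loose  = grep { /^ab|cd$/ } @inputs;

# Grouped: both anchors apply to whichever branch matches.
my @strict = grep { /^(?:ab|cd)$/ } @inputs;

print "loose:  @loose\n";    # loose:  ab abXYZ XYZcd cd
print "strict: @strict\n";   # strict: ab cd
```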

Empty alternatives#

An empty alternative matches the empty string — a useful trick for «this or nothing»:

/house(cat|)/;          # 'housecat' or 'house'
/(19|20|)\d\d/;         # '19xx', '20xx', or just 'xx'

Modern style prefers (?:…)? over (?:…|); they are equivalent, but the ? form is clearer:

/house(?:cat)?/;        # same as house(cat|), no capture
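
The equivalence is about what text matches. If you keep the group capturing, the two forms differ in what the capture reports, as this small sketch shows: an empty alternative that matched yields an empty string, while an optional group that was skipped yields undef.

```perl
use strict;
use warnings;

my ($empty)  = "house" =~ /house(cat|)/;   # empty branch matched
my ($absent) = "house" =~ /house(cat)?/;   # group did not take part

print defined $empty  ? "empty: defined\n"  : "empty: undef\n";    # empty: defined
print defined $absent ? "absent: defined\n" : "absent: undef\n";   # absent: undef
```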

Watch for the backtracking cost when an empty alternative is combined with a quantifier — the engine can re-explore the same position many times. See the performance chapter on zero-length-match termination.

Cross-engine note#

Strict POSIX, lex, and most older awk implementations disallow empty alternatives: (this|that|) is a syntax error. Perl 5.42, PCRE2, Rust’s regex crate, Python re, and other modern engines accept them. If a pattern needs to be portable to older POSIX tools, write (this|that)? instead. It expresses the same idea without an empty alternative, and it avoids (?:…), which POSIX extended regular expressions do not support.

Alternation inside character classes#

Character classes are almost always what you want when alternating between single characters. /a|b|c/ and /[abc]/ match the same strings, but [abc] is faster, terser, and clearer:

/a|b|c/;      # works, but verbose
/[abc]/;      # use this

Alternation is for alternatives longer than one character (or ones that are themselves patterns). When each alternative is a single character, reach for a class. The engine treats a class as a single atomic choice; an alternation as N choices to be explored in order.

Common-prefix factoring#

A pattern like /this|that|then|those/ defeats the engine’s fixed-string-check optimisation: there is no literal prefix the engine can scan for cheaply. Refactoring to expose the common prefix turns the alternation into something the optimiser can work with:

/this|that|then|those/;       # no common prefix visible
/th(?:is|at|en|ose)/;         # common prefix 'th' exposed

The two patterns match the same strings, and the second can be materially faster on large inputs. The engine scans for th using Boyer-Moore, then runs the small alternation only at candidate positions.

The general technique:

  1. Find the longest literal prefix common to all alternatives.

  2. Lift it outside the alternation.

  3. Wrap the remainder in (?:…) so the alternation stays localised.

For lists of words with a few different prefixes, you can apply the rewrite recursively. For very long lists, see the performance chapter’s section on matching many strings — once the list is in the thousands, a specialised matcher beats any alternation.
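
The three steps can be sketched as a small helper. factor_prefix is a hypothetical function written for this illustration, not part of Perl or any module; it assumes a non-empty list of literal words and preserves their original order.

```perl
use strict;
use warnings;

# Hypothetical helper: lift the longest common literal prefix out of a word list.
sub factor_prefix {
    my @words  = @_;
    my $prefix = $words[0];
    for my $w (@words) {
        # Shorten the candidate until it is a prefix of this word.
        chop $prefix while length($prefix) && index($w, $prefix) != 0;
    }
    my @tails = map { quotemeta(substr($_, length($prefix))) } @words;
    return quotemeta($prefix) . '(?:' . join('|', @tails) . ')';
}

my $pattern = factor_prefix(qw(this that then those));
print "$pattern\n";   # th(?:is|at|en|ose)

# Sanity check: the factored pattern accepts exactly the original words.
my $re   = qr/\A$pattern\z/;
my @hits = grep { $_ =~ $re } qw(this that then those the thus);
print "@hits\n";      # this that then those
```

Whether the factored form is actually faster for a given list is worth measuring rather than assuming.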

Alternation and capturing#

Only one alternative inside a group can match at a time, so the group captures the matching alternative:

if ("bert" =~ /(cat|dog|bert|ernie)/) {
    print "matched $1\n";        # matched bert
}

Sibling groups outside the alternation retain their normal numbering:

/^(\w+):\s*(yes|no|maybe)$/;
# $1 = the key, $2 = the verdict

When the capturing groups sit inside the branches of an alternation, they are still numbered left to right by opening paren, even across branches:

/(a)|(b)/;
# On match of 'a': $1 = 'a', $2 undef
# On match of 'b': $1 undef,  $2 = 'b'

Check with defined $n, not truth — an empty capture is different from an absent capture.
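
A capture of "0" shows why the truth test is unsafe: it is defined but false. A minimal sketch, with made-up variable names:

```perl
use strict;
use warnings;

my ($by_defined, $by_truth);

if ("0" =~ /(\d)|(x)/) {
    # $1 holds "0": defined, but false in boolean context.
    $by_defined = defined $1 ? "digit" : "x";   # correct test
    $by_truth   = $1         ? "digit" : "x";   # wrong: picks the other branch
}

print "defined test: $by_defined\n";   # defined test: digit
print "truth test:   $by_truth\n";     # truth test:   x
```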

Branch reset: (?|…)#

Parallel-capture patterns are the usual reason you pick up (?|…). Inside (?|…), every branch starts numbering its captures at the same slot. After the group, numbering resumes at one past the maximum across all branches.

# Without (?|…): need to know which branch matched.
if ($time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/) {
    my ($h, $m) = ($1, $2);
    ($h, $m) = ($3, $4) unless defined $h;
}

# With (?|…): $1 and $2 come from whichever branch matched.
if ($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))/) {
    my ($h, $m) = ($1, $2);
}

With a trailing fixed piece:

if ($time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z]{3})/) {
    # $1 = hours, $2 = minutes, $3 = zone (numbered after the group)
    print "hour=$1 minute=$2 zone=$3\n";
}

Rules inside (?|…):

  • Each branch numbers its capturing groups independently, starting from the same number: one past the count of groups opened before the (?|…).

  • After the group, the outer numbering continues at one higher than the maximum count reached in any branch.

  • Named groups keep their names; you can repeat a name across branches. Use the same names in the same order in every branch, or surprises ensue (see the groups and captures chapter).

Branch reset is the cleanest way to express «parse X in one of several equivalent formats, then reach for the same variables afterwards.»
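
The numbering rules are easiest to see when the branches contain different numbers of groups. In this sketch the first branch has two groups and the second has one, so the group after (?|…) is $3 either way, and $2 is simply unset when the shorter branch matches.

```perl
use strict;
use warnings;

# A match in list context returns ($1, $2, $3).
my @caps = "cd" =~ /(?|(a)(b)|(c))(d)/;

print join(',', map { defined($_) ? $_ : 'undef' } @caps), "\n";   # c,undef,d
```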

Alternation in split#

split takes a regexp pattern, so alternation works there too:

my @words = split /\s+|-/, "one-two three  four-five";
# ('one', 'two', 'three', 'four', 'five')

If the separator pattern contains capturing groups, split includes the captured text in the output list — often surprising. Use (?:…) unless you want that:

split /(?:\s+|-)/, "a-b c";   # ('a', 'b', 'c')
split /(\s+|-)/,   "a-b c";   # ('a', '-', 'b', ' ', 'c')

The capturing form is occasionally what you want — preserving the exact separators between fields. Mostly you want non-capturing.
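
As an illustration of the separator-preserving case, the sketch below splits with a capturing group, pulls the fields back out by index, and rejoins the pieces to recover the original string. The sample string is the one used earlier in this section; it does not end in a separator, which keeps the field and separator positions strictly alternating.

```perl
use strict;
use warnings;

my $line   = "one-two three  four-five";
my @pieces = split /(\s+|-)/, $line;
# ('one', '-', 'two', ' ', 'three', '  ', 'four', '-', 'five')

# Even indices are fields; odd indices are the separators between them.
my @fields  = @pieces[ grep { $_ % 2 == 0 } 0 .. $#pieces ];
my $rebuilt = join '', @pieces;

print "@fields\n";                                          # one two three four five
print $rebuilt eq $line ? "round-trip ok\n" : "mismatch\n"; # round-trip ok
```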

Summary#

  • | separates alternatives; leftmost that matches at the current position wins.

  • Wrap alternations in (?:…) to localise them.

  • Prefer character classes over single-character alternations.

  • Lift common prefixes outside the alternation when you want the engine’s literal-scan optimisation to fire.

  • (?|…) resets capture numbering across branches — use it when the branches capture the same conceptual fields.

See also#

  • The groups and captures chapter: capture numbering inside (?|…), and the rest of the named-capture machinery.

  • The performance chapter — ordering by likelihood, common-prefix factoring, and matching many strings.

  • The cross-engine chapter — alternation syntax differences across engine families.

  • split — alternation as separator syntax.