Anchors and assertions#

Anchors and assertions match positions rather than characters. They test where in the string you are, not what is there. A plain character class consumes one character; an anchor consumes none.

They are called zero-width assertions for that reason — they have width zero, but they assert a property must hold.
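The difference between consuming and asserting can be observed directly by measuring what a successful match covers. A minimal sketch (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $s = "housekeeper";

# A character class consumes one character, so the match has length 1.
$s =~ /[a-z]/ or die "expected a match";
my $class_len = length $&;

# An anchor consumes nothing: the match succeeds but covers no characters.
$s =~ /^/ or die "expected a match";
my $anchor_len = length $&;

print "class: $class_len, anchor: $anchor_len\n";
```

Both matches succeed at offset 0; only the first one moves past a character.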

Start and end of string#

^ matches at the start of the string. $ matches at the end, or just before a final newline.

"housekeeper" =~ /keeper/;    # matches — 'keeper' appears somewhere
"housekeeper" =~ /^keeper/;   # does not match — not at start
"housekeeper" =~ /keeper$/;   # matches — at end
"housekeeper\n" =~ /keeper$/; # matches — before the trailing newline

Used together, ^…$ forces the pattern to account for the whole string:

"bert"    =~ /^bert$/;     # matches
"bertram" =~ /^bert$/;     # does not match
"dilbert" =~ /^bert$/;     # does not match
""        =~ /^$/;         # matches — empty string

For a literal string, $s eq "bert" is faster and clearer. ^…$ only earns its keep once the middle uses real regexp features.

Absolute versus line-relative anchors#

Under the /m modifier, ^ and $ anchor at every line boundary inside the string, not just the outer ends:

my $x = "There once was a girl\nWho programmed in Perl\n";

$x =~ /^Who/;    # does not match — 'Who' is not at string start
$x =~ /^Who/m;   # matches — 'Who' is at start of second line

For the absolute ends regardless of /m, use the dedicated anchors:

  • \A — absolute start of string.

  • \z — absolute end of string.

  • \Z — end of string, or just before a final newline. Similar to $ but unaffected by /m.

$x =~ /^Who/m;    # matches, as above
$x =~ /\AWho/m;   # does not match — \A is string start, always

$x =~ /girl$/m;   # matches, end of first line
$x =~ /girl\Z/m;  # does not match — 'girl' is not at string end
$x =~ /Perl\Z/m;  # matches — end or before final newline
$x =~ /Perl\z/m;  # does not match — \z is strict end

Rule of thumb: \A and \z when you mean the whole string; ^ and $ when you mean lines (usually with /m). \A and \z can also be faster than ^ and $ under /m: the engine knows they pin the match to a single position, and skips the bump-along retries at every line boundary that ^ permits under /m.
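The rule of thumb matters most in input validation, where $ quietly accepts a trailing newline. A small sketch, assuming two hypothetical validator functions (is_digits_loose and is_digits_strict are names invented for this example):

```perl
use strict;
use warnings;

# Hypothetical validators, for illustration only.
sub is_digits_loose  { return $_[0] =~ /^\d+$/   ? 1 : 0 }
sub is_digits_strict { return $_[0] =~ /\A\d+\z/ ? 1 : 0 }

my $input = "12345\n";    # e.g. a line read from a file, not yet chomped

my $loose  = is_digits_loose($input);     # $ tolerates the final newline
my $strict = is_digits_strict($input);    # \z insists on the true end

print "loose=$loose strict=$strict\n";
```

If the value is later written to a log or a database field, the loose version lets the newline through; the strict version rejects it.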

The four newline modes can be summarised in a table. Choose the one matching what ^, $, and . should mean for your input:

Mode      ^ matches at           $ matches at                          . matches \n?
default   start of string only   end of string (and before final \n)   no
/m        start of every line    end of every line                     no
/s        start of string only   end of string (and before final \n)   yes
/sm       start of every line    end of every line                     yes

\A, \Z, \z are unaffected by either modifier.
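The table can be checked empirically. The sketch below counts the positions where ^ matches and tests whether . crosses a newline; the sample string and variable names are made up for the example:

```perl
use strict;
use warnings;

my $text = "one\ntwo";

# Count the positions where ^ can match, without and with /m.
my $starts_default = () = $text =~ /^/g;
my $starts_m       = () = $text =~ /^/mg;

# Does . cross the newline? Only under /s.
my $dot_default = $text =~ /one.two/  ? 1 : 0;
my $dot_s       = $text =~ /one.two/s ? 1 : 0;

# /sm combines both behaviours in one pattern.
my $both = $text =~ /^one.two$/sm ? 1 : 0;

print "starts: $starts_default vs $starts_m; dot: $dot_default vs $dot_s; both: $both\n";
```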

Word boundaries#

\b matches a position between a word character (\w) and a non-word character (\W), or between a word character and the edge of the string (the edge counts as \W). It does not consume any character.

my $x = "Housecat catenates house and cat";

$x =~ /cat/;      # matches 'cat' inside 'Housecat'
$x =~ /\bcat/;    # matches 'cat' inside 'catenates'
$x =~ /cat\b/;    # matches 'cat' inside 'Housecat'
$x =~ /\bcat\b/;  # matches the standalone 'cat' at end

\B is the negation — a position that is not a word boundary.

\b outside […] is a word boundary. \b inside […] is the backspace character, \x08. The dual meaning is a common trap in the regex syntax. Always check \b’s context before reading what it does.
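Both meanings can appear in the same program. A short sketch, assuming an input string that contains a literal backspace (the string and variable names are invented for illustration):

```perl
use strict;
use warnings;

my $typed = "cat\x08r";    # 'cat', a backspace character, then 'r'

my $has_backspace = $typed =~ /[\b]/  ? 1 : 0;   # inside a class: literal \x08
my $boundary_hit  = $typed =~ /\bcat/ ? 1 : 0;   # outside a class: word boundary

# Remove each backspace together with the single character before it.
# Inside the negated class, \b again means the backspace character.
(my $cleaned = $typed) =~ s/[^\b][\b]//g;

print "$cleaned\n";
```

This single-pass substitution does not handle runs of consecutive backspaces; it is only meant to show the two readings of \b side by side.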

\b and \W-starting items — the $3.75 gotcha#

\b requires a transition between word and non-word characters. If both sides of the position are non-word characters, there is no boundary. This catches people building patterns by interpolation:

my $item = '$3.75';
my $regex = qr/\b\Q$item\E\b/;

'is $3.75 plus tax' =~ /$regex/;   # does NOT match (single quotes keep $3 literal)

Why? After \Q$item\E is interpolated and quoted, the pattern is \b\$3\.75\b. The leading \b needs a word character on exactly one side of the position just before $. The character before $ in the input is a space (non-word), and $ itself is also non-word. There is no word/non-word transition, so \b fails.

To anchor both ends of $item regardless of its first and last characters:

my $regex = qr/(?:^|\s)\Q$item\E(?:\s|$)/;

Use string-edge or whitespace anchors when the interpolated content might begin with a non-word character. \b is for contexts where you know the surrounded text is word-like.
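A zero-width variant of the same idea uses lookarounds, so the surrounding whitespace is tested but not included in the match. This is a sketch of one alternative, not the only way to write it; the sample strings are invented:

```perl
use strict;
use warnings;

my $item = '$3.75';
my $text = 'is $3.75 plus tax';    # single quotes: $3 is not interpolated

my $with_b  = qr/\b\Q$item\E\b/;            # the failing pattern from above
my $with_ws = qr/(?<!\S)\Q$item\E(?!\S)/;   # "not preceded / followed by non-space"

my $b_hit  = $text =~ $with_b  ? 1 : 0;
my $ws_hit = $text =~ $with_ws ? 1 : 0;
my $found  = $ws_hit ? $& : undef;          # exactly the item, no spaces

print "b=$b_hit ws=$ws_hit found=$found\n";
```

(?<!\S) succeeds at the start of the string or after whitespace, and (?!\S) succeeds at the end or before whitespace, so neither depends on what the item's own first or last character is.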

Unicode word-boundary variants#

Plain \b follows the \w/\W definition, which under Unicode covers letters, digits, marks, and connector punctuation. Real text — natural language with apostrophes, hyphens, numerics with commas — wants more nuance. Perl provides four semantic boundary variants:

Boundary   Meaning
\b{wb}     Unicode word boundary; keeps the apostrophe in don't inside the word
\b{sb}     sentence boundary
\b{lb}     line break (suitable for line wrapping)
\b{gcb}    grapheme-cluster boundary (same effect as \X)

"don't" =~ /.+?\b{wb}/x;   # matches whole word — apostrophe is inside
"don't" =~ /.+?\b/x;       # stops at the apostrophe — plain \b splits

For natural-language processing, \b{wb} and \b{sb} are almost always what you want. Plain \b is for ASCII identifier contexts.
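Capturing the text in front of the first boundary makes the difference visible. A minimal sketch, assuming Perl 5.22 or later (where \b{wb} is available):

```perl
use strict;
use warnings;
use 5.022;

my $word = "don't";

# Lazily take as little as possible up to the first boundary of each kind.
my ($plain) = $word =~ /^(.+?)\b/;
my ($wb)    = $word =~ /^(.+?)\b{wb}/;

print "plain: $plain\n";
print "wb:    $wb\n";
```

Plain \b sees a boundary between n and the apostrophe; \b{wb} does not, so the lazy quantifier has to run to the end of the word.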

\G — the last-match position#

\G anchors at the position where the previous successful /g match ended on the same string. It is what makes tokenisation with /g in while loops robust.

my $s = "12abc34";
while ($s =~ /\G(\d+|[a-z]+)/gc) {
    print "got '$1'\n";
}
# prints: '12', 'abc', '34'

Without \G, each iteration would still start where the previous match ended, but the engine would be free to bump along past characters that fit neither alternative, quietly losing data. The /c modifier (combined here as /gc) preserves the position on failure; it is covered in the modifiers chapter.
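Because /gc keeps the position after a failed attempt, several \G patterns can be tried in turn at the same offset, and the offset of the first unrecognised character can be reported. A sketch of that idea (tokenise and the token labels are hypothetical names chosen for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into NUM and WORD tokens and die
# with the offset of the first character that fits neither pattern.
sub tokenise {
    my ($s) = @_;
    my @tokens;
    pos($s) = 0;
    while (pos($s) < length $s) {
        if    ($s =~ /\G(\d+)/gc)    { push @tokens, [ NUM  => $1 ] }
        elsif ($s =~ /\G([a-z]+)/gc) { push @tokens, [ WORD => $1 ] }
        else  { die sprintf "unexpected character at offset %d\n", pos($s) }
    }
    return @tokens;
}

my @tokens = tokenise("12abc34");
print join(", ", map { "$_->[0]($_->[1])" } @tokens), "\n";
```

Without /c, the first failed attempt would reset pos($s), and the next alternative would start again from offset 0.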

Why \G matters: the synch-or-skip choice#

Friedl’s worked example. Find every five-digit ZIP code beginning with 44 in a run-together string:

my $s = "03824531449411615213441829503544272752010217443235";
while ($s =~ /44(\d{3})/g) {
    print "got 44$1\n";
}
# prints: got 44941, got 44182, got 44272, got 44323

Without \G, the engine’s bump-along finds 44182 and 44272 correctly, but it also matches 44941 and 44323, each of which straddles two adjacent ZIP codes (53144 94116 and 02174 43235). The output is “right answers plus extra wrong ones”.

With \G on every iteration, the engine cannot bump along on failure: each match must start exactly where the last one ended. The pattern therefore has to account for the ZIP codes it skips, five digits at a time, and a failure terminates the loop:

while ($s =~ /\G(?:(?!44)\d{5})*(44\d{3})/gc) {
    print "got $1\n";
}
# prints: got 44182, got 44272

\G effectively disables the bump-along, which is what synchronised tokenisation demands. The general rule: if the correctness of your loop depends on every match being adjacent to the last, you need \G.

\G reads the value pos($string) returns. Setting pos explicitly resets the anchor. On the first iteration of a /g loop (or any pattern not yet matched against this string), \G is equivalent to \A.
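The link between pos and \G can be seen by moving the anchor by hand. A minimal sketch with an invented sample string:

```perl
use strict;
use warnings;

my $s = "aaa-bbb";

# No /g match has run on $s yet, so \G behaves like \A here.
my $at_start = $s =~ /\Gbbb/ ? 1 : 0;    # 'bbb' is not at offset 0

# Move the anchor by assigning to pos(), then match from there.
pos($s) = 4;
my $at_four = $s =~ /\Gbbb/g ? 1 : 0;    # 'bbb' does start at offset 4
my $end_pos = pos($s);                   # where that match ended

print "at_start=$at_start at_four=$at_four end_pos=$end_pos\n";
```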

Lookahead#

Lookahead checks what comes next without consuming it. Positive lookahead is (?=…):

my $x = "I catch the housecat 'Tom-cat' with catnip";

$x =~ /cat(?=\s)/;   # matches 'cat' in 'housecat' — space follows
$x =~ /cat(?!\s)/;   # matches 'cat' in 'catch' — no space follows

(?!…) is negative lookahead: match only if the inner pattern does not apply. Zero-width, just like ^ and $.

"foobar" =~ /foo(?!bar)/;   # does not match
"foobaz" =~ /foo(?!bar)/;   # matches

Lookahead worked example: digits not followed by a period#

A natural-sounding question that is harder than it looks: “match runs of digits not followed by a period.”

The first try: \d+(?!\.). Apply it to OH 44272:

"OH 44272" =~ /\d+(?!\.)/;   # matches '44272'

That looks right, but only because nothing forced the engine to backtrack. Greedy \d+ first grabbed the entire digit run 44272; the next position is end-of-string (not .), so (?!\.) succeeded straight away. The intuitive reading “a run of digits that ends with something other than a period” happens to hold here; the next example shows what happens when the run is followed by a period.

Now apply it to 4423.45:

"4423.45" =~ /\d+(?!\.)/;   # matches '442', not '4423'

The engine’s first try: \d+ matches 4423. (?!\.) checks the next character: it is ., so the negative lookahead fails. The engine backtracks: \d+ matches 442, the next character is 3, not ., success. The pattern matches 442, not the full digit run before the period. The greedy quantifier was willing to back off to make the lookahead succeed.

The fix: require the lookahead to also reject other digits, so the run cannot back off into “digit followed by digit”.

"4423.45" =~ /\d+(?![\d.])/;   # matches '45' — correct

\d+(?![\d.]) reads as “match a run of digits not followed by either another digit or a period”. The greedy quantifier still expands maximally, but the lookahead now refuses to succeed in the middle of a digit run, so the only matching positions are the actual digit-run ends.

The lesson generalises: a lookahead that excludes only one character can be defeated by a greedy quantifier backing off. Always include the characters of the atom the quantifier matches in the lookahead’s negation.
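Running both patterns with /g over a longer string shows every place where the naive version backs off into the middle of a run. The sample text is invented for this sketch:

```perl
use strict;
use warnings;

my $text = "rev 4423.45 build 17, port 8080.";

# Naive: the greedy \d+ gives back one digit whenever a period follows.
my @naive = $text =~ /\d+(?!\.)/g;

# Fixed: the lookahead also rejects a following digit, so a match can
# only end at the true end of a digit run.
my @fixed = $text =~ /\d+(?![\d.])/g;

print "naive: @naive\n";
print "fixed: @fixed\n";
```

The naive pattern reports 442 and 808, fragments of runs that are in fact followed by a period; the fixed pattern reports only 45 and 17.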

A related case: leading negative lookahead and the bump-along.

"cattle" =~ /(?!cat)\w+/;   # matches, but the match is 'attle'

The lookahead (?!cat) rejects position 0 (where cat would match). The engine bumps along: position 1 starts with attle, where (?!cat) succeeds, and \w+ matches attle.

To force “first word that doesn’t begin with cat”, anchor the lookahead to a word boundary:

"cattle" =~ /\b(?!cat)\w+/;   # does not match

Now the lookahead is only tried at word boundaries. In cattle the only boundary where \w+ could begin is position 0, and (?!cat) rejects it.
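The same contrast over a list of words, as a small sketch with made-up input:

```perl
use strict;
use warnings;

my $text = "cattle catnip dog concat bobcat";

# Unanchored: the engine bumps along one character and matches the tails.
my @loose = $text =~ /(?!cat)\w+/g;

# Anchored: the lookahead is tested only where a word starts.
my @anchored = $text =~ /\b(?!cat)\w+/g;

print "loose:    @loose\n";
print "anchored: @anchored\n";
```

Note that concat and bobcat survive in both lists: the lookahead only inspects the text at the position where it is applied, not the rest of the word.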

Lookbehind#

Lookbehind checks what came before. Positive lookbehind is (?<=…):

"cats and dogs" =~ /(?<=and )dogs/;   # matches 'dogs' after 'and '

Negative lookbehind is (?<!…):

"prefix_foo suffix_foo" =~ /(?<!prefix_)foo/;
# matches 'foo' in 'suffix_foo'

Modern Perl (5.30 and later) supports variable-length lookbehind from 1 to 255 characters. Patterns like (?<=cat|kitten) are fine. Patterns where the lookbehind itself contains capturing groups produce an experimental warning (the captures’ contents are not fully defined when the lookbehind has variable length).

The 255-character limit applies to the longest text the lookbehind could match. Under /i, a few characters fold to multi-character sequences (ß matches ss), and the expanded length counts toward the limit: a lookbehind of 127 ß characters is fine, but a 128th makes the pattern fail to compile, because it could match 256 s characters.

A practical workaround for longer lookbehind: use \K (next section), which has no length limit.

\K — keep-left#

\K is the lookbehind alternative. Everything matched before \K is excluded from $&, but is required to have matched. Pragmatically: “match this prefix, but don’t include it in the output.”

"feed the cat" =~ /the \Kcat/;   # matches; $& is 'cat'

Equivalent to (?<=the )cat, but:

  • No length limit. \K works after any prefix.

  • Often substantially faster than (?<=…), because the engine does not need to look backward — it forgets the start of the match instead.

  • Especially useful in substitution: s/foo\Kbar/QUUX/ is cleaner than s/(foo)bar/$1QUUX/.

\K only resets where the reported match starts, which affects $& and $-[0]; capture groups before \K are still set as captured. The construct may appear inside other lookarounds, though the behaviour there is described as “currently not well defined”, so the conservative use is at the top level of a match or substitution.
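A short sketch of these points together: a prefix of unbounded length before \K, a capture group that survives it, and a substitution that replaces only the kept part. The sample line is invented:

```perl
use strict;
use warnings;

my $line = "price: 100 USD";

# 'price:' and any amount of whitespace must match, but only the digits
# after \K are reported as the match. The group before \K is still set.
$line =~ /(price):\s*\K\d+/ or die "no match";
my ($matched, $label, $start) = ($&, $1, $-[0]);

# In a substitution, only the text after \K is replaced.
(my $updated = $line) =~ s/price:\s*\K\d+/250/;

print "matched=$matched label=$label start=$start\n";
print "$updated\n";
```

The \s* prefix has no upper bound, which is the kind of prefix a lookbehind cannot express under the 255-character limit.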

Long-form lookaround aliases#

Each lookaround construct has a verb-style spelling. Long-form aliases read more clearly in patterns that already use (*VERB:…) for backtracking control:

Standard form   Verb form (short)   Verb form (long)
(?=…)           (*pla:…)            (*positive_lookahead:…)
(?!…)           (*nla:…)            (*negative_lookahead:…)
(?<=…)          (*plb:…)            (*positive_lookbehind:…)
(?<!…)          (*nlb:…)            (*negative_lookbehind:…)

The verb forms are exactly equivalent to the standard forms. They are accepted; they are not idiomatic in handwritten Perl. You will see them in autogenerated patterns and PCRE2 ports.
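The equivalence can be spot-checked by running all three spellings of one assertion against the same inputs. This sketch assumes Perl 5.28 or later, where the alphabetic forms were introduced; on the earliest of those releases they may emit an experimental warning:

```perl
use strict;
use warnings;

# Record, per input, whether all three spellings gave the same answer.
my %agree;
for my $s ("foobar", "foobaz", "foo") {
    my $symbolic = $s =~ /foo(?!bar)/                   ? 1 : 0;
    my $short    = $s =~ /foo(*nla:bar)/                ? 1 : 0;
    my $long     = $s =~ /foo(*negative_lookahead:bar)/ ? 1 : 0;
    $agree{$s} = ($symbolic == $short && $short == $long) ? 1 : 0;
    print "$s: symbolic=$symbolic short=$short long=$long\n";
}
```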

Combining lookaround with split#

Anchors and lookaround let split separate a string on invisible positions rather than visible characters:

my $str = "one two - --6-8";
my @toks = split / \s+            # whitespace
                 | (?<=\S) (?=-)  # non-space followed by '-'
                 | (?<=-)  (?=\S) # '-' followed by non-space
                 /x, $str;
# @toks = ("one", "two", "-", "-", "-", "6", "-", "8")

The second and third alternatives match between characters — no consumption. They are legal split separators because split is happy to split on a zero-width match.
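The same technique works with a separator made only of lookarounds. A minimal sketch that splits a camelCase identifier at each lower-to-upper transition (the identifier is an invented example):

```perl
use strict;
use warnings;

# Split between a lowercase letter and the uppercase letter that follows.
# Nothing is consumed, so no characters are lost from the fields.
my @parts = split /(?<=[a-z])(?=[A-Z])/, "parseHttpResponse";

print join(" | ", @parts), "\n";
```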

Two zero-width assertions juxtaposed are AND-ed#

A small but illuminating fact: when two lookarounds appear next to each other in a pattern, both must hold at the same position. This works as expected:

$x =~ /^(\D*)(?=\d)(?!123)/;
# Matches the leading non-digit run, but only if a digit follows
# AND that digit run does not begin with '123'.

Both (?=\d) and (?!123) apply at the same position; the engine treats their conjunction as the requirement. Without (?=\d), the negative lookahead (?!123) is satisfiable by any non-123 continuation — including end-of-string or a non-digit, neither of which is what the author meant.

This generalises Friedl’s framing: juxtaposition in a regex always means AND, except when written with |. /ab/ means “a AND (then) b”, just as /^$/ means “start AND end” — except the AND is across positions for ab and at one position for the lookarounds.
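The effect of dropping (?=\d) can be shown with a pair of inputs. In this sketch prefix_of is a hypothetical helper that returns the first capture or undef:

```perl
use strict;
use warnings;

my $both          = qr/^(\D*)(?=\d)(?!123)/;
my $only_negative = qr/^(\D*)(?!123)/;

# Hypothetical helper: the captured prefix, or undef if there is no match.
sub prefix_of {
    my ($re, $s) = @_;
    return $s =~ $re ? $1 : undef;
}

my $ok_case   = prefix_of($both,          "abc456");   # digits follow, not 123
my $rejected  = prefix_of($both,          "abc123");   # rejected outright
my $leaked    = prefix_of($only_negative, "abc123");   # \D* backs off one step
my $no_digits = prefix_of($only_negative, "abc");      # end of string satisfies (?!123)

print "ok_case=$ok_case leaked=$leaked no_digits=$no_digits\n";
```

With only the negative lookahead, \D* gives back the c, and (?!123) is then satisfied because the text ahead is c123. The positive lookahead closes that escape route by pinning the position to the start of the digits.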

Cross-engine: anchors and lookaround#

The cross-engine chapter has the full table; the relevant rows are extracted below.

Anchors and newline handling#

Concern                       Perl 5.42   PCRE2   Emacs                    POSIX BRE / ERE   RE2 / Go
^ start-of-string             yes         yes     start-of-buffer          yes               yes
^ after \n under multi-line   /m          (?m)    always (line-oriented)   not specified     (?m)
$ end-of-string               yes         yes     end-of-buffer            yes               yes
$ before final \n             yes         yes     yes (line-oriented)      varies            no
\A / \z                       yes         yes     \` / \'                  no                yes
\Z                            yes         yes     no                       no                no (use \z)
\G                            yes         yes     no                       no                no

Lookaround and atomic constructs#

Feature                      Perl 5.42    PCRE2   Emacs   POSIX   RE2 / Go
Lookahead (?=…) / (?!…)      yes          yes     NO      NO      NO
Fixed-width lookbehind       yes          yes     NO      NO      NO
Variable-length lookbehind   yes (exp.)   yes     NO      NO      NO
\K                           yes          yes     NO      NO      NO
Atomic group (?>…)           yes          yes     NO      NO      NO

Among the engines in these tables, only Perl and PCRE2 support lookaround; other Perl-derived engines outside the tables (Java, Python, .NET) do as well. Emacs, POSIX tools, and RE2 / Go simply do not have it. Patterns that rely on lookaround do not port across this divide.

Summary#

Anchor    Matches
^         start of string, or start of line under /m
$         end of string (or before a trailing \n), or end of line under /m
\A        absolute start of string
\z        absolute end of string
\Z        end of string or before trailing \n, ignores /m
\b        word boundary
\B        non-word-boundary
\b{wb}    Unicode word boundary
\b{sb}    sentence boundary
\b{lb}    line break
\b{gcb}   grapheme cluster boundary
\G        end of previous /g match (or \A if none)
(?=…)     positive lookahead
(?!…)     negative lookahead
(?<=…)    positive lookbehind
(?<!…)    negative lookbehind
\K        keep-left (forget the prefix from $&)

See also#

  • The modifiers chapter — /m, /s, /g, /c.

  • The unicode chapter — \b{wb}, \b{sb}, \b{lb}, \b{gcb} and what they mean.

  • The performance chapter — atomic groups (?>…), special backtracking control verbs.

  • The cross-engine chapter — full anchor and lookaround compatibility table.

  • split — splitting at zero-width positions.

  • pos — read or set the position \G anchors at.