Anchors and assertions#

Anchors and assertions match positions rather than characters. They test where in the string you are, not what is there. A plain character class consumes one character; an anchor consumes none.

They are called zero-width assertions for that reason: they occupy no characters in the match, yet they assert that some property holds at the current position.

Start and end of string#

^ matches at the start of the string. $ matches at the end, or just before a final newline.

"housekeeper" =~ /keeper/;    # matches - 'keeper' appears somewhere
"housekeeper" =~ /^keeper/;   # does not match - not at start
"housekeeper" =~ /keeper$/;   # matches - at end
"housekeeper\n" =~ /keeper$/; # matches - before the trailing newline

Used together, ^…$ forces the pattern to account for the whole string (apart from the single trailing newline that $ tolerates):

"bert"    =~ /^bert$/;     # matches
"bertram" =~ /^bert$/;     # does not match
"dilbert" =~ /^bert$/;     # does not match
""        =~ /^$/;         # matches - empty string

For a literal string, $s eq "bert" is faster and clearer. ^…$ only earns its keep once the middle uses real regexp features.

Absolute versus line-relative anchors#

Under the /m modifier, ^ and $ anchor at every line boundary inside the string, not just the outer ends:

my $x = "There once was a girl\nWho programmed in Perl\n";

$x =~ /^Who/;    # does not match - 'Who' is not at string start
$x =~ /^Who/m;   # matches - 'Who' is at start of second line

For the absolute ends regardless of /m, use the dedicated anchors:

\A - absolute start of string.
\z - absolute end of string.
\Z - end of string, or just before a final newline. Similar to $ but unaffected by /m.

$x =~ /^Who/m;    # matches, as above
$x =~ /\AWho/m;   # does not match - \A is string start, always

$x =~ /girl$/m;   # matches, end of first line
$x =~ /girl\Z/m;  # does not match - 'girl' is not at string end
$x =~ /Perl\Z/m;  # matches - end or before final newline
$x =~ /Perl\z/m;  # does not match - \z is strict end

Rule of thumb: \A and \z when you mean the whole string; ^ and $ when you mean lines (usually with /m). \A and \z also keep their meaning if someone later adds /m, and under /m they can be cheaper: ^ lets the engine retry at the start of every line, while \A allows only one starting position.
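The difference between the three end anchors is easiest to see on input that still carries its newline. The following is a minimal sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $input = "bert\n";   # e.g. a line read from a file and not yet chomped

my $dollar = $input =~ /^bert$/   ? 1 : 0;   # $ tolerates one trailing newline
my $big_z  = $input =~ /\Abert\Z/ ? 1 : 0;   # so does \Z
my $strict = $input =~ /\Abert\z/ ? 1 : 0;   # \z insists on the true end

print "dollar=$dollar big_z=$big_z strict=$strict\n";
# dollar=1 big_z=1 strict=0
```

When validating untrusted input, the \A…\z form is usually the safer default, because a trailing newline cannot slip through.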

The four newline modes can be summarised in a table. Choose the one matching what ^, $, and . should mean for your input:

Mode	`^` matches at	`$` matches at	`.` matches `\n`?
default	start of string only	end of string (and before final `\n`)	no
`/m`	start of every line	end of every line	no
`/s`	start of string only	end of string (and before final `\n`)	yes
`/sm`	start of every line	end of every line	yes

\A, \Z, \z are unaffected by either modifier.
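As a rough check of the table, the sketch below runs the same two-line string through each mode. The string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

my @plain = $text =~ /^(\w+)$/g;     # default: ^ and $ see only the outer ends
my @lines = $text =~ /^(\w+)$/mg;    # /m: ^ and $ match at every line
my $dot   = $text =~ /one.two/  ? 1 : 0;   # default: . refuses "\n"
my $dot_s = $text =~ /one.two/s ? 1 : 0;   # /s: . accepts "\n"

printf "plain=%d lines=%d dot=%d dot_s=%d\n",
    scalar @plain, scalar @lines, $dot, $dot_s;
# plain=0 lines=2 dot=0 dot_s=1
```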

Word boundaries#

\b matches a position between a word character (\w) and a non-word character (\W), or between either and the edge of the string. It does not consume any character.

my $x = "Housecat catenates house and cat";

$x =~ /cat/;      # matches 'cat' inside 'Housecat'
$x =~ /\bcat/;    # matches 'cat' inside 'catenates'
$x =~ /cat\b/;    # matches 'cat' inside 'Housecat'
$x =~ /\bcat\b/;  # matches the standalone 'cat' at end

\B is the negation - a position that is not a word boundary.

\b outside […] is a word boundary. \b inside […] is the backspace character, \x08. The dual meaning is the single most common trap in the regex syntax. Always read \b’s context before reading what it does.
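A short sketch of the two meanings side by side, using invented sample strings:

```perl
use strict;
use warnings;

my $typed = "cat\x08r";   # 'cat', a literal backspace character, then 'r'

# Inside a character class, \b is the backspace character.
my $has_backspace = $typed =~ /[\b]/ ? 1 : 0;

# Outside a class, \b is a zero-width word boundary; count them in "cat car".
my $boundaries = () = "cat car" =~ /\b/g;

print "has_backspace=$has_backspace boundaries=$boundaries\n";
# has_backspace=1 boundaries=4
```

The four boundaries are at offsets 0, 3, 4 and 7: the two edges of each word.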

`\b` and `\W`-starting items - the `$3.75` gotcha#

\b requires a transition between word and non-word characters. If both sides of the position are non-word characters, there is no boundary. This catches people building patterns by interpolation:

my $item = '$3.75';
my $regex = qr/\b\Q$item\E\b/;

# single quotes: in a double-quoted string, $3 would interpolate as a capture variable
'is $3.75 plus tax' =~ /$regex/;   # does NOT match

Why? After \Q$item\E cooks, the pattern is \b\$3\.75\b. The engine looks for a position immediately before $ where one side of the position is a word character. The character before $ in the input is a space (non-word), and $ itself is also non-word. There is no word/non-word transition; \b fails.

To anchor the start of $item regardless of its first character:

my $regex = qr/(?:^|\s)\Q$item\E(?:\s|$)/;

Use string-edge or whitespace anchors when the interpolated content might begin with a non-word character. \b is for contexts where you know the surrounded text is word-like.
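One possible variation, not taken from the text above, is to express the whitespace-or-edge test with lookarounds, so that the surrounding spaces are checked but not consumed. The sketch assumes "delimited by whitespace or a string edge" is the intended meaning.

```perl
use strict;
use warnings;

my $item = '$3.75';
my $text = 'is $3.75 plus tax';

# \b fails: the space before '$' and '$' itself are both non-word characters.
my $with_b = $text =~ /\b\Q$item\E\b/ ? 1 : 0;

# (?<!\S) = not preceded by a non-space; (?!\S) = not followed by one.
my ($found) = $text =~ /(?<!\S)(\Q$item\E)(?!\S)/;

print "with_b=$with_b found=$found\n";
# with_b=0 found=$3.75
```

Because both lookarounds are zero-width, the overall match is exactly the item, which matters if the pattern is later used in a substitution.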

Unicode word-boundary variants#

Plain \b follows the \w/\W definition, which under Unicode covers letters, digits, marks, and connector punctuation. Real text - natural language with apostrophes, hyphens, numerics with commas - wants more nuance. Perl provides four semantic boundary variants:

Boundary	Meaning
`\b{wb}`	Unicode word boundary - handles `don't`, `state-of-the-art`
`\b{sb}`	sentence boundary
`\b{lb}`	line break (suitable for line wrapping)
`\b{gcb}`	grapheme-cluster boundary (same effect as `\X`)

"don't" =~ /.+?\b{wb}/x;   # matches whole word - apostrophe is inside
"don't" =~ /.+?\b/x;       # stops at the apostrophe - plain \b splits

For natural-language processing, \b{wb} and \b{sb} are almost always what you want. Plain \b is for ASCII identifier contexts.
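To see the difference in what is actually matched, the sketch below captures the text up to the first boundary of each kind. It assumes Perl 5.22 or later, where \b{wb} is available.

```perl
use strict;
use warnings;
use 5.022;   # \b{wb} needs Perl 5.22 or later

my $word = "don't";

my ($plain)   = $word =~ /(.+?)\b/;       # first plain boundary: before the apostrophe
my ($unicode) = $word =~ /(.+?)\b{wb}/;   # first Unicode word boundary: end of the word

print "plain=$plain unicode=$unicode\n";
# plain=don unicode=don't
```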

`\G` - the last-match position#

\G anchors at the position where the previous successful /g match ended on the same string. It is what makes tokenisation with /g in while loops robust.

my $s = "12abc34";
while ($s =~ /\G(\d+|[a-z]+)/gc) {
    print "got '$1'\n";
}
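# prints three lines: got '12', got 'abc', got '34'

Without \G, a failed attempt at the current position lets the engine bump along to the next place the pattern fits, silently skipping any characters that match neither alternative - quietly losing data. The /c modifier (combined here as /gc) keeps pos() where it was when a match fails instead of resetting it; it is covered in the modifiers chapter.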
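A practical consequence is that a \G loop can report exactly where tokenisation stopped. The following sketch uses a made-up input with one character the tokeniser does not recognise:

```perl
use strict;
use warnings;

my $src = "12abc!34";   # '!' fits neither alternative
my @tokens;

while ($src =~ /\G(\d+|[a-z]+)/gc) {
    push @tokens, $1;
}

# /c left pos() at the point of failure instead of resetting it.
my $stopped_at = pos($src) // 0;
my $complete   = $stopped_at == length($src) ? 1 : 0;

print "tokens=@tokens stopped_at=$stopped_at complete=$complete\n";
# tokens=12 abc stopped_at=5 complete=0
```

A real lexer would typically raise an error mentioning offset 5 here rather than carrying on.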

Why `\G` matters: the synch-or-skip choice#

Friedl’s worked example. Find every five-digit ZIP code beginning with 44 in a run-together string:

my $s = "06192054410-44272-13901-44106-22134";
while ($s =~ /44(\d{3})/g) {
    print "got 44$1\n";
}

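Without \G, nothing ties a match to a code boundary: after a failed attempt the bump-along simply tries the next offset, so 44 followed by three digits is reported wherever it occurs, including in the middle of a code or straddling two adjacent ones. On this input the loop prints 44272 and 44106, but only because the stray 44 inside the leading run 06192054410 happens to be followed by two digits and a hyphen rather than three digits. On less fortunate data the output is “right answers plus extra wrong ones”.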

With \G on every iteration, the engine cannot bump-along on failure - it must resume exactly where the last match ended. A failure terminates the loop:

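while ($s =~ /\G\D*(44\d{3})/gc) {
    print "got $1\n";
}
# prints nothing for this $s: at offset 0 the digits are '06', not '44',
# and \G forbids skipping ahead to a more convenient starting point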

\G effectively disables the bump-along, which is what synchronised tokenisation demands. The general rule: if the correctness of your loop depends on every match being adjacent to the last, you need \G.
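For input where the codes really are run together with no separators, a synchronised pattern has to consume the unwanted codes explicitly, five digits at a time. The sketch below follows that idea; the input string (three codes: 10144, 27213, 44106) is hypothetical and chosen so that an unsynchronised search produces a false hit.

```perl
use strict;
use warnings;

# Hypothetical input: 10144 27213 44106 with no separators.
my $codes = "101442721344106";

# Unsynchronised: a match may start at any offset.
my @loose = $codes =~ /(44\d{3})/g;

# Synchronised: skip whole non-44 codes, then capture an aligned 44xxx code.
pos($codes) = 0;
my @strict;
while ($codes =~ /\G(?:(?!44)\d{5})*(44\d{3})/g) {
    push @strict, $1;
}

print "loose:  @loose\n";    # loose:  44272 44106
print "strict: @strict\n";   # strict: 44106
```

The false hit 44272 comes from the last two digits of 10144 and the first three of 27213; the \G version never sees it because every attempt starts on a five-digit boundary.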

\G reads the value pos($string) returns. Setting pos explicitly resets the anchor. On the first iteration of a /g loop (or any pattern not yet matched against this string), \G is equivalent to \A.
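A small sketch of both points, using invented strings: assigning to pos moves the anchor, and on a string with no previous match \G sits at offset 0.

```perl
use strict;
use warnings;

my $s = "aaa bbb ccc";
pos($s) = 4;                                   # next /g match must start at offset 4

my $second = $s =~ /\G(\w+)/g ? $1 : undef;    # scalar-context /g match
my $after  = pos($s);                          # where that match ended

# No previous match on this string, so \G behaves like \A and 'bbb' is not found.
my $fresh      = "aaa bbb";
my $unanchored = $fresh =~ /\Gbbb/g ? 1 : 0;

print "second=$second after=$after unanchored=$unanchored\n";
# second=bbb after=7 unanchored=0
```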

Lookahead#

Lookahead checks what comes next without consuming it. Positive lookahead is (?=…):

my $x = "I catch the housecat 'Tom-cat' with catnip";

$x =~ /cat(?=\s)/;   # matches 'cat' in 'housecat' - space follows
$x =~ /cat(?!\s)/;   # matches 'cat' in 'catch' - no space follows

(?!…) is negative lookahead: match only if the inner pattern does not apply. Zero-width, just like ^ and $.

"foobar" =~ /foo(?!bar)/;   # does not match
"foobaz" =~ /foo(?!bar)/;   # matches

Lookahead worked example: digits not followed by a period#

A natural-sounding question that is harder than it looks: “match runs of digits not followed by a period.”

The first try: \d+(?!\.). Apply it to OH 44272:

"OH 44272" =~ /\d+(?!\.)/;   # matches '44272'

That is the right answer, but it was an easy case. Greedy \d+ grabbed the entire digit run 44272; the next position is end-of-string (not .), so (?!\.) succeeded straight away and the engine never needed to back off. The trouble appears when the digit run is followed by a period.

Now apply it to 4423.45:

"4423.45" =~ /\d+(?!\.)/;   # matches '442', not '4423'

The engine’s first try: \d+ matches 4423. (?!\.) checks the next character: it is ., the negative lookahead fails. The engine backtracks: \d+ matches 442, the next character is 3, not ., success. The pattern matches 442 - not the full digit run before the period. The greedy quantifier was willing to back off to make the lookahead succeed.

The fix: require the lookahead to also reject other digits, so the run cannot back off into “digit followed by digit”.

"4423.45" =~ /\d+(?![\d.])/;   # matches '45' - correct

\d+(?![\d.]) reads as “match a run of digits not followed by either another digit or a period”. The greedy quantifier still expands maximally, but the lookahead now refuses to succeed in the middle of a digit run, so the only matching positions are the actual digit-run ends.

The lesson generalises: a lookahead that excludes only one character can be defeated by a greedy quantifier backing off. Always include the characters of the atom the quantifier matches in the lookahead’s negation.
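The three behaviours can be compared directly. The last pattern in this sketch is an alternative not discussed above: a possessive quantifier (\d++, available since Perl 5.10) refuses to give back digits, which has a similar effect on this input.

```perl
use strict;
use warnings;

my $text = "4423.45";

my ($naive)      = $text =~ /(\d+(?!\.))/;      # greedy run backs off one digit
my ($fixed)      = $text =~ /(\d+(?![\d.]))/;   # lookahead also rejects digits
my ($possessive) = $text =~ /(\d++(?!\.))/;     # run is never allowed to back off

print "naive=$naive fixed=$fixed possessive=$possessive\n";
# naive=442 fixed=45 possessive=45
```

The possessive form still lets the bump-along try later starting offsets inside the run, but each of those attempts also ends just before the period and fails, so the first success is 45.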

A related case: leading negative lookahead and the bump-along.

"cattle" =~ /(?!cat)\w+/;   # MATCHES - but the match is 'attle'

The lookahead (?!cat) rejects position 0 (where cat would match). The engine bumps along: position 1 starts with attle, where (?!cat) succeeds, and \w+ matches attle.

To force “first word that doesn’t begin with cat”, anchor the lookahead to a word boundary:

"cattle" =~ /\b(?!cat)\w+/;   # does not match

Now the lookahead is tested only at word boundaries. In cattle those are position 0, where (?!cat) fails, and the end of the string, where \w+ has nothing left to match.
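On a string with several words, the anchored form behaves as the phrase "first word that doesn't begin with cat" suggests. The sample text below is made up for illustration.

```perl
use strict;
use warnings;

my $text = "cattle catnip dog";

my ($loose)    = $text =~ /((?!cat)\w+)/;     # bump-along slips inside 'cattle'
my ($anchored) = $text =~ /\b((?!cat)\w+)/;   # only word starts are candidates

print "loose=$loose anchored=$anchored\n";
# loose=attle anchored=dog
```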

Lookbehind#

Lookbehind checks what came before. Positive lookbehind is (?<=…):

"cats and dogs" =~ /(?<=and )dogs/;   # matches 'dogs' after 'and '

Negative lookbehind is (?<!…):

"prefix_foo suffix_foo" =~ /(?<!prefix_)foo/;
# matches 'foo' in 'suffix_foo'
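Lookbehind and lookahead can also be combined to describe a position with no characters of its own. A well-known illustration is inserting thousands separators; the sketch below is one way to write it, and the helper name commify is purely illustrative.

```perl
use strict;
use warnings;

# Insert a comma at every position that has a digit on its left and
# a whole number of three-digit groups (and then no more digits) on its right.
sub commify {
    my ($digits) = @_;
    $digits =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;
    return $digits;
}

print commify("1234567"), "\n";   # 1,234,567
print commify("1000"), "\n";      # 1,000
print commify("999"), "\n";       # 999
```

Nothing is consumed by the pattern, so the substitution only ever inserts text.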

Modern Perl supports variable-length lookbehind from 1 to 255 characters. Patterns like (?<=cat|kitten) are fine. Patterns where the lookbehind itself contains capturing groups produce an experimental warning (the captures’ contents are not fully-defined when the lookbehind has variable length).

Under default semantics a lookbehind may match at most 255 characters. Under /i, a few characters fold to multi-character sequences (ß to ss), and it is the folded length that counts toward the limit: a lookbehind of 127 ß characters is fine (254 after folding), but 128 is not (256).

A practical workaround for longer lookbehind: use \K (next section), which has no length limit.

`\K` - keep-left#

\K is the lookbehind alternative. Everything matched before \K is excluded from $&, but is required to have matched. Pragmatically: “match this prefix, but don’t include it in the output.”

"feed the cat" =~ /the \Kcat/;   # matches; $& is 'cat'

Equivalent to (?<=the )cat, but:

No length limit. \K works after any prefix.
Often substantially faster than (?<=…), because the engine does not need to look backward - it forgets the start of the match instead.
Especially useful in substitution: s/foo\Kbar/QUUX/ is cleaner than s/(foo)bar/$1QUUX/.

\K only resets the start of the overall match, which is what $& and $-[0] report; capture groups before \K are still set as captured. The construct may appear inside other lookarounds, though the behaviour there is described as “currently not well defined” - the conservative use is at the top level of a match or substitution.
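The substitution case from the list above can be sketched as follows. The /r modifier (Perl 5.14 or later) is used only to return the result without modifying the input; the sample strings are illustrative.

```perl
use strict;
use warnings;

my $input = "foobar foobaz";

my $with_k   = $input =~ s/foo\Kbar/QUUX/r;         # prefix kept automatically
my $with_cap = $input =~ s/(foo)bar/${1}QUUX/r;     # prefix captured and re-inserted

# In a plain match, $& holds only what follows \K.
my $matched = "feed the cat" =~ /the \Kcat/ ? $& : '';

print "with_k=$with_k\nwith_cap=$with_cap\nmatched=$matched\n";
# with_k=fooQUUX foobaz
# with_cap=fooQUUX foobaz
# matched=cat
```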

Long-form lookaround aliases#

Each lookaround construct has a verb-style spelling. Long-form aliases read more clearly in patterns that already use (*VERB:…) for backtracking control:

Standard form	Verb form (short)	Verb form (long)
`(?=…)`	`(*pla:…)`	`(*positive_lookahead:…)`
`(?!…)`	`(*nla:…)`	`(*negative_lookahead:…)`
`(?<=…)`	`(*plb:…)`	`(*positive_lookbehind:…)`
`(?<!…)`	`(*nlb:…)`	`(*negative_lookbehind:…)`

The verb forms are exactly equivalent to the standard forms. They are accepted; they are not idiomatic in handwritten Perl. You will see them in autogenerated patterns and PCRE2 ports.

Combining lookaround with split#

Anchors and lookaround let split separate a string on invisible positions rather than visible characters:

my $str = "one two - --6-8";
my @toks = split / \s+            # whitespace
                 | (?<=\S) (?=-)  # non-space followed by '-'
                 | (?<=-)  (?=\S) # '-' followed by non-space
                 /x, $str;
# @toks = ("one", "two", "-", "-", "-", "6", "-", "8")

The second and third alternatives match between characters - no consumption. They are legal split separators because split is happy to split on a zero-width match.

Two zero-width assertions juxtaposed are AND-ed#

A small but illuminating fact: when two lookarounds appear next to each other in a pattern, both must hold at the same position. This works as expected:

$x =~ /^(\D*)(?=\d)(?!123)/;
# Matches the leading non-digit run, but only if a digit follows
# AND that digit run does not begin with '123'.

Both (?=\d) and (?!123) apply at the same position; the engine treats their conjunction as the requirement. Without (?=\d), the negative lookahead (?!123) is satisfiable by any non-123 continuation - including end-of-string or a non-digit, neither of which is what the author meant.

This generalises Friedl’s framing: juxtaposition in a regex always means AND, except when written with |. /ab/ means “a AND (then) b”, just as /^$/ means “start AND end” - except the AND is across positions for ab and at one position for the lookarounds.
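The pattern above can be exercised on a few made-up inputs. The last line of the sketch drops (?=\d) to show how the negative lookahead alone can be satisfied by backing off.

```perl
use strict;
use warnings;

my $both = qr/^(\D*)(?=\d)(?!123)/;

my ($ok)     = "abc456" =~ $both;   # digit follows, and it is not 123
my ($is_123) = "abc123" =~ $both;   # digit follows, but it is 123: no match
my ($no_dig) = "abc"    =~ $both;   # no digit follows at all: no match

# Without (?=\d), \D* gives back 'c' so that (?!123) sees 'c123' and succeeds.
my ($loose) = "abc123" =~ /^(\D*)(?!123)/;

printf "ok=%s is_123=%s no_dig=%s loose=%s\n",
    $ok, $is_123 // 'undef', $no_dig // 'undef', $loose;
# ok=abc is_123=undef no_dig=undef loose=ab
```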

Cross-engine: anchors and lookaround#

The cross-engine chapter has the full table; the relevant rows are extracted here.

Anchors and newline handling#

Concern	Perl 5.42	PCRE2	Emacs	POSIX BRE / ERE	RE2 / Go
`^` start-of-string	yes	yes	start-of-buffer	yes	yes
`^` after `\n` under multi-line	`/m`	`(?m)`	always (line-oriented)	not specified	`(?m)`
`$` end-of-string	yes	yes	end-of-buffer	yes	yes
`$` before final `\n`	yes	yes	no	varies	no (`$` behaves like `\z`)
`\A` / `\z`	yes	yes	`` \` `` / `\'`	no	yes
`\Z`	yes	yes	no	no	no (use `\z`)
`\G`	yes	yes	no	no	no

Lookaround and atomic constructs#

Feature	Perl 5.42	PCRE2	Emacs	POSIX	RE2 / Go
Lookahead `(?=…)` / `(?!…)`	yes	yes	NO	NO	NO
Fixed-width lookbehind	yes	yes	NO	NO	NO
Variable-length lookbehind	yes (exp.)	yes	NO	NO	NO
`\K`	yes	yes	NO	NO	NO
Atomic group `(?>…)`	yes	yes	NO	NO	NO

Among the engines in this table, only Perl and PCRE2 support lookaround; other Perl-derived engines outside the table (Java, Python, .NET) support it too. POSIX tools, Emacs, and RE2 / Go simply do not have it. Patterns that rely on lookaround do not port across this divide.

Summary#

Anchor	Matches
`^`	start of string, or start of line under `/m`
`$`	end of string or before a trailing `\n`; end of every line under `/m`
`\A`	absolute start of string
`\z`	absolute end of string
`\Z`	end of string or before trailing `\n`, ignores `/m`
`\b`	word boundary
`\B`	non-word-boundary
`\b{wb}`	Unicode word boundary
`\b{sb}`	sentence boundary
`\b{lb}`	line break
`\b{gcb}`	grapheme cluster boundary
`\G`	end of previous `/g` match (or `\A` if none)
`(?=…)`	positive lookahead
`(?!…)`	negative lookahead
`(?<=…)`	positive lookbehind
`(?<!…)`	negative lookbehind
`\K`	keep-left (forget the prefix from `$&`)