Anchors and assertions#
Anchors and assertions match positions rather than characters. They test where in the string you are, not what is there. A plain character class consumes one character; an anchor consumes none.
They are called zero-width assertions for that reason — they have width zero, but they assert a property must hold.
Start and end of string#
^ matches at the start of the string. $ matches at the end, or just before a final newline.
"housekeeper" =~ /keeper/; # matches — 'keeper' appears somewhere
"housekeeper" =~ /^keeper/; # does not match — not at start
"housekeeper" =~ /keeper$/; # matches — at end
"housekeeper\n" =~ /keeper$/; # matches — before the trailing newline
Used together, ^…$ forces the pattern to account for the whole string:
"bert" =~ /^bert$/; # matches
"bertram" =~ /^bert$/; # does not match
"dilbert" =~ /^bert$/; # does not match
"" =~ /^$/; # matches — empty string
For a literal string, $s eq "bert" is faster and clearer, and unlike /^bert$/ it does not also accept "bert\n". ^…$ only earns its keep once the middle uses real regexp features.
Absolute versus line-relative anchors#
Under the /m modifier, ^ and $ anchor at every line boundary inside the string, not just the outer ends:
my $x = "There once was a girl\nWho programmed in Perl\n";
$x =~ /^Who/; # does not match — 'Who' is not at string start
$x =~ /^Who/m; # matches — 'Who' is at start of second line
For the absolute ends regardless of /m, use the dedicated anchors:
- \A: absolute start of string.
- \z: absolute end of string.
- \Z: end of string, or just before a final newline. Similar to $ but unaffected by /m.
$x =~ /^Who/m; # matches, as above
$x =~ /\AWho/m; # does not match — \A is string start, always
$x =~ /girl$/m; # matches, end of first line
$x =~ /girl\Z/m; # does not match — 'girl' is not at string end
$x =~ /Perl\Z/m; # matches — end or before final newline
$x =~ /Perl\z/m; # does not match — \z is strict end
Rule of thumb: \A and \z when you mean the whole string; ^ and $ when you mean lines (usually with /m). Under /m, \A can also be cheaper than ^: it can only match at position 0, so the engine skips the bump-along retries at later line starts that ^ permits.
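The difference shows up most clearly in input validation. A minimal sketch, assuming two made-up helpers (loose_ok and strict_ok are hypothetical names, not part of any library) that both try to accept exactly the string "bert":

```perl
use strict;
use warnings;

# Hypothetical helpers: both try to accept exactly the string "bert".
sub loose_ok  { my ($s) = @_; return $s =~ /^bert$/   ? 1 : 0 }  # $ tolerates a final newline
sub strict_ok { my ($s) = @_; return $s =~ /\Abert\z/ ? 1 : 0 }  # \z does not

for my $input ("bert", "bert\n", "bertram") {
    (my $shown = $input) =~ s/\n/\\n/g;    # make the newline visible
    printf "%-8s loose=%d strict=%d\n", $shown, loose_ok($input), strict_ok($input);
}
```

Both helpers accept "bert" and reject "bertram"; only loose_ok accepts "bert\n".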
The four newline modes can be summarised in a table. Choose the one matching what ^, $, and . should mean for your input:
Mode | ^ matches | $ matches | . matches \n |
|---|---|---|---|
default | start of string only | end of string (and before final \n) | no |
/m | start of every line | end of every line | no |
/s | start of string only | end of string (and before final \n) | yes |
/ms | start of every line | end of every line | yes |
\A, \Z, \z are unaffected by either modifier.
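The effect of /m and /s can be observed directly. The following is only a sketch; line_initial_words and dot_reach are hypothetical helper names chosen for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Words that sit at a position where ^ can match.
sub line_initial_words {
    my ($text, $multiline) = @_;
    return $multiline ? ($text =~ /^(\w+)/mg) : ($text =~ /^(\w+)/g);
}

# How many characters a leading .* reaches.
sub dot_reach {
    my ($text, $dotall) = @_;
    my ($span) = $dotall ? ($text =~ /\A(.*)/s) : ($text =~ /\A(.*)/);
    return length $span;
}

my $text = "one\ntwo\nthree";
print "default: ", join(",", line_initial_words($text, 0)), "\n";
print "/m:      ", join(",", line_initial_words($text, 1)), "\n";
print ". reaches ", dot_reach($text, 0), " chars by default, ",
      dot_reach($text, 1), " with /s\n";
```

Without /m only "one" is found at a ^ position; with /m all three words are. The leading .* stops after 3 characters by default and covers all 13 under /s.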
Word boundaries#
\b matches a position between a word character (\w) and a non-word character (\W), or between either and the edge of the string. It does not consume any character.
my $x = "Housecat catenates house and cat";
$x =~ /cat/; # matches 'cat' inside 'Housecat'
$x =~ /\bcat/; # matches 'cat' inside 'catenates'
$x =~ /cat\b/; # matches 'cat' inside 'Housecat'
$x =~ /\bcat\b/; # matches the standalone 'cat' at end
\B is the negation — a position that is not a word boundary.
\b outside […] is a word boundary. \b inside […] is the backspace character, \x08. The dual meaning is a classic trap in the regex syntax. Always check \b’s context before reading what it does.
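Both meanings of \b can be exercised side by side. A small sketch; count_whole_word and has_backspace are hypothetical helpers written for this example.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Count occurrences of $word that stand alone as a whole word.
sub count_whole_word {
    my ($text, $word) = @_;
    my $count = () = $text =~ /\b\Q$word\E\b/g;
    return $count;
}

# True if the string contains a backspace character (\x08).
sub has_backspace {
    my ($text) = @_;
    return $text =~ /[\b]/ ? 1 : 0;
}

print count_whole_word("Housecat catenates house and cat", "cat"), "\n";
print has_backspace("ab\x08c"), has_backspace("ab c"), "\n";
```

The first line printed is 1 (only the standalone 'cat' counts); the second is 10, since only the first string contains a backspace.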
\b and \W-starting items — the $3.75 gotcha#
\b requires a transition between word and non-word characters. If both sides of the position are non-word characters, there is no boundary. This catches people building patterns by interpolation:
my $item = '$3.75';
my $regex = qr/\b\Q$item\E\b/;
'is $3.75 plus tax' =~ /$regex/; # does NOT match (single quotes keep $3 from interpolating)
Why? Once \Q$item\E has been interpolated and quoted, the pattern is \b\$3\.75\b. The leading \b needs a position immediately before $ that has a word character on exactly one side. The character before $ in the input is a space (non-word), and $ itself is also non-word. There is no word/non-word transition, so \b fails.
To anchor the start of $item regardless of its first character:
my $regex = qr/(?:^|\s)\Q$item\E(?:\s|$)/;
Use string-edge or whitespace anchors when the interpolated content might begin with a non-word character. \b is for contexts where you know the surrounded text is word-like.
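A variant of the same idea uses lookarounds, so that the surrounding whitespace is tested but not consumed. This is a sketch under that assumption, not the only way to write it; contains_item is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: find $item as a whitespace-delimited token in $text.
sub contains_item {
    my ($text, $item) = @_;
    # (?<!\S): not preceded by a non-space (start of string also qualifies).
    # (?!\S):  not followed by a non-space (end of string also qualifies).
    return $text =~ /(?<!\S)\Q$item\E(?!\S)/ ? 1 : 0;
}

my $item = '$3.75';
my $text = 'is $3.75 plus tax';    # single quotes: no interpolation of $3

print "with \\b:          ", ($text =~ /\b\Q$item\E\b/ ? "match" : "no match"), "\n";
print "with lookarounds: ", (contains_item($text, $item) ? "match" : "no match"), "\n";
```

The \b version reports no match, the lookaround version reports a match. Because the lookarounds are zero-width, $& contains only the item itself, which matters when the pattern is used in a substitution.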
Unicode word-boundary variants#
Plain \b follows the \w/\W definition, which under Unicode covers letters, digits, marks, and connector punctuation. Real text — natural language with apostrophes, hyphens, numerics with commas — wants more nuance. Perl provides four semantic boundary variants:
Boundary | Meaning |
|---|---|
\b{wb} | Unicode word boundary; keeps an in-word apostrophe (as in don't) inside the word |
\b{sb} | sentence boundary |
\b{lb} | line break (suitable for line wrapping) |
\b{gcb} | grapheme-cluster boundary (same effect as \X) |
"don't" =~ /.+?\b{wb}/x; # matches whole word — apostrophe is inside
"don't" =~ /.+?\b/x; # stops at the apostrophe — plain \b splits
For natural-language processing, \b{wb} and \b{sb} are almost always what you want. Plain \b is for ASCII identifier contexts.
\G — the last-match position#
\G anchors at the position where the previous successful /g match ended on the same string. It is what makes tokenisation with /g in while loops robust.
my $s = "12abc34";
while ($s =~ /\G(\d+|[a-z]+)/gc) {
print "got '$1'\n";
}
# prints: got '12', got 'abc', got '34' (one per line)
Without \G, each iteration would still start where the previous match ended, but the engine would be free to bump along past characters that fit neither alternative, quietly losing data. The /gc modifier preserves the position on failure; it is covered in the modifiers chapter.
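The same idea scales to a small lexer that refuses to skip input. A minimal sketch, assuming a hypothetical tokenise helper and two made-up token types (NUM and WORD):

```perl
use strict;
use warnings;

# Hypothetical tokeniser: returns a list of [type, text] pairs, or dies
# at the first character that fits no token rule.
sub tokenise {
    my ($s) = @_;
    my @tokens;
    pos($s) = 0;
    while (pos($s) < length($s)) {
        if    ($s =~ /\G(\d+)/gc)    { push @tokens, [ NUM  => $1 ] }
        elsif ($s =~ /\G([a-z]+)/gc) { push @tokens, [ WORD => $1 ] }
        elsif ($s =~ /\G\s+/gc)      { }    # skip whitespace
        else  { die sprintf "unexpected character at offset %d\n", pos($s) }
    }
    return @tokens;
}

print join(" ", map { "$_->[0]($_->[1])" } tokenise("12abc 34")), "\n";
```

This prints NUM(12) WORD(abc) NUM(34). Because every rule starts with \G and uses /gc, a failed rule leaves pos untouched for the next rule, and an input such as "12-ab" is reported at offset 2 instead of being silently skipped.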
Why \G matters: the synch-or-skip choice#
A worked example after Friedl. Find every five-digit ZIP code beginning with 44 in a run-together string of ZIP codes:
my $s = "061920544144272139014410622134"; # 06192 05441 44272 13901 44106 22134
while ($s =~ /44(\d{3})/g) {
    print "got 44$1\n";
}
# prints: got 44144, got 44106
Without \G, the bump-along is free to start a match anywhere, including in the middle of a ZIP code. Here it finds 44144, which straddles 05441 and 44272, and in consuming those digits it skips the real 44272. Only 44106 is a correct hit.
With \G on every iteration, the engine cannot bump along on failure: it must resume exactly where the last match ended. The pattern below steps over non-44 ZIP codes five digits at a time, so every match stays aligned, and a failure terminates the loop:
while ($s =~ /\G(?:(?!44)\d{5})*(44\d{3})/gc) {
    print "got $1\n";
}
# prints: got 44272, got 44106
\G effectively disables the bump-along, which is what synchronised tokenisation demands. The general rule: if the correctness of your loop depends on every match being adjacent to the last, you need \G.
\G reads the value pos($string) returns. Setting pos explicitly resets the anchor. On the first iteration of a /g loop (or any pattern not yet matched against this string), \G is equivalent to \A.
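Setting pos by hand turns \G into a "match exactly here" test. A small sketch; digits_at is a hypothetical helper, and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the run of digits starting exactly at $offset,
# or undef if the character there is not a digit.
sub digits_at {
    my ($s, $offset) = @_;
    pos($s) = $offset;                 # \G will anchor here
    return $s =~ /\G(\d+)/g ? $1 : undef;
}

for my $offset (0, 2, 7) {
    my $run = digits_at("ab123cd45", $offset);
    printf "offset %d: %s\n", $offset, defined($run) ? $run : "(no digits here)";
}
```

Offset 0 yields no digits, offset 2 yields 123, and offset 7 yields 45. Without \G the pattern at offset 0 would have bumped along and returned 123.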
Lookahead#
Lookahead checks what comes next without consuming it. Positive lookahead is (?=…):
my $x = "I catch the housecat 'Tom-cat' with catnip";
$x =~ /cat(?=\s)/; # matches 'cat' in 'housecat' — space follows
$x =~ /cat(?!\s)/; # matches 'cat' in 'catch' — no space follows
(?!…) is negative lookahead: match only if the inner pattern does not apply. Zero-width, just like ^ and $.
"foobar" =~ /foo(?!bar)/; # does not match
"foobaz" =~ /foo(?!bar)/; # matches
Lookahead worked example: digits not followed by a period#
A natural-sounding question that is harder than it looks: “match runs of digits not followed by a period.”
The first try: \d+(?!\.). Apply it to OH 44272:
"OH 44272" =~ /\d+(?!\.)/; # MATCHES, matches '44272'
That is the right answer, but only by luck. Greedy \d+ first grabbed the entire digit run 44272; the next position is end-of-string (not .), so (?!\.) succeeded straight away without the engine ever needing to back off. The intuitive reading, “the run of digits ends with something other than a period”, happens to hold here; the next example shows what happens when it does not.
Now apply it to 4423.45:
"4423.45" =~ /\d+(?!\.)/; # matches '442', not '4423'
The engine’s first try: \d+ matches 4423. (?!\.) checks the next character: it is ., the negative lookahead fails. The engine backtracks: \d+ matches 442, the next character is 3, not ., success. The pattern matches 442 — not the full digit run before the period. The greedy quantifier was willing to back off to make the lookahead succeed.
The fix: require the lookahead to also reject other digits, so the run cannot back off into “digit followed by digit”.
"4423.45" =~ /\d+(?![\d.])/; # matches '45' — correct
\d+(?![\d.]) reads as “match a run of digits not followed by either another digit or a period”. The greedy quantifier still expands maximally, but the lookahead now refuses to succeed in the middle of a digit run, so the only matching positions are the actual digit-run ends.
The lesson generalises: a lookahead that excludes only one character can be defeated by a greedy quantifier backing off. Always include the characters of the atom the quantifier matches in the lookahead’s negation.
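The rule can be checked against a string with several digit runs. The sketch below also shows one alternative, a possessive quantifier, which is a different technique from the character-class fix above; both helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: all digit runs that are not followed by a period,
# using the lookahead that also excludes digits.
sub runs_not_before_period {
    my ($s) = @_;
    return $s =~ /(\d+)(?![\d.])/g;
}

# Same intent using a possessive quantifier: \d++ never gives digits back,
# so the lookahead only has to exclude the period.
sub runs_possessive {
    my ($s) = @_;
    return $s =~ /(\d++)(?!\.)/g;
}

my $sample = "4423.45 and 17. then 99";
print join(",", runs_not_before_period($sample)), "\n";
print join(",", runs_possessive($sample)), "\n";
```

Both lines print 45,99. The runs 4423 and 17 are rejected whole, with no partial matches such as 442 or 1.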
A related case: leading negative lookahead and the bump-along.
"cattle" =~ /(?!cat)\w+/; # MATCHES, but $& is 'attle', not 'cattle'
The lookahead (?!cat) rejects position 0 (where cat would match). The engine bumps along: position 1 starts with attle, where (?!cat) succeeds, and \w+ matches attle.
To force “first word that doesn’t begin with cat”, anchor the lookahead to a word boundary:
"cattle" =~ /\b(?!cat)\w+/; # does not match
Now the lookahead applies at every word boundary, which in cattle is only position 0.
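With /g, the anchored form picks out every word that does not start with cat. A brief sketch; words_not_starting_with_cat is a hypothetical helper and the sample sentence is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: every word that does not begin with "cat".
sub words_not_starting_with_cat {
    my ($s) = @_;
    return $s =~ /\b(?!cat)\w+/g;
}

print join(" ", words_not_starting_with_cat("cattle dog catnip concat bird")), "\n";
```

This prints dog concat bird. Note that concat is kept: the lookahead is tested only at the start of each word, so a cat later in the word does not matter.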
Lookbehind#
Lookbehind checks what came before. Positive lookbehind is (?<=…):
"cats and dogs" =~ /(?<=and )dogs/; # matches 'dogs' after 'and '
Negative lookbehind is (?<!…):
"prefix_foo suffix_foo" =~ /(?<!prefix_)foo/;
# matches 'foo' in 'suffix_foo'
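Lookbehind and lookahead combine naturally when the goal is to insert text at a position. A minimal sketch for plain integers only (a decimal fraction would also receive commas); commify is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: add thousands separators to a plain integer.
sub commify {
    my ($n) = @_;
    # Insert a comma at every position that has a digit before it and,
    # after it, a multiple of three digits followed by a non-digit or the end.
    $n =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;
    return $n;
}

print commify("1234567"), "\n";
print commify("-12345"), "\n";
```

This prints 1,234,567 and -12,345. The pattern itself matches the empty string; both assertions only decide where that empty match is allowed.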
Since Perl 5.30, lookbehind may be variable-length, from 1 to 255 characters. Patterns like (?<=cat|kitten) are fine. Patterns where the lookbehind itself contains capturing groups produce an experimental warning (the captures’ contents are not fully defined when the lookbehind has variable length).
Under default semantics the lookbehind’s maximum length is 255 characters. Under /i, a few characters fold to multi-character sequences (ß to ss), and the expanded length counts toward the 255 limit: a lookbehind of 127 ß characters is fine (254 after folding), but one of 128 is not (256).
A practical workaround for longer lookbehind: use \K (next section), which has no length limit.
\K — keep-left#
\K is the lookbehind alternative. Everything matched before \K is excluded from $&, but is required to have matched. Pragmatically: “match this prefix, but don’t include it in the output.”
"feed the cat" =~ /the \Kcat/; # matches; $& is 'cat'
Equivalent to (?<=the )cat, but:
- No length limit. \K works after any prefix.
- Often substantially faster than (?<=…), because the engine does not need to look backward; it forgets the start of the match instead.
- Especially useful in substitution: s/foo\Kbar/QUUX/ is cleaner than s/(foo)bar/$1QUUX/.
\K only resets where $& starts (and with it $-[0]); capture groups before \K are still set as captured. The construct may appear inside other lookarounds, though perlre describes the behaviour there as “currently not well defined”; the conservative use is at the top level of a match or substitution.
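A typical substitution use is replacing a value while leaving its key alone. The following is a sketch; redact_password is a hypothetical helper and the log line is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: mask the value after "password=" and keep the key.
sub redact_password {
    my ($line) = @_;
    # "password=" must match, but \K drops it from the part being replaced.
    (my $out = $line) =~ s/password=\K\S+/****/g;
    return $out;
}

print redact_password("user=bob password=hunter2 ok"), "\n";

# Groups before \K stay set even though $& no longer includes them.
if ("feed the cat" =~ /(the) \Kcat/) {
    print "matched '$&' with group 1 '$1'\n";
}
```

The first line printed is user=bob password=**** ok; the second is matched 'cat' with group 1 'the'.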
Long-form lookaround aliases#
Each lookaround construct has a verb-style spelling. Long-form aliases read more clearly in patterns that already use (*VERB:…) for backtracking control:
Standard form | Verb form (short) | Verb form (long) |
|---|---|---|
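(?=…) | (*pla:…) | (*positive_lookahead:…) |
(?!…) | (*nla:…) | (*negative_lookahead:…) |
(?<=…) | (*plb:…) | (*positive_lookbehind:…) |
(?<!…) | (*nlb:…) | (*negative_lookbehind:…) |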
The verb forms are exactly equivalent to the standard forms. They are accepted; they are not idiomatic in handwritten Perl. You will see them in autogenerated patterns and PCRE2 ports.
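The equivalence is easy to confirm for negative lookahead, whose verb spellings are (*nla:…) and (*negative_lookahead:…). A sketch assuming Perl 5.28 or later, where these spellings were introduced (5.28 and 5.30 mark them experimental and may warn); the three helper names are hypothetical.

```perl
use strict;
use warnings;
use 5.028;    # the alphabetic assertion spellings need Perl 5.28 or later

# Hypothetical helpers: the same negative lookahead in three spellings.
sub std_form  { my ($s) = @_; return $s =~ /foo(?!bar)/                   ? 1 : 0 }
sub verb_form { my ($s) = @_; return $s =~ /foo(*nla:bar)/                ? 1 : 0 }
sub long_form { my ($s) = @_; return $s =~ /foo(*negative_lookahead:bar)/ ? 1 : 0 }

for my $input ("foobar", "foobaz") {
    printf "%s: %d %d %d\n", $input,
        std_form($input), verb_form($input), long_form($input);
}
```

All three spellings reject foobar and accept foobaz.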
Combining lookaround with split#
Anchors and lookaround let split separate a string on invisible positions rather than visible characters:
my $str = "one two - --6-8";
my @toks = split / \s+ # whitespace
| (?<=\S) (?=-) # non-space followed by '-'
| (?<=-) (?=\S) # '-' followed by non-space
/x, $str;
# @toks = ("one", "two", "-", "-", "-", "6", "-", "8")
The second and third alternatives match between characters — no consumption. They are legal split separators because split is happy to split on a zero-width match.
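The same technique handles identifiers. A short sketch that splits a camelCase name at each lower-to-upper transition; split_camel is a hypothetical helper, and the rule it uses is a simplification that ignores runs of capitals.

```perl
use strict;
use warnings;

# Hypothetical helper: split at every position that has a lowercase letter
# or digit before it and an uppercase letter after it.
sub split_camel {
    my ($name) = @_;
    return split /(?<=[a-z0-9])(?=[A-Z])/, $name;
}

print join(" ", split_camel("parseHttpResponseCode")), "\n";
```

This prints parse Http Response Code. No character is consumed by the separator, so nothing is lost from the pieces.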
Two zero-width assertions juxtaposed are AND-ed#
A small but illuminating fact: when two lookarounds appear next to each other in a pattern, both must hold at the same position. This works as expected:
$x =~ /^(\D*)(?=\d)(?!123)/;
# Matches the leading non-digit run, but only if a digit follows
# AND that digit run does not begin with '123'.
Both (?=\d) and (?!123) apply at the same position; the engine treats their conjunction as the requirement. Without (?=\d), the negative lookahead (?!123) is satisfiable by any non-123 continuation — including end-of-string or a non-digit, neither of which is what the author meant.
This generalises Friedl’s framing: juxtaposition in a regex always means AND, except when written with |. /ab/ means “a AND (then) b”, just as /^$/ means “start AND end” — except the AND is across positions for ab and at one position for the lookarounds.
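The conjunction can be checked directly with the pattern above. In this sketch, prefix_before_number uses both lookaheads and loose_prefix drops (?=\d); both names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: both lookaheads must hold at the same position.
sub prefix_before_number {
    my ($s) = @_;
    return $s =~ /^(\D*)(?=\d)(?!123)/ ? $1 : undef;
}

# The same pattern without (?=\d), for comparison.
sub loose_prefix {
    my ($s) = @_;
    return $s =~ /^(\D*)(?!123)/ ? $1 : undef;
}

for my $input ("abc456", "abc123", "abc") {
    my ($strict, $loose) = (prefix_before_number($input), loose_prefix($input));
    printf "%s: strict=%s loose=%s\n", $input,
        defined($strict) ? "'$strict'" : "no match",
        defined($loose)  ? "'$loose'"  : "no match";
}
```

For abc456 both return 'abc'. For abc123 the strict form fails, while the loose form backs \D* off by one character and returns 'ab'. For abc the strict form fails and the loose form returns 'abc', because end-of-string satisfies (?!123).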
Cross-engine: anchors and lookaround#
The cross-engine chapter has the full table; the relevant rows are extracted below.
Anchors and newline handling#
Concern | Perl 5.42 | PCRE2 | Emacs | POSIX BRE / ERE | RE2 / Go |
|---|---|---|---|---|---|
| yes | yes | start-of-buffer | yes | yes |
|
|
| always (line-oriented) | not specified |
|
| yes | yes | end-of-buffer | yes | yes |
| yes | yes | no | varies | yes |
| yes | yes |
| no | yes |
| yes | yes | no | no | no (use |
| yes | yes | no | no | no |
Lookaround and atomic constructs#
Feature | Perl 5.42 | PCRE2 | Emacs | POSIX | RE2 / Go |
|---|---|---|---|---|---|
Lookahead | yes | yes | NO | NO | NO |
Fixed-width lookbehind | yes | yes | NO | NO | NO |
Variable-length lookbehind | yes (exp.) | yes | NO | NO | NO |
\K | yes | yes | NO | NO | NO |
Atomic group | yes | yes | NO | NO | NO |
Effectively, only Perl, PCRE2, and other Perl-derived engines outside this table (Java, Python, .NET) support lookaround. Emacs, POSIX tools, and RE2 / Go simply do not have it. Patterns that rely on lookaround do not port across this divide.
Summary#
Anchor | Matches |
|---|---|
^ | start of string, or start of line under /m |
$ | end of string (before trailing \n), or end of line under /m |
\A | absolute start of string |
\z | absolute end of string |
\Z | end of string or before trailing \n |
\b | word boundary |
\B | non-word-boundary |
\b{wb} | Unicode word boundary |
\b{sb} | sentence boundary |
\b{lb} | line break |
\b{gcb} | grapheme cluster boundary |
\G | end of previous /g match |
(?=…) | positive lookahead |
(?!…) | negative lookahead |
(?<=…) | positive lookbehind |
(?<!…) | negative lookbehind |
\K | keep-left (forget the prefix from $&) |
See also#
- The modifiers chapter: /m, /s, /g, /c.
- The unicode chapter: \b{wb}, \b{sb}, \b{lb}, \b{gcb} and what they mean.
- The performance chapter: atomic groups (?>…), special backtracking control verbs.
- The cross-engine chapter: full anchor and lookaround compatibility table.
- split: splitting at zero-width positions.
- pos: read or set the position \G anchors at.