Anchors and assertions#
Anchors and assertions match positions rather than characters. They test where in the string you are, not what is there. A plain character class consumes one character; an anchor consumes none.
That is why they are called zero-width assertions: they take up no characters, but they assert that a property holds at the current position.
Start and end of string#
^ matches at the start of the string. $ matches at the end, or
just before a final newline.
"housekeeper" =~ /keeper/; # matches — 'keeper' appears somewhere
"housekeeper" =~ /^keeper/; # does not match — not at start
"housekeeper" =~ /keeper$/; # matches — at end
"housekeeper\n" =~ /keeper$/; # matches — before the trailing newline
Used together, ^…$ forces the pattern to account for the whole
string:
"bert" =~ /^bert$/; # matches
"bertram" =~ /^bert$/; # does not match
"dilbert" =~ /^bert$/; # does not match
"" =~ /^$/; # matches — empty string
For a literal string, $s eq "bert" is faster and clearer. ^…$
only earns its keep once the middle uses real regexp features.
Absolute versus line-relative anchors#
Under the /m modifier, ^ and $ anchor at every line boundary
inside the string, not just the outer ends:
my $x = "There once was a girl\nWho programmed in Perl\n";
$x =~ /^Who/; # does not match — 'Who' is not at string start
$x =~ /^Who/m; # matches — 'Who' is at start of second line
For the absolute ends regardless of /m, use the dedicated
anchors:
- \A: absolute start of string.
- \z: absolute end of string.
- \Z: end of string, or just before a final newline. Similar to $ but unaffected by /m.
$x =~ /^Who/m; # matches, as above
$x =~ /\AWho/m; # does not match — \A is string start, always
$x =~ /girl$/m; # matches, end of first line
$x =~ /girl\Z/m; # does not match — 'girl' is not at string end
$x =~ /Perl\Z/m; # matches — end or before final newline
$x =~ /Perl\z/m; # does not match — \z is strict end
Rule of thumb: \A and \z when you mean the whole string; ^ and
$ when you mean lines (usually with /m).
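The difference between $ and \z matters most when validating input. The following is a minimal sketch, assuming a hypothetical username value that still carries its trailing newline; the variable names are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical input: a username read from a file, newline still attached.
my $input = "alice\n";

# ^...$ accepts it, because $ may match just before a final newline.
my $loose  = $input =~ /^[a-z]+$/   ? 1 : 0;

# \A...\z rejects it, because \z matches only at the very end of the string.
my $strict = $input =~ /\A[a-z]+\z/ ? 1 : 0;

print "loose=$loose strict=$strict\n";   # prints: loose=1 strict=0
```

If the trailing newline is expected, chomp the input first and keep the strict anchors.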
Word boundaries#
\b matches a position between a word character (\w) and a
non-word character (\W), or between a word character and the edge
of the string. It does not consume any character.
my $x = "Housecat catenates house and cat";
$x =~ /cat/; # matches 'cat' inside 'Housecat'
$x =~ /\bcat/; # matches 'cat' inside 'catenates'
$x =~ /cat\b/; # matches 'cat' inside 'Housecat'
$x =~ /\bcat\b/; # matches the standalone 'cat' at end
\B is the negation — a position that is not a word boundary.
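To see how \b and \B partition the possible positions, the sketch below counts matches of 'cat' in a sample sentence. The sentence and variable names are made up for illustration:

```perl
use strict;
use warnings;

my $text = "Housecat catenates house and cat; concatenate the cat";

# A /g match in list context returns every match; assigning the list to ()
# and then to a scalar yields the number of matches.
my $any   = () = $text =~ /cat/g;       # every occurrence of 'cat'
my $whole = () = $text =~ /\bcat\b/g;   # 'cat' as a standalone word
my $inner = () = $text =~ /\Bcat\B/g;   # 'cat' with word characters on both sides

print "any=$any whole=$whole inner=$inner\n";   # prints: any=5 whole=2 inner=1
```

The only \Bcat\B match is the one inside 'concatenate'; 'Housecat' and 'catenates' each have a boundary on one side.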
For natural-language splitting that handles apostrophes, hyphens,
and other subtleties, use \b{wb} (available from Perl 5.22):
"don't" =~ /.+?\b{wb}/x; # matches the whole string
\G — the last-match position#
\G anchors at the position where the previous successful /g
match ended on the same string. It is what makes tokenisation with
/g in while loops robust.
my $s = "12abc34";
while ($s =~ /\G(\d+|[a-z]+)/gc) {
    print "got '$1'\n";
}
# prints "got '12'", "got 'abc'", "got '34'", one per line
Without \G, each match would still start searching where the
previous one ended, but it would be free to slide forward past
characters that did not fit, quietly losing data. The c in /gc
preserves the position when a match fails; it is covered in the
modifiers chapter.
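Because \G refuses to skip ahead and /c keeps the position after a failure, a tokeniser can report exactly where it got stuck. Here is a minimal sketch, assuming a hypothetical helper named tokenise that accepts only digit runs and lowercase-letter runs:

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into digit and lowercase-letter runs,
# refusing any input that contains anything else.
sub tokenise {
    my ($s) = @_;
    my @tokens;
    pos($s) = 0;                        # start matching at the beginning
    while ($s =~ /\G(\d+|[a-z]+)/gc) {
        push @tokens, $1;
    }
    # /c left pos() where the last successful match ended. If that is not
    # the end of the string, something unmatchable sits at that offset.
    if (pos($s) < length $s) {
        die sprintf "unexpected character at offset %d\n", pos($s);
    }
    return @tokens;
}

my @ok  = tokenise("12abc34");
my $err = eval { tokenise("12 abc"); 1 } ? '' : $@;

print "@ok\n";   # prints: 12 abc 34
print $err;      # prints: unexpected character at offset 2
```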
Lookahead#
Lookahead checks what comes next without consuming it. Positive
lookahead is (?=…):
my $x = "I catch the housecat 'Tom-cat' with catnip";
$x =~ /cat(?=\s)/; # matches 'cat' in 'housecat' — space follows
$x =~ /cat(?!\s)/; # matches 'cat' in 'catch' — no space follows
(?!…) is negative lookahead: match only if the inner pattern does
not apply. Zero-width, just like ^ and $.
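Since a lookahead consumes nothing, several of them can be stacked at the same position, each checking an independent condition. This sketch assumes a made-up rule (at least 8 characters, containing a digit and a lowercase letter); the rule and the candidate strings are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical rule. Both lookaheads are tested at the start of the string;
# only .{8,} actually consumes characters.
my $rule = qr/\A (?=.*\d) (?=.*[a-z]) .{8,} \z/xs;

my %verdict;
for my $candidate ("abcdefgh", "abc12345", "ABC12345", "a1") {
    $verdict{$candidate} = $candidate =~ $rule ? "ok" : "rejected";
    printf "%-10s %s\n", $candidate, $verdict{$candidate};
}
# Only "abc12345" is reported as ok.
```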
Worked example: match foo only when bar does not immediately
follow it:
"foobar" =~ /foo(?!bar)/; # does not match
"foobaz" =~ /foo(?!bar)/; # matches
Lookbehind#
Lookbehind checks what came before. Positive lookbehind is (?<=…):
"cats and dogs" =~ /(?<=and )dogs/; # matches 'dogs' after 'and '
Negative lookbehind is (?<!…):
"prefix_foo suffix_foo" =~ /(?<!prefix_)foo/;
# matches 'foo' in 'suffix_foo'
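Historically, lookbehind had to be fixed-width ((?<=ab|cd) was fine;
(?<=a|bb) and (?<=a+) were not). Perl 5.30 introduced variable-width
lookbehind, initially as an experimental feature, provided the pattern
can match at most 255 characters: (?<=a|bb) is now accepted, while the
unbounded (?<=a+) is still rejected. Fixed-width lookbehind is still
faster to match.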
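Lookbehind is handy for extracting a value by what precedes it without including that prefix in the match. The following is a small sketch with an invented price string; both lookbehinds are fixed-width, so it does not rely on newer Perl versions:

```perl
use strict;
use warnings;

my $text = 'cost $15 or 20 EUR, down from $40';

# Digits immediately preceded by a dollar sign; the sign is not captured.
my @dollars = $text =~ /(?<=\$)\d+/g;

# Digit runs NOT preceded by a dollar sign. Excluding a preceding digit as
# well stops the match from starting in the middle of '15' or '40'.
my @others = $text =~ /(?<![\$\d])\d+/g;

print "dollars: @dollars\n";   # prints: dollars: 15 40
print "others: @others\n";     # prints: others: 20
```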
Combining lookaround with split#
Anchors and lookaround let split separate a string on invisible
positions rather than visible characters:
my $str = "one two - --6-8";
my @toks = split / \s+              # whitespace
                 | (?<=\S) (?=-)    # non-space followed by '-'
                 | (?<=-)  (?=\S)   # '-' followed by non-space
                 /x, $str;
# @toks = ("one", "two", "-", "-", "-", "6", "-", "8")
The second and third alternatives match between characters — no consumption. They are legal split separators because split is happy to split on a zero-width match.
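The same idea works with a separator made of nothing but lookaround. As a minimal sketch, a camel-case identifier (invented here) can be split at each lowercase-to-uppercase transition:

```perl
use strict;
use warnings;

my $name = "parseHttpResponse";   # hypothetical identifier

# Split at every position with a lowercase letter behind and an uppercase
# letter ahead. No characters are removed, because the separator is empty.
my @words = split /(?<=[a-z])(?=[A-Z])/, $name;

print join("|", @words), "\n";   # prints: parse|Http|Response
```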
Summary#
| Anchor | Matches |
|---|---|
| ^ | start of string, or start of line under /m |
| $ | end of string (before trailing \n), or end of line under /m |
| \A | absolute start of string |
| \z | absolute end of string |
| \Z | end of string or before trailing \n |
| \b | word boundary |
| \B | non-word-boundary |
| \G | end of previous /g match |
| (?=…) | positive lookahead |
| (?!…) | negative lookahead |
| (?<=…) | positive lookbehind |
| (?<!…) | negative lookbehind |