Anchors and assertions

Anchors and assertions match positions rather than characters. They test where in the string you are, not what is there. A plain character class consumes one character; an anchor consumes none.

That is why they are called zero-width assertions: they match zero characters, yet the overall match fails unless the asserted property holds at that position.

Start and end of string

^ matches at the start of the string. $ matches at the end, or just before a final newline.

"housekeeper" =~ /keeper/;    # matches — 'keeper' appears somewhere
"housekeeper" =~ /^keeper/;   # does not match — not at start
"housekeeper" =~ /keeper$/;   # matches — at end
"housekeeper\n" =~ /keeper$/; # matches — before the trailing newline

Used together, ^…$ forces the pattern to account for the whole string:

"bert"    =~ /^bert$/;     # matches
"bertram" =~ /^bert$/;     # does not match
"dilbert" =~ /^bert$/;     # does not match
""        =~ /^$/;         # matches — empty string

For a literal string, $s eq "bert" is faster and clearer. ^…$ only earns its keep once the middle uses real regexp features.
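As a small sketch of where ^…$ starts to pay off, the check below accepts only strings shaped like a four-digit year, a two-digit month and a two-digit day. The pattern and the sample inputs are illustrative assumptions rather than part of the text above, and the check covers shape only, not calendar validity.

```perl
use strict;
use warnings;

# Hypothetical inputs; only the first has the exact YYYY-MM-DD shape.
my @candidates = ("2024-01-31", "x2024-01-31", "2024-01-31 extra", "2024-1-31");

# ^ and $ force the pattern to account for the whole string, not a part of it.
my @dates = grep { /^\d{4}-\d{2}-\d{2}$/ } @candidates;

print "accepted: @dates\n";
```

Dropping either anchor would let "x2024-01-31" or "2024-01-31 extra" through, because the pattern could then match a substring.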

Absolute versus line-relative anchors

Under the /m modifier, ^ and $ anchor at every line boundary inside the string, not just the outer ends:

my $x = "There once was a girl\nWho programmed in Perl\n";

$x =~ /^Who/;    # does not match — 'Who' is not at string start
$x =~ /^Who/m;   # matches — 'Who' is at start of second line
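One way to see the effect of /m is to collect the first word of every line with a /g match. This is a minimal sketch; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $poem = "There once was a girl\nWho programmed in Perl\n";

# Without /m, ^ anchors only at the start of the string.
my @first_only = $poem =~ /^(\w+)/g;

# With /m, ^ also anchors after each embedded newline.
my @per_line = $poem =~ /^(\w+)/mg;

print "without /m: @first_only\n";
print "with /m:    @per_line\n";
```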

For the absolute ends regardless of /m, use the dedicated anchors:

  • \A — absolute start of string.

  • \z — absolute end of string.

  • \Z — end of string, or just before a final newline. Similar to $ but unaffected by /m.

$x =~ /^Who/m;    # matches, as above
$x =~ /\AWho/m;   # does not match — \A is string start, always

$x =~ /girl$/m;   # matches, end of first line
$x =~ /girl\Z/m;  # does not match — 'girl' is not at string end
$x =~ /Perl\Z/m;  # matches — end or before final newline
$x =~ /Perl\z/m;  # does not match — \z is strict end

Rule of thumb: \A and \z when you mean the whole string; ^ and $ when you mean lines (usually with /m).
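The difference matters most when validating input that may still carry its line terminator. The sketch below assumes a hypothetical value read from a file and not yet chomped.

```perl
use strict;
use warnings;

my $input = "42\n";   # hypothetical: a line that has not been chomped

# $ tolerates one final newline, so this check passes.
my $loose = $input =~ /^\d+$/ ? 1 : 0;

# \z demands the true end of the string, so this check fails.
my $strict = $input =~ /\A\d+\z/ ? 1 : 0;

print "loose=$loose strict=$strict\n";
```

If the value is later used where a newline is harmful (a filename, a header field), the \A…\z form is the safer of the two.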

Word boundaries

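\b matches a position between a word character (\w) and a non-word character (\W), in either order. The start and end of the string count as non-word characters, so \b also matches at either edge when the adjacent character is a word character. It does not consume any character.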

my $x = "Housecat catenates house and cat";

$x =~ /cat/;      # matches 'cat' inside 'Housecat'
$x =~ /\bcat/;    # matches 'cat' inside 'catenates'
$x =~ /cat\b/;    # matches 'cat' inside 'Housecat'
$x =~ /\bcat\b/;  # matches the standalone 'cat' at end

\B is the negation — a position that is not a word boundary.
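To make the contrast concrete, the sketch below counts the occurrences of 'cat' in the same sentence under each anchor. The counting idiom and the variable names are choices made for this illustration.

```perl
use strict;
use warnings;

my $text = "Housecat catenates house and cat";

# Assigning a /g match to an empty list in scalar context yields the match count.
my $anywhere = () = $text =~ /cat/g;      # every 'cat'
my $at_start = () = $text =~ /\bcat/g;    # 'cat' that begins a word
my $inside   = () = $text =~ /\Bcat/g;    # 'cat' that does not begin a word
my $whole    = () = $text =~ /\bcat\b/g;  # 'cat' as a complete word

print "anywhere=$anywhere at_start=$at_start inside=$inside whole=$whole\n";
```

Every position is either a word boundary or not, so the \b and \B counts add up to the unanchored count.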

For natural-language text, where an apostrophe inside a word should not count as a boundary, use \b{wb} (available since Perl 5.22):

"don't" =~ /.+?\b{wb}/x;   # matches the whole string

\G — the last-match position

\G anchors at the position where the previous successful /g match ended on the same string. It is what makes tokenisation with /g in while loops robust.

my $s = "12abc34";
while ($s =~ /\G(\d+|[a-z]+)/gc) {
    print "got '$1'\n";
}
# prints: got '12', got 'abc', got '34' (one per line)

Without \G, each attempt would still start at the previous end position but would be free to scan forward, skipping over characters that fit neither alternative and quietly losing data. The /c in /gc preserves the position when a match fails instead of resetting it; it is covered in the modifiers chapter.
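The data loss is easiest to see with an input that contains a character neither alternative accepts. The following sketch uses a hypothetical input with a stray '!' and compares an unanchored /g match with the \G loop. pos is used to report where tokenising stopped.

```perl
use strict;
use warnings;

my $s = "12abc!34";   # hypothetical input; '!' fits neither alternative

# Unanchored /g scans forward past the '!' and silently drops it.
my $copy  = $s;
my @lossy = $copy =~ /(\d+|[a-z]+)/g;

# \G pins each attempt to where the previous match ended,
# and /c keeps pos() in place when an attempt fails.
my @tokens;
while ($s =~ /\G(\d+|[a-z]+)/gc) {
    push @tokens, $1;
}
my $stopped_at = pos($s) // 0;

print "lossy:  @lossy\n";
print "strict: @tokens (stopped at offset $stopped_at of ", length($s), ")\n";
```

Because the loop stops at the first offset it cannot tokenise, comparing pos with the string length tells the caller whether the whole input was consumed.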

Lookahead

Lookahead checks what comes next without consuming it. Positive lookahead is (?=…):

my $x = "I catch the housecat 'Tom-cat' with catnip";

$x =~ /cat(?=\s)/;   # matches 'cat' in 'housecat' — space follows
$x =~ /cat(?!\s)/;   # matches 'cat' in 'catch' — no space follows

(?!…) is negative lookahead: it succeeds only if the inner pattern does not match at the current position. Like (?=…), it is zero-width, just as ^ and $ are.

Worked example: match foo only when bar does not immediately follow it:

"foobar" =~ /foo(?!bar)/;   # does not match
"foobaz" =~ /foo(?!bar)/;   # matches
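Because lookahead consumes nothing, it is handy in substitutions that must inspect the text to the right without replacing it. The sketch below is an illustrative assumption rather than something from the text above: it inserts thousands separators into a string of digits.

```perl
use strict;
use warnings;

my $n = "1234567";   # hypothetical input: digits only

# Put a comma after every digit that is followed by one or more complete
# groups of three digits running up to the end of the string.
(my $with_commas = $n) =~ s/(\d)(?=(?:\d{3})+\z)/$1,/g;

# A three-digit number has no such position and is left alone.
(my $short = "123") =~ s/(\d)(?=(?:\d{3})+\z)/$1,/g;

print "$with_commas $short\n";
```

The digits inside the lookahead are only inspected, so they remain available to later iterations of /g.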

Lookbehind

Lookbehind checks what came before. Positive lookbehind is (?<=…):

"cats and dogs" =~ /(?<=and )dogs/;   # matches 'dogs' after 'and '

Negative lookbehind is (?<!…):

"prefix_foo suffix_foo" =~ /(?<!prefix_)foo/;
# matches 'foo' in 'suffix_foo'

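Historically, lookbehind had to be fixed-width: (?<=ab|bc) was fine, while (?<=a|bb) and (?<=a+) were not. Perl 5.30 introduced variable-width lookbehind as an experimental feature, limited to patterns that can match at most 255 characters, so (?<=a|bb) is now accepted while the unbounded (?<=a+) is still rejected. Fixed-width lookbehind is still faster to match.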
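A fixed-width lookbehind is often all that is needed. The sketch below uses a made-up input line to pull out the number that follows a dollar sign, and then the numbers that do not.

```perl
use strict;
use warnings;

my $line = "qty: 7, price: \$42, discount: 5";   # hypothetical input

# (?<=\$) is a one-character lookbehind: the '$' must precede the digits
# but is not part of the match.
my ($price) = $line =~ /(?<=\$)(\d+)/;

# (?<!\$) plus \b: whole numbers that are not preceded by '$'.
# The \b stops the match from starting in the middle of '42'.
my @others = $line =~ /(?<!\$)\b(\d+)/g;

print "price=$price others=@others\n";
```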

Combining lookaround with split

Anchors and lookaround let split separate a string on invisible positions rather than visible characters:

my $str = "one two - --6-8";
my @toks = split / \s+            # whitespace
                 | (?<=\S) (?=-)  # non-space followed by '-'
                 | (?<=-)  (?=\S) # '-' followed by non-space
                 /x, $str;
# @toks = ("one", "two", "-", "-", "-", "6", "-", "8")

The second and third alternatives match between characters and consume nothing. They still work as separators because split accepts a zero-width match and simply cuts the string at that position.
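The same idea works with no visible separator at all. As a minimal sketch, assuming a hypothetical camelCase identifier, the pattern below splits wherever a lowercase letter is immediately followed by an uppercase one.

```perl
use strict;
use warnings;

my $name = "parseHttpResponse";   # hypothetical identifier

# Lookbehind and lookahead together describe a position, not a character,
# so no letters are removed by the split.
my @parts = split /(?<=[a-z])(?=[A-Z])/, $name;

print join(" | ", @parts), "\n";
```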

Summary

Anchor    Matches
^         start of string, or start of each line under /m
$         end of string (or just before a trailing \n), or end of each line under /m
\A        absolute start of string
\z        absolute end of string
\Z        end of string or just before a trailing \n; ignores /m
\b        word boundary
\B        not a word boundary
\G        end of previous /g match
(?=…)     positive lookahead
(?!…)     negative lookahead
(?<=…)    positive lookbehind
(?<!…)    negative lookbehind

See also

  • perlre — complete reference, including the less-common anchors \K and \b{…} variants.

  • split — splitting at zero-width positions.

  • pos — read or set the position \G anchors at.