---
name: regex anchors and assertions
---

# Anchors and assertions

Anchors and assertions match *positions* rather than characters. They test where in the string you are, not what is there. A plain character class consumes one character; an anchor consumes none. They are called *zero-width assertions* for that reason: they have width zero, but they assert a property must hold.

## Start and end of string

`^` matches at the start of the string. `$` matches at the end, or just before a final newline.

```perl
"housekeeper" =~ /keeper/;      # matches: 'keeper' appears somewhere
"housekeeper" =~ /^keeper/;     # does not match: not at start
"housekeeper" =~ /keeper$/;     # matches: at end
"housekeeper\n" =~ /keeper$/;   # matches: before the trailing newline
```

Used together, `^…$` forces the pattern to account for the whole string:

```perl
"bert" =~ /^bert$/;      # matches
"bertram" =~ /^bert$/;   # does not match
"dilbert" =~ /^bert$/;   # does not match
"" =~ /^$/;              # matches: empty string
```

For a literal string, `$s eq "bert"` is faster and clearer. `^…$` only earns its keep once the middle uses real regexp features.

## Absolute versus line-relative anchors

Under the `/m` modifier, `^` and `$` anchor at every line boundary inside the string, not just the outer ends:

```perl
my $x = "There once was a girl\nWho programmed in Perl\n";
$x =~ /^Who/;    # does not match: 'Who' is not at string start
$x =~ /^Who/m;   # matches: 'Who' is at start of second line
```

For the absolute ends *regardless of `/m`*, use the dedicated anchors:

- `\A`: absolute start of string.
- `\z`: absolute end of string.
- `\Z`: end of string, or just before a final newline. Similar to `$` but unaffected by `/m`.
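
A practical consequence of the end anchors listed above, as a minimal sketch: `$` tolerates a trailing newline, so `^…$` is a looser whole-string check than `\A…\z`. The helper names `is_digits_loose` and `is_digits_strict` are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical validators, named for illustration only.
# Both intend "the whole string is digits" but differ in their anchors.
sub is_digits_loose  { return $_[0] =~ /^\d+$/   ? 1 : 0 }
sub is_digits_strict { return $_[0] =~ /\A\d+\z/ ? 1 : 0 }

for my $input ("123", "123\n", "12\n3") {
    (my $shown = $input) =~ s/\n/\\n/g;   # make newlines visible when printing
    printf "%-8s loose=%d strict=%d\n",
        qq("$shown"), is_digits_loose($input), is_digits_strict($input);
}
```

Only `"123\n"` separates the two: it passes the loose check and fails the strict one. Without `/m`, neither accepts `"12\n3"`, because the newline there is not the final character.
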

```perl
$x =~ /^Who/m;     # matches, as above
$x =~ /\AWho/m;    # does not match: \A is string start, always
$x =~ /girl$/m;    # matches, end of first line
$x =~ /girl\Z/m;   # does not match: 'girl' is not at string end
$x =~ /Perl\Z/m;   # matches: end or before final newline
$x =~ /Perl\z/m;   # does not match: \z is strict end
```

Rule of thumb: `\A` and `\z` when you mean the whole string; `^` and `$` when you mean lines (usually with `/m`).

## Word boundaries

`\b` matches a position between a word character (`\w`) and a non-word character (`\W`), or between a word character and the edge of the string. It does not consume any character.

```perl
my $x = "Housecat catenates house and cat";
$x =~ /cat/;       # matches 'cat' inside 'Housecat'
$x =~ /\bcat/;     # matches 'cat' inside 'catenates'
$x =~ /cat\b/;     # matches 'cat' inside 'Housecat'
$x =~ /\bcat\b/;   # matches the standalone 'cat' at end
```

`\B` is the negation: a position that is *not* a word boundary.

For natural-language splitting that handles apostrophes, hyphens, and other subtleties, use `\b{wb}` (available from Perl 5.22):

```perl
"don't" =~ /.+?\b{wb}/x;   # matches the whole string
```

## \G — the last-match position

`\G` anchors at the position where the previous successful `/g` match ended on the same string. It is what makes tokenisation with `/g` in `while` loops robust.

```perl
my $s = "12abc34";
while ($s =~ /\G(\d+|[a-z]+)/gc) {
    print "got '$1'\n";
}
# prints "got '12'", "got 'abc'", "got '34'", one per line
```

Without `\G`, each `/g` match may begin anywhere at or after the point where the previous one ended, so the engine silently skips characters that fit neither alternative, quietly losing data. The `/c` in `/gc` keeps the match position where it was when a match fails; it is covered in the modifiers chapter.

## Lookahead

Lookahead checks what comes next without consuming it.
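
A minimal sketch of what "without consuming" means, using the positive lookahead form `(?=…)`: the text the lookahead inspects stays out of the match in `$&`, and `pos` stops in front of it. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $s = "foobar";
my ($matched, $pos_after, $next);

# The lookahead requires 'bar' to follow, but only 'foo' is matched.
if ($s =~ /foo(?=bar)/g) {
    $matched   = $&;         # text actually matched
    $pos_after = pos($s);    # where the match ended
}

# 'bar' was not consumed, so a match anchored with \G can still claim it.
if ($s =~ /\Gbar/g) {
    $next = $&;
}

print "matched '$matched', pos $pos_after, then '$next'\n";
# prints: matched 'foo', pos 3, then 'bar'
```
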
Positive lookahead is `(?=…)`:

```perl
my $x = "I catch the housecat 'Tom-cat' with catnip";
$x =~ /cat(?=\s)/;   # matches 'cat' in 'housecat': space follows
$x =~ /cat(?!\s)/;   # matches 'cat' in 'catch': no space follows
```

`(?!…)` is negative lookahead: match only if the inner pattern does *not* apply. Zero-width, just like `^` and `$`.

Worked example: match `foo` only when `bar` does not immediately follow it:

```perl
"foobar" =~ /foo(?!bar)/;   # does not match
"foobaz" =~ /foo(?!bar)/;   # matches
```

## Lookbehind

Lookbehind checks what came before. Positive lookbehind is `(?<=…)`:

```perl
"cats and dogs" =~ /(?<=and )dogs/;   # matches 'dogs' after 'and '
```

Negative lookbehind is `(?