Basics

The smallest useful regexp is a plain string. "Hello World" =~ /World/ asks: does the string on the left contain the pattern on the right? It does, so the expression is true.

if ("Hello World" =~ /World/) {
    print "matched\n";
}

The // enclose the pattern. The =~ operator binds the pattern to the string you want to test. Without a binding operator, Perl applies the pattern to $_ instead.

The match operator

The long form is m//:

"Hello World" =~ m/World/;
"Hello World" =~ m!World!;    # alternate delimiters
"Hello World" =~ m{World};    # paired delimiters

m lets you pick any delimiter. That matters when the pattern itself contains the default delimiter /. Compare:

"/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # leaning toothpick syndrome
"/usr/bin/perl" =~ m!/usr/bin/perl!;    # clearer

Paired delimiters ({}, (), [], <>) nest, which is useful when the pattern contains the delimiter character.
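
As a minimal sketch of that nesting (the sample strings and variable names are invented for illustration), balanced braces or parentheses inside the pattern need no escaping:

```perl
use strict;
use warnings;

# The quantifier braces {2,4} are balanced, so they nest inside m{...}
# without any escaping.
my $year_ok = "born 1987" =~ m{\b\d{2,4}\b} ? 1 : 0;

# Same idea with parentheses: the capture group nests inside m(...).
my ($ext) = "report.txt" =~ m(\.(\w+)\z);

print "year_ok=$year_ok ext=$ext\n";   # year_ok=1 ext=txt
```

An unbalanced delimiter character inside the pattern would still need a backslash.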

Without m, the only delimiter available is the slash: /pat/. Any other delimiter needs the leading m: m{pat}, not {pat}.

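A small but useful corner: m'' (single quotes as delimiters) makes the pattern single-quotish - no variable interpolation. Useful when the pattern contains $ or @ sequences that Perl would otherwise try to interpolate. The regex compiler still sees those characters, though, so $ keeps its meaning as an end-of-string anchor unless you escape it.

'price: $10' =~ m'\$\d+';         # matches - \$ is a literal dollar sign
'user@example' =~ m'@example';    # matches - @example is not interpolated
'price: $10' =~ m'$10';           # no interpolation, but fails - $ is still an anchor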

Other regex operators

m// is one of four operators that use this quote-like syntax:

Operator	Purpose
`m//`	match - does the pattern match the string?
`s///`	substitute - replace the match with something else
`qr//`	compile a pattern object for reuse
`tr///` (also `y///`)	character-by-character translation

tr/// is not a regex operator despite living in the same neighbourhood - it translates characters one for one and does no pattern matching. It is mentioned for completeness; see tr for its semantics.

m, s, and qr all share the regex syntax covered in this guide. The differences are in what they do with a successful match.
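
A small sketch of that difference, using an invented sample string: the same kind of pattern is matched, substituted, and compiled for later use.

```perl
use strict;
use warnings;

my $text = "feed the cat";

# m// reports whether (and in list context, what) the pattern matched.
my ($animal) = $text =~ m/(cat|dog)/;

# s/// edits the string in place and returns the number of replacements.
my $count = ($text =~ s/cat/dog/);

# qr// matches nothing by itself; it returns a compiled pattern for later use.
my $re    = qr/dog/;
my $found = $text =~ $re ? "yes" : "no";

print "$animal $count $text $found\n";   # cat 1 feed the dog yes
```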

Pattern reuse with `qr//`

qr// compiles a pattern once and produces a reusable object. Use it when the same pattern is matched repeatedly, especially in loops:

my $word = qr/\b[a-z]+\b/;

for my $line (@lines) {
    while ($line =~ /$word/g) {
        print "$&\n";
    }
}

The compiled qr// object is dropped into other patterns by interpolation. It also lets you build patterns out of named pieces, which becomes essential for any regex over about ten lines:

my $name   = qr/[A-Z][a-z]+/;
my $number = qr/\d+/;
my $entry  = qr/$name \s+ $number/x;

"Alice 42" =~ /^$entry$/;   # matches

Each piece is written and named once, and $entry is itself a compiled object that can be reused in the same way.
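
One consequence worth seeing in a sketch: a qr// object carries its own flags into whatever pattern it is interpolated into. The names $unit and $qty and the sample strings below are made up for the example.

```perl
use strict;
use warnings;

my $unit = qr/kg|lb/i;             # case-insensitive piece
my $qty  = qr/\d+ \s* $unit/x;     # /x applies to this pattern's own text

my @results;
for my $text ("weight: 70 KG", "WEIGHT: 70 kg") {
    # The enclosing pattern has no /i, so "weight" must be lowercase,
    # while the unit may be in either case.
    push @results, ($text =~ /weight: $qty/ ? "match" : "no match");
}

print join(", ", @results), "\n";   # match, no match
```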

Binding: `=~` and `!~`

=~ asks “does it match?”. !~ asks “does it fail to match?”.

my $s = "Hello World";

print "yes\n" if $s =~ /World/;   # prints "yes" - the pattern matches
print "no\n"  if $s !~ /planet/;  # prints "no" - the pattern does not match

!~ is not a separate regexp construct - it is the negated binding. It is equivalent to not ($s =~ /pat/).
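
The equivalence can be checked directly. This sketch compares the two spellings for one pattern that matches and one that does not; the variable names are illustrative.

```perl
use strict;
use warnings;

my $s = "Hello World";
my @rows;

for my $pat ("World", "planet") {
    my $negated  = ($s !~ /$pat/)     ? 1 : 0;   # negated binding
    my $inverted = (not $s =~ /$pat/) ? 1 : 0;   # ordinary match, then negation
    push @rows, "$pat: $negated $inverted";
}

print "$_\n" for @rows;
# World: 0 0
# planet: 1 1
```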

Matching against `$_`

If you omit the binding, the match is against $_:

for ("cat", "dog", "bird") {
    print "has an 'o'\n" if /o/;   # implicit: $_ =~ /o/
}

This is idiomatic in while (<>) loops, inside grep and map, and inside for loops that set $_.
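
For instance, grep and map both set $_ for each element, so a bare pattern inside the block tests the current element. The list below is invented for the sketch.

```perl
use strict;
use warnings;

my @animals = ("cat", "dog", "bird", "horse");

# grep keeps the elements for which the bare match against $_ succeeds.
my @with_o = grep { /o/ } @animals;

# map can use the implicit match and its capture in the same way.
my @initials = map { /^(\w)/ ? uc($1) : () } @animals;

print "@with_o\n";     # dog horse
print "@initials\n";   # C D B H
```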

Case sensitivity and the default anchor

Matches are case-sensitive and unanchored:

"Hello" =~ /hello/;    # does not match - case differs
"Hello" =~ /ell/;      # matches - inside the string is fine

To match case-insensitively, append the i modifier, as in /hello/i. To constrain the match to the start or end of the string, use anchors. Both are covered in their own chapters.

When a pattern could match at several positions, Perl tries from the left and takes the first one that works:

"That hat is red" =~ /hat/;   # matches 'hat' in 'That', not in 'hat'

The “leftmost match wins” rule is fundamental and pre-empts every other match preference: a match at an earlier position is always better than a match at a later position, regardless of what choices either match makes internally. This is sometimes called the bump-along property: the engine tries to match at position 0; if it fails, it bumps to position 1; if it fails, position 2; and so on, returning the first success.
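
To see where the match actually landed, the built-in arrays @- and @+ hold the start and end offsets of the last successful match. A minimal sketch with the same sentence:

```perl
use strict;
use warnings;

my $text = "That hat is red";
my ($start, $end);

if ($text =~ /hat/) {
    # $-[0] is the offset where the whole match begins, $+[0] where it ends.
    ($start, $end) = ($-[0], $+[0]);
}

# The standalone word 'hat' sits further right and is never reached.
my $second = index($text, "hat", $end);

print "matched at $start..$end, next 'hat' at $second\n";
# matched at 1..4, next 'hat' at 5
```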

Metacharacters

Most characters in a pattern match themselves. These do not:

{ } [ ] ( ) ^ $ . | * + ? \

Two more are special only in specific contexts:

- is a metacharacter only inside a character class, where it forms a range ([a-z]). Outside […] it is literal.
# is a metacharacter only under the /x flag, where it introduces a comment to end of line. Without /x it is literal.

Each metacharacter has a special meaning covered later. To match a literal copy of one, put a backslash in front:

"2+2=4" =~ /2+2/;    # fails - '+' is a quantifier, needs escaping
"2+2=4" =~ /2\+2/;   # matches

"end." =~ /end\./;   # matches a literal dot
"end." =~ /end./;    # also matches - but . matches any character,
                     # so this would also match "endx", "end ", etc.

The backslash itself is a metacharacter, so a literal backslash in a pattern needs \\:

'C:\WIN32' =~ /C:\\WIN/;    # matches

A metacharacter that has nothing special to do in its context reverts to matching itself. } only closes a {…} quantifier; outside that context it is a literal }. This is convenient but easy to misread; see Strict mode below.

Strict mode: `use re 'strict'`

use re 'strict' turns previously-tolerated regex sloppiness into compile-time errors. Use it when you want the regex compiler to flag patterns that probably mean something different from what you wrote:

use re 'strict';

/abc{,1/;        # error: unescaped '{' in non-quantifier context
/(?-p)/;         # error: useless negation of always-on flag
/[a-]/;          # error: dash at end of class

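Strict mode is per-lexical-scope, so it can be turned on for regex-heavy modules without affecting other code. It is not the default because it would break working older patterns, and the pragma is still marked experimental: enabling it raises a warning in the experimental::re_strict category unless that category is disabled. New code should consider it.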

Escape sequences

Non-printing characters use the same escapes as in double-quoted strings:

Sequence	Matches
`\t`	tab
`\n`	newline
`\r`	carriage return
`\f`	form feed
`\a`	alert (bell)
`\e`	escape (`\x1B`)
`\0`	NUL byte
`\xHH`	byte with hex value HH
`\x{…}`	Unicode codepoint with hex value
`\o{…}`	octal codepoint
`\cX`	control-X
`\N{…}`	Unicode character by name

"1000\t2000" =~ /0\t2/;      # matches
"a\x{263a}b" =~ /\x{263a}/;  # matches U+263A, WHITE SMILING FACE

The full Unicode story is in the unicode chapter.
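
Several rows of the table can name the same character. The sketch below writes a tab five different ways and checks each against the same string; it assumes a Perl recent enough to support \o{…} (5.14 or later).

```perl
use strict;
use warnings;

my $text = "1000\t2000";

# Five spellings of the tab character (code point 9) inside a pattern.
my @forms = (qr/0\t2/, qr/0\x092/, qr/0\x{9}2/, qr/0\cI2/, qr/0\o{11}2/);

my $hits = grep { $text =~ $_ } @forms;
print "$hits of ", scalar(@forms), " forms matched\n";   # 5 of 5 forms matched

# \N{U+263A} names the same code point as \x{263a}.
my $smiley_ok = "a\x{263a}b" =~ /a\N{U+263A}b/ ? 1 : 0;
print "smiley_ok=$smiley_ok\n";                          # smiley_ok=1
```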

Variables in patterns

A pattern is (by default) interpolated like a double-quoted string, so variables are substituted before matching:

my $word = "house";
"housecat" =~ /$word/;       # matches
"housecat" =~ /${word}cat/;  # matches - braces disambiguate

To match a literal $ or @, escape it:

'price: $10' =~ /\$10/;      # matches a literal dollar sign

If a user-supplied string will be interpolated into a pattern and you want its metacharacters treated literally, use quotemeta - or its in-pattern equivalent \Q…\E:

my $input = "1+1";
"1+1=2" =~ /\Q$input\E/;     # matches the literal string

Without \Q…\E the + would be read as a quantifier.
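
The function form does the same escaping ahead of time. A short sketch, with illustrative variable names, showing what the unescaped and escaped versions each match:

```perl
use strict;
use warnings;

my $input = "1+1";
my $safe  = quotemeta($input);   # backslash-escapes the '+'

print "$safe\n";   # 1\+1

# Unescaped, the pattern means "one or more 1s, then a 1".
my $raw_hits_11   = "11"    =~ /$input/ ? 1 : 0;   # 1
my $raw_hits_sum  = "1+1=2" =~ /$input/ ? 1 : 0;   # 0
# Escaped, it means the literal text 1+1.
my $safe_hits_sum = "1+1=2" =~ /$safe/  ? 1 : 0;   # 1

print "$raw_hits_11 $raw_hits_sum $safe_hits_sum\n";   # 1 0 1
```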

How a regex is read

Perl reads a regex pattern in four phases. Knowing the phases explains a few “why does that work?” questions:

Phase A: the parser identifies the delimiter and finds the end of the pattern. (?#…) comments are removed.
Phase B: the pattern is parsed as a double-quotish string - variables interpolate, escape sequences cook, \Q…\E translates to quotemeta-style escaping.
Phase C: under /x or /xx, unescaped whitespace and comments after # are stripped.
Phase D: the regex compiler reads the result and turns it into the engine’s internal form.

The order matters because phases B and D operate on different representations. \Q$dir\E cooks at Phase B, before the regex compiler sees it - by Phase D, the variable’s contents are already escaped, and the regex compiler sees a literal pattern. Conversely, \U…\E is interpreted by Phase B as a string-cook directive (uppercase the contents), which is almost certainly not what you want inside a regex. The convention is: \Q and \E for regex purposes; \U, \L, \u, \l only when you know what you are doing.
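
A sketch of both effects, assuming a hypothetical $dir and $word: the escaping and the uppercasing each happen to the interpolated text before the regex compiler runs.

```perl
use strict;
use warnings;

my $dir = "/tmp/a.b";

# \Q...\E escapes the contents of $dir, so '.' and '/' are literal.
my $under_dir = qr/^\Q$dir\E/;
my $exact_hit = "/tmp/a.b/file" =~ $under_dir ? 1 : 0;   # 1
my $loose_hit = "/tmp/aXb/file" =~ $under_dir ? 1 : 0;   # 0 - '.' is not a wildcard here

my $word = "cat";

# \U...\E uppercases the interpolated text: the pattern compiled is CAT.
my $upper_hit = "CAT" =~ /\U$word\E/ ? 1 : 0;   # 1
my $lower_hit = "cat" =~ /\U$word\E/ ? 1 : 0;   # 0

print "$exact_hit $loose_hit $upper_hit $lower_hit\n";   # 1 0 1 0
```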

(?#comment) is removed before Phase B sees the pattern at all. A literal # inside (?#…) is fine. A literal ) is not - the comment ends at the first ).

Earliest match wins - the bump-along

Friedl puts it precisely:

The match that begins earliest wins.

Position 0 is tried first, then 1, then 2, and so on. The engine never prefers a later, longer, or “more aesthetic” match over an earlier one. This is so foundational it shapes every other rule in this guide:

“Greedy quantifiers grab as much as possible” - at the current starting position. The grabbing happens after the bump-along has chosen the position.
“Leftmost alternative wins” - at the current starting position. The alternative chosen affects what is captured but not where the match starts.
Anchors like ^ constrain what positions are legal, not what is preferred among legal ones.
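
The three rules above can each be checked with a one-line match. The strings are invented for the sketch; each capture shows what the engine settled on.

```perl
use strict;
use warnings;

# Greedy, but only from the earliest starting position: 'aaa', not the longer 'bbbbb'.
my ($greedy) = "aaa bbbbb" =~ /([ab]+)/;

# Alternative order does not override position: 'hat' starts earlier than 'red'.
my ($earliest) = "That hat is red" =~ /(red|hat)/;

# At the same position, the leftmost alternative that works is taken.
my ($first_alt) = "catalog" =~ /(cat|catalog)/;

# An anchor removes positions from consideration: 'c' exists, but not at the start.
my $anchored = "abc" =~ /^c/ ? 1 : 0;

print "$greedy $earliest $first_alt $anchored\n";   # aaa hat cat 0
```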

Say what you mean

Friedl’s recurring point: vagueness in the regex causes both correctness problems and performance problems. The example he hammers:

/-?[0-9]*\.?[0-9]*/

Read as English: “an optional sign, optional digits, optional decimal, optional digits.” Read as code: every part is optional, so the pattern can match the empty string, and it does so at the start of any input that does not begin with a number. Apply it to “nothing here” and it matches at position 0, consuming nothing, and reports success.

The fix is to require at least one digit on at least one side of the decimal:

/-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+/

Two alternatives, each requiring at least one digit. The pattern now matches what its author intended.

The lesson generalises. A pattern that can match the empty string usually will match the empty string somewhere. If your match has to mean something, write requirements that force it to mean something.
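
A quick comparison of the two patterns makes the difference concrete. The names $vague and $strict and the test inputs are chosen for this sketch only.

```perl
use strict;
use warnings;

my $vague  = qr/-?[0-9]*\.?[0-9]*/;
my $strict = qr/-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+/;

my %result;
for my $text ("nothing here", "-3.14", ".5", "-") {
    # Wrap each pattern in a capture group so $1 holds the whole match.
    my $v = $text =~ /($vague)/  ? "'$1'" : "no match";
    my $s = $text =~ /($strict)/ ? "'$1'" : "no match";
    $result{$text} = [$v, $s];
    print "$text: vague=$v strict=$s\n";
}
# The vague pattern succeeds on every input, with '' for "nothing here"
# and '-' for a bare minus sign; the strict one rejects both of those.
```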

Substitution at a glance

Replacing text uses the s/// operator, which takes a pattern and a replacement string:

my $x = "feed the cat";
$x =~ s/cat/dog/;            # $x is now "feed the dog"

Substitution is covered in depth in its own chapter; it is mentioned here so you can combine it with the facts above. Almost everything that applies to m// patterns applies inside s/// patterns too.
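
As a brief sketch of that carry-over, alternate delimiters and \Q…\E behave inside s/// just as they do in m//. The paths and strings here are illustrative.

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";

# Paired delimiters avoid escaping the slashes in the pattern.
(my $moved = $path) =~ s{^/usr/bin}{/opt/bin};

my $old  = "a.b";
my $line = "a.b axb";

# With \Q...\E the dot is literal; without it, the dot matches any character.
(my $literal = $line) =~ s/\Q$old\E/X/g;
(my $loose   = $line) =~ s/$old/X/g;

print "$moved\n";     # /opt/bin/perl
print "$literal\n";   # X axb
print "$loose\n";     # X X
```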

Where to go next

Literal matches get you surprisingly far, but every real regexp uses character classes, anchors, or quantifiers. Character classes come next - they let one position in the pattern accept any of several characters.

If you are reading the guide for a specific question, the chapters are independent and the index lists them all. Patterns that read fine but run for hours are covered in performance - when in doubt, the answer is usually there.