Regular Expressions

PetaPerl implements a Perl 5-compatible regex engine that supports nearly the full range of Perl regex features; the remaining gaps are listed under Current Limitations.

Pattern Syntax

Literals

/hello/         # Match literal "hello"
/foo bar/       # Match "foo bar"

Metacharacters

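Char  Meaning
.     Any character except newline (any character under /s)
^     Start of string (also start of each line under /m)
$     End of string or before a final newline (also end of each line under /m)
\A    Start of string (absolute)
\z    End of string (absolute)
\Z    End of string or before final newline
\b    Word boundary
\B    Not word boundary
\G    Position where the previous /g match ended

/^start/        # Must be at start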
/end$/          # Must be at end
/\bword\b/      # Whole word match
/\Abegin/       # Absolute start
/finish\z/      # Absolute end
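
The difference between $, \Z and \z only shows up when the string ends in a newline. The following is a minimal sketch of that case, using standard Perl 5 semantics; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = "finish\n";   # a string with one trailing newline

# $ and \Z both tolerate a single trailing newline; \z does not.
my $dollar = $line =~ /finish$/  ? 1 : 0;
my $big_z  = $line =~ /finish\Z/ ? 1 : 0;
my $z      = $line =~ /finish\z/ ? 1 : 0;

print "dollar: $dollar\n";   # dollar: 1
print "big Z:  $big_z\n";    # big Z:  1
print "z:      $z\n";        # z:      0

# Under /m, ^ also matches at the start of every internal line.
my $text  = "one\ntwo\nthree";
my @first = $text =~ /^(\w)/mg;   # first character of each line
print "@first\n";                 # o t t
```

When validating input that must not carry a trailing newline, \z is the stricter choice.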

Character Classes

[abc]           # Match a, b, or c
[^abc]          # Match anything except a, b, c
[a-z]           # Match lowercase letter
[A-Z0-9]        # Match uppercase or digit
[a-zA-Z_]       # Match a letter or underscore

Predefined Character Classes

Class  Matches                  Negated
\d     Digit [0-9]              \D (non-digit)
\w     Word char [a-zA-Z0-9_]   \W (non-word)
\s     Whitespace [ \t\n\r\f]   \S (non-whitespace)
\h     Horizontal whitespace    \H
\v     Vertical whitespace      \V

/\d+/           # One or more digits
/\w+/           # One or more word chars
/\s*/           # Zero or more whitespace characters

POSIX Character Classes

[:alnum:]       # Alphanumeric [a-zA-Z0-9]
[:alpha:]       # Alphabetic [a-zA-Z]
[:ascii:]       # ASCII characters [0-127]
[:blank:]       # Space and tab
[:cntrl:]       # Control characters
[:digit:]       # Digits [0-9]
[:graph:]       # Visible characters (not space)
[:lower:]       # Lowercase letters
[:print:]       # Printable characters
[:punct:]       # Punctuation
[:space:]       # Whitespace
[:upper:]       # Uppercase letters
[:word:]        # Word characters [a-zA-Z0-9_]
[:xdigit:]      # Hex digits [0-9A-Fa-f]

Usage: POSIX classes are only valid inside a bracketed character class, for example [[:digit:]] or [[:alpha:][:digit:]].
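
As a small illustration of POSIX classes inside a bracketed class, the sketch below validates and cleans an identifier. The sample string and variable names are made up for the example.

```perl
use strict;
use warnings;

my $id = "user_42!";

# [[:alnum:]_] combines a POSIX class with a literal underscore.
my $valid = $id =~ /\A[[:alnum:]_]+\z/ ? 1 : 0;

# Strip all punctuation; in ASCII, [:punct:] includes "_" and "!".
(my $clean = $id) =~ s/[[:punct:]]//g;

print "valid: $valid\n";   # valid: 0 (the "!" is not allowed)
print "clean: $clean\n";   # clean: user42
```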

Quantifiers

Quantifier  Meaning    Greedy  Non-greedy  Possessive
*           0 or more  Yes     *?          *+
+           1 or more  Yes     +?          ++
?           0 or 1     Yes     ??          ?+
{n}         Exactly n  Yes     N/A         N/A
{n,}        n or more  Yes     {n,}?       N/A
{n,m}       n to m     Yes     {n,m}?      N/A

/a*/            # 0 or more 'a' (greedy)
/a*?/           # 0 or more 'a' (non-greedy)
/a+/            # 1 or more 'a'
/a?/            # 0 or 1 'a'
/a{3}/          # Exactly 3 'a'
/a{3,}/         # 3 or more 'a'
/a{3,5}/        # 3 to 5 'a'

Greedy vs Non-greedy: Greedy quantifiers match as much as possible; non-greedy quantifiers match as little as possible. In both cases the rest of the pattern must still match.

# Given: "foo123bar"
/\d+/           # Matches "123" (greedy)
/\d+?/          # Matches "1" (non-greedy: stops as soon as the pattern can succeed)
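
The difference matters most when a delimiter occurs more than once. A minimal sketch, assuming a made-up HTML-like input string:

```perl
use strict;
use warnings;

my $html = "<b>one</b> and <b>two</b>";

# Greedy .+ runs to the last </b> that still lets the pattern match.
my ($greedy) = $html =~ /<b>(.+)<\/b>/;

# Non-greedy .+? stops at the first </b>.
my ($lazy) = $html =~ /<b>(.+?)<\/b>/;

print "greedy: $greedy\n";   # greedy: one</b> and <b>two
print "lazy:   $lazy\n";     # lazy:   one
```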

Groups and Captures

Capturing Groups

/(foo)/         # Capture "foo" in $1
/(foo)(bar)/    # Capture in $1 and $2

Access captures with $1, $2, etc. The @- and @+ arrays hold the start and end offsets of each group.

Non-Capturing Groups

/(?:foo)/       # Group but don't capture

Use a non-capturing group when you need grouping but not the captured text; this avoids the capture overhead.

Named Captures

/(?<name>\w+)/  # Named capture "name"
/(?'name'\w+)/  # Alternative syntax

Access the captured text through the %+ hash, for example $+{name}.
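
A short sketch of named captures in use. The date format and the %date hash are assumptions made for the example; copying %+ is shown because the next successful match overwrites it.

```perl
use strict;
use warnings;

my $stamp = "2024-03-15";
my %date;

if ($stamp =~ /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/) {
    %date = %+;   # copy now: %+ is reset by the next successful match
}

print "$date{day}/$date{month}/$date{year}\n";   # 15/03/2024
```

Named groups are still numbered, so $1 holds the same text as $+{year} here.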

Alternation

/foo|bar/       # Match "foo" or "bar"
/(red|green|blue)/ # Capture color

Anchors and Assertions

Zero-Width Assertions

Assertion      Meaning
(?=pattern)    Positive lookahead
(?!pattern)    Negative lookahead
(?<=pattern)   Positive lookbehind
(?<!pattern)   Negative lookbehind

/foo(?=bar)/    # "foo" followed by "bar" (bar not consumed)
/foo(?!bar)/    # "foo" not followed by "bar"
/(?<=foo)bar/   # "bar" preceded by "foo"
/(?<!foo)bar/   # "bar" not preceded by "foo"
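
Because lookarounds consume nothing, a substitution can insert text at a position described purely by its surroundings. The sketch below adds thousands separators; commify is a hypothetical helper name, not part of PetaPerl.

```perl
use strict;
use warnings;

# Insert a comma at every position that has a digit before it and
# a whole number of three-digit groups between it and the end.
sub commify {
    my ($n) = @_;
    $n =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;
    return $n;
}

print commify("1234567"), "\n";   # 1,234,567
print commify("1000"), "\n";      # 1,000
print commify("123"), "\n";       # 123
```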

Atomic Groups

/(?>pattern)/   # Atomic group (no backtracking)

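Once the group has matched, the engine never backtracks into it to try a shorter or different match. This is mainly a performance optimization, but it can also change which strings match.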
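
A minimal sketch of how an atomic group can turn a match into a failure. The input string is chosen only to make the difference visible.

```perl
use strict;
use warnings;

my $s = "aaab";

# Plain a+ first takes "aaa", then gives one "a" back so that "ab" can match.
my $plain = $s =~ /a+ab/ ? 1 : 0;

# The atomic group keeps every "a" it took, so the literal "ab" can never match.
my $atomic = $s =~ /(?>a+)ab/ ? 1 : 0;

print "plain:  $plain\n";    # plain:  1
print "atomic: $atomic\n";   # atomic: 0
```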

Backreferences

/(foo)\1/       # Match "foofoo" - \1 references first capture
/(['"]).*?\1/   # Match quoted string (same quote type)

Named Backreferences

/(?<tag>\w+)...\k<tag>/ # Named backreference
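
A common use of backreferences is finding accidentally repeated words. The following sketch uses a named backreference; the sample sentence is invented for the example.

```perl
use strict;
use warnings;

my $text = "This is is a test of the the parser";
my @doubled;

# \k<word> must match exactly the text that (?<word>...) captured.
while ($text =~ /\b(?<word>\w+)\s+\k<word>\b/g) {
    push @doubled, $+{word};
}

print "@doubled\n";   # is the
```

The leading \b keeps the "is" inside "This" from being treated as the start of a word.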

Conditionals

/(?(condition)yes|no)/ # If condition matches, try "yes", else try "no"

Conditions can be:

  • Capture group number: (?(1)yes|no) - if group 1 matched
  • Named capture: (?(<name>)yes|no) - if named group matched
  • Lookahead: (?(?=test)yes|no) - if lookahead succeeds
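
As an illustration of the group-number form, the sketch below accepts a number with or without parentheses, but only if the parentheses are balanced. The pattern and test strings are assumptions chosen for the example.

```perl
use strict;
use warnings;

# Group 1 optionally matches "(". The conditional demands ")" only
# when group 1 took part in the match; its "no" branch is empty.
my $re = qr/\A(\()?\d+(?(1)\))\z/;

my %ok;
for my $s ("(123)", "123", "(123", "123)") {
    $ok{$s} = $s =~ $re ? 1 : 0;
}

print "$_ => $ok{$_}\n" for "(123)", "123", "(123", "123)";
# (123) => 1
# 123 => 1
# (123 => 0
# 123) => 0
```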

Modifiers

Modifiers change regex behavior. They are appended after the closing delimiter:

Modifier  Meaning
i         Case-insensitive
m         Multiline (^ and $ match at line boundaries)
s         Single-line (. matches newline)
x         Extended (ignore whitespace, allow comments)
g         Global (find all matches)
c         Keep the current match position after a failed match (used with g)
o         Compile once (legacy, not needed in PetaPerl)
e         Evaluate replacement as code (in s///)

/pattern/i      # Case-insensitive
/pattern/ms     # Multiline + single-line
/pattern/x      # Extended (readable)
/pattern/g      # Global matching

Examples

# Case-insensitive
if ($str =~ /hello/i) { ... }

# Multiline: ^ and $ match line starts/ends
while ($text =~ /^Line: (.+)$/mg) {
    print "Found: $1\n";
}

# Extended: whitespace and comments ignored
my $email_re = qr{
    (\w+)           # Username
    @               # At sign
    ([\w.]+)        # Domain
}x;

# Global: find all matches
my @words = $text =~ /\w+/g;
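
The g and c modifiers combine with the \G anchor to walk through a string one token at a time. The following is a minimal tokenizer sketch under standard Perl 5 semantics; the token labels and input are made up for the example.

```perl
use strict;
use warnings;

my $src = "x = 42 + y";
my @tokens;

pos($src) = 0;
while (pos($src) < length($src)) {
    # \G anchors each attempt at the current position. With /c a failed
    # attempt leaves pos() untouched, so the next branch starts there too.
    if    ($src =~ /\G\s+/gc)            { next }
    elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
    elsif ($src =~ /\G([=+])/gc)         { push @tokens, "OP:$1" }
    else  { die "unexpected character at offset " . pos($src) . "\n" }
}

print "@tokens\n";   # ID:x OP:= NUM:42 OP:+ ID:y
```

Without c, the first failed branch would reset pos() and the loop would restart from the beginning of the string.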

Matching and Substitution

Match Operator

$str =~ /pattern/       # True if matches
$str =~ /pattern/g      # Global: all matches in list context, next match in scalar context

In list context with captures:

my ($user, $domain) = $email =~ /(\w+)@([\w.]+)/;

In list context with global:

my @numbers = $text =~ /\d+/g;  # All numbers
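
When a global match in list context has several capture groups, the groups of every match are returned in one flat list, which fits a hash assignment well. A short sketch with an invented configuration string:

```perl
use strict;
use warnings;

my $config = "width=80, height=24, depth=8";

# Each match contributes a (key, value) pair to the flat list.
my %opt = $config =~ /(\w+)=(\d+)/g;

# Assigning to an empty list and taking the scalar result counts matches.
my $pairs = () = $config =~ /=/g;

print "$_=$opt{$_}\n" for sort keys %opt;
# depth=8
# height=24
# width=80
print "pairs: $pairs\n";   # pairs: 3
```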

Substitution

$str =~ s/old/new/      # Replace first occurrence
$str =~ s/old/new/g     # Replace all occurrences
$str =~ s/old/new/i     # Case-insensitive replace
$str =~ s/old/new/gi    # Global + case-insensitive

Replacement with captures:

$str =~ s/(\w+)@(\w+)/$2\@$1/;  # Swap the parts: user@domain becomes domain@user

Evaluated replacement:

$str =~ s/(\d+)/$1 * 2/ge;      # Double all numbers (g is needed to reach every number)
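
With e, the replacement side is ordinary Perl code, so it can call functions as well as do arithmetic. A small sketch with a made-up price list; note that s/// returns the number of substitutions made.

```perl
use strict;
use warnings;

my $text = "apple 1.5, pear 0.25";

# Double every number and format it with two decimals.
my $count = ($text =~ s/(\d+(?:\.\d+)?)/sprintf("%.2f", $1 * 2)/ge);

print "$text\n";            # apple 3.00, pear 0.50
print "changed: $count\n";  # changed: 2
```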

Transliteration

$str =~ tr/abc/xyz/     # Replace a→x, b→y, c→z
$str =~ y/abc/xyz/      # Same as tr
$str =~ tr/a-z/A-Z/     # Uppercase
$str =~ tr/ //d         # Delete spaces
$str =~ tr/a-z//c       # Count non-lowercase characters (returns the count, leaves $str unchanged)
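
The return value of tr is the number of characters it matched, which makes it a cheap character counter. A minimal sketch; the DNA string is just sample data.

```perl
use strict;
use warnings;

my $dna = "AACGTTTA";

# Empty replacement list without /d: count only, the string is unchanged.
my $a_count = ($dna =~ tr/A//);

# Transliterate a copy and keep the original intact.
(my $rna = $dna) =~ tr/T/U/;

# /c complements the search list: count anything that is not A, C, G or T.
my $invalid = ($dna =~ tr/ACGT//c);

print "A count: $a_count\n";   # A count: 3
print "RNA:     $rna\n";       # RNA:     AACGUUUA
print "invalid: $invalid\n";   # invalid: 0
```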

Special Variables

After a successful match:

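Variable    Contains
$&          Entire matched string
$`          String before match
$'          String after match
$1, $2, …   Capture groups
$+          Text of the highest-numbered group that took part in the match
@+          End offsets of the match ($+[0]) and of each capture group
@-          Start offsets of the match ($-[0]) and of each capture group
%+          Named captures

if ($str =~ /(foo)(bar)/) {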
    print "Full match: $&\n";     # "foobar"
    print "Group 1: $1\n";        # "foo"
    print "Group 2: $2\n";        # "bar"
    print "Before: $`\n";
    print "After: $'\n";
}
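
The offset arrays @- and @+ can recover the same text with substr, which is useful when the position matters as well as the content. A minimal sketch with an invented input string:

```perl
use strict;
use warnings;

my $str = "xxfoobarzz";
my ($start, $end, $whole, $second);

if ($str =~ /(foo)(bar)/) {
    # Index 0 describes the whole match; index N describes group N.
    ($start, $end) = ($-[0], $+[0]);
    $whole  = substr($str, $-[0], $+[0] - $-[0]);   # same text as $&
    $second = substr($str, $-[2], $+[2] - $-[2]);   # same text as $2
}

print "match from $start to $end ($whole)\n";   # match from 2 to 8 (foobar)
print "group 2 is $second\n";                   # group 2 is bar
```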

Regex Compilation

qr// Operator

Compile regex for reuse:

my $word = qr/\w+/;
my $email = qr/\w+@\w+\.\w+/;

if ($str =~ $word) { ... }
if ($str =~ /$word\@$word/) { ... }  # Interpolate (escape @ so that @$word is not read as an array dereference)

Benefits:

  • Compile once, use many times
  • Readable regex composition
  • Performance optimization
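
A sketch of composing a larger pattern from a smaller compiled piece. The $octet and $ipv4 names and the IPv4 use case are assumptions made for illustration.

```perl
use strict;
use warnings;

# One building block: a decimal number from 0 to 255.
my $octet = qr/(?:25[0-5]|2[0-4]\d|1?\d?\d)/;

# Reuse it four times by interpolation.
my $ipv4 = qr/\A$octet(?:\.$octet){3}\z/;

for my $addr ("192.168.0.1", "256.1.1.1", "1.2.3") {
    printf "%-12s %s\n", $addr, ($addr =~ $ipv4 ? "ok" : "rejected");
}
# 192.168.0.1  ok
# 256.1.1.1    rejected
# 1.2.3        rejected
```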

Performance Considerations

Anchored Patterns

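Patterns anchored with ^ (without /m) or \A can fail fast, because the engine only attempts a match at the start of the string:

/^pattern/      # Fast: tried only at the start
/pattern/       # Slower: may be tried at every position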

Atomic Groups

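Use atomic groups (?>...) to stop the engine from backtracking into a subpattern. The two patterns below are not equivalent: the atomic version fails faster, but it also rejects some strings that the plain version accepts.

# Slower: on failure, \d+ gives back digits one at a time
/\d+\w+/

# Faster: \d+ never gives digits back, so "12" no longer matches
/(?>\d+)\w+/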

Non-Capturing Groups

Use (?:...) when captures aren’t needed:

/(?:foo|bar)/   # Faster than /(foo|bar)/ when capture not needed

PetaPerl-Specific Features

Bytecode Compilation

Regex patterns compile to bytecode for efficient execution. PetaPerl uses:

  • Bitmap character classes for fast ASCII matching
  • Literal prefix extraction to skip impossible positions
  • Anchored detection to avoid unnecessary scanning

Possessive Quantifiers

Possessive quantifiers prevent backtracking entirely (more efficient than atomic groups for simple cases):

/a++/           # 1 or more 'a', no backtracking
/a*+/           # 0 or more 'a', no backtracking
/a?+/           # 0 or 1 'a', no backtracking

Embedded Code

/pattern(?{ code })/    # Execute code during match
/(??{ code })/          # Postponed regex (code returns pattern)

(?{code}) executes Perl code at the point in the pattern where it appears. The code can access $1, $2, etc. from captures made so far.

Current Limitations

PetaPerl’s regex engine passes 99.3% of perl5’s re_tests suite (1959/1972 tests). Remaining gaps:

  • Self-referential captures — patterns like (a\1) (3 tests)
  • local in code blocks, as in (?{ local $x = ... }) (2 tests)
  • Multi-character case folding — Unicode chars that fold to multiple chars (2 tests)
  • Branch reset backreferences — complex (?|...) with backrefs (5 tests)
  • String interpolation edge case — 1 test

Unicode property support (\p{Letter}, \p{Digit}, etc.) is fully implemented for standard Unicode categories.
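
A short sketch of the two properties named above, written against standard Perl 5 behavior. The sample string uses \x{...} escapes so the code does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

# "café", three Greek letters (alpha, beta, gamma), then a number.
my $s = "caf\x{e9} \x{3b1}\x{3b2}\x{3b3} 42";

my @words   = $s =~ /(\p{Letter}+)/g;   # runs of letters in any script
my @numbers = $s =~ /(\p{Digit}+)/g;    # runs of decimal digits

print scalar(@words), " words, ", scalar(@numbers), " number\n";   # 2 words, 1 number
print "second word has ", length($words[1]), " characters\n";      # second word has 3 characters
```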

Examples

Email Validation

my $email_re = qr/^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$/;
if ($email =~ $email_re) {
    print "Valid email\n";
}

URL Parsing

my ($protocol, $host, $path) = $url =~
    m{^(https?)://([^/]+)(/.*)$};
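
The pattern above requires a path, so a URL such as https://example.com does not match and all three variables stay undefined. One possible variant makes the path optional and reports failure explicitly; parse_url is a hypothetical helper name.

```perl
use strict;
use warnings;

# Returns (protocol, host, path), or an empty list if the URL does not match.
sub parse_url {
    my ($url) = @_;
    my ($protocol, $host, $path) = $url =~ m{^(https?)://([^/]+)(/.*)?$}
        or return;
    return ($protocol, $host, $path // '/');   # default the path to "/"
}

my @full = parse_url("http://example.com/a/b");
my @bare = parse_url("https://example.com");
my @bad  = parse_url("ftp://example.com/");

print "@full\n";                      # http example.com /a/b
print "@bare\n";                      # https example.com /
print scalar(@bad), " fields\n";      # 0 fields
```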

Log File Parsing

while ($line =~ /\[(\d{4}-\d{2}-\d{2})\] (\w+): (.+)/g) {
    my ($date, $level, $msg) = ($1, $2, $3);
    # Process log entry
}

String Cleanup

# Collapse runs of whitespace into a single space
$text =~ s/\s+/ /g;

# Remove leading/trailing whitespace
$text =~ s/^\s+|\s+$//g;

# Or using two substitutions
$text =~ s/^\s+//;
$text =~ s/\s+$//;

Template Substitution

my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old.";
$template =~ s/\{\{(\w+)\}\}/$vars{$1}/ge;
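
If the template names a key that is missing from %vars, the substitution above inserts an empty string (and warns under warnings). One possible variant, sketched here with an invented {{city}} placeholder, leaves unknown placeholders in place:

```perl
use strict;
use warnings;

my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old in {{city}}.";

# Replace known keys; put unknown placeholders back unchanged.
$template =~ s/\{\{(\w+)\}\}/exists $vars{$1} ? $vars{$1} : '{{' . $1 . '}}'/ge;

print "$template\n";   # Hello John, you are 30 years old in {{city}}.
```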

See Also

  • perlop - Binding operators =~ and !~
  • perlvar - Special variables like $&, $1, etc.