Regular Expressions#

PetaPerl implements a Perl 5-compatible regex engine that supports nearly all Perl regex features; the few known gaps are listed under Current Limitations.

Pattern Syntax#

Literals#

/hello/         # Match literal "hello"
/foo bar/       # Match "foo bar"

Metacharacters#

Char    Meaning
----    -------
.       Any character except newline
^       Start of string (or of each line with /m)
$       End of string or before a final newline (end of each line with /m)
\A      Start of string (absolute)
\z      End of string (absolute)
\Z      End of string or before final newline
\b      Word boundary
\B      Not word boundary
\G      Position where the previous /g match ended

/^start/        # Must be at start
/end$/          # Must be at end
/\bword\b/      # Whole word match
/\Abegin/       # Absolute start
/finish\z/      # Absolute end
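The difference between $, \Z and \z only shows up when the string ends in a newline. The following is a minimal sketch in standard Perl 5 (the variable names are illustrative, and it assumes PetaPerl follows Perl 5 here, as the compatibility claim above suggests):

```perl
use strict;
use warnings;

my $str = "end\n";    # note the trailing newline

my $dollar  = ($str =~ /end$/)  ? 1 : 0;   # $  tolerates one trailing newline
my $big_z   = ($str =~ /end\Z/) ? 1 : 0;   # \Z does the same, even under /m
my $small_z = ($str =~ /end\z/) ? 1 : 0;   # \z insists on the true end of string

print "dollar=$dollar Z=$big_z z=$small_z\n";   # prints "dollar=1 Z=1 z=0"
```

Prefer \z when validating input, so that a stray trailing newline is not silently accepted.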

Character Classes#

[abc]           # Match a, b, or c
[^abc]          # Match anything except a, b, c
[a-z]           # Match lowercase letter
[A-Z0-9]        # Match uppercase or digit
[a-zA-Z_]       # Match letter or underscore

Predefined Character Classes#

Class   Matches                     Negated
-----   -------                     -------
\d      Digit [0-9]                 \D (non-digit)
\w      Word char [a-zA-Z0-9_]      \W (non-word)
\s      Whitespace [ \t\n\r\f]      \S (non-whitespace)
\h      Horizontal whitespace       \H
\v      Vertical whitespace         \V

/\d+/           # One or more digits
/\w+/           # One or more word chars
/\s*/           # Zero or more whitespace characters

POSIX Character Classes#

[:alnum:]       # Alphanumeric [a-zA-Z0-9]
[:alpha:]       # Alphabetic [a-zA-Z]
[:ascii:]       # ASCII characters [0-127]
[:blank:]       # Space and tab
[:cntrl:]       # Control characters
[:digit:]       # Digits [0-9]
[:graph:]       # Visible characters (not space)
[:lower:]       # Lowercase letters
[:print:]       # Printable characters
[:punct:]       # Punctuation
[:space:]       # Whitespace
[:upper:]       # Uppercase letters
[:word:]        # Word characters [a-zA-Z0-9_]
[:xdigit:]      # Hex digits [0-9A-Fa-f]

Usage: [[:digit:]] or [[:alpha:][:digit:]]

Quantifiers#

Quantifier  Meaning     Greedy  Non-greedy  Possessive
----------  -------     ------  ----------  ----------
*           0 or more   Yes     *?          *+
+           1 or more   Yes     +?          ++
?           0 or 1      Yes     ??          ?+
{n}         Exactly n   Yes     N/A         N/A
{n,}        n or more   Yes     {n,}?       N/A
{n,m}       n to m      Yes     {n,m}?      N/A

/a*/            # 0 or more 'a' (greedy)
/a*?/           # 0 or more 'a' (non-greedy)
/a+/            # 1 or more 'a'
/a?/            # 0 or 1 'a'
/a{3}/          # Exactly 3 'a'
/a{3,}/         # 3 or more 'a'
/a{3,5}/        # 3 to 5 'a'

Greedy vs Non-greedy: Greedy quantifiers match as much as possible; non-greedy quantifiers match as little as possible while still allowing the overall pattern to succeed.

# Given: "foo123bar"
/\d+/           # Matches "123" (greedy)
/\d+?/          # Matches "1" (non-greedy, but entire pattern must match)
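The distinction matters most when a delimiter can occur more than once in the input. A small sketch in standard Perl 5 (the sample string is made up for illustration):

```perl
use strict;
use warnings;

my $html = "<b>one</b> and <b>two</b>";

my ($greedy) = $html =~ m{<b>(.+)</b>};    # .+ runs on to the LAST </b>
my ($lazy)   = $html =~ m{<b>(.+?)</b>};   # .+? stops at the FIRST </b>

print "greedy: $greedy\n";   # greedy: one</b> and <b>two
print "lazy:   $lazy\n";     # lazy:   one
```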

Groups and Captures#

Capturing Groups#

/(foo)/         # Capture "foo" in $1
/(foo)(bar)/    # Capture in $1 and $2

Access captures with $1, $2, etc. The @- and @+ arrays hold the start and end offsets of each group, not the captured text.

Non-Capturing Groups#

/(?:foo)/       # Group but don't capture

Use this when you need grouping but not the captured text or its overhead.

Named Captures#

/(?<name>\w+)/  # Named capture "name"
/(?'name'\w+)/  # Alternative syntax

Access with $+{name}, an element of the %+ hash.
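As an illustration, named captures can make a date pattern self-documenting. This is a sketch in standard Perl 5; the group names year, month and day are chosen for the example:

```perl
use strict;
use warnings;

my $date = "2024-03-15";
my %parts;

if ($date =~ /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})\z/) {
    %parts = %+;    # copy now: %+ changes with the next successful match
    print "year=$parts{year} month=$parts{month} day=$parts{day}\n";
}
```

Copying %+ into an ordinary hash keeps the values available after later matches.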

Alternation#

/foo|bar/       # Match "foo" or "bar"
/(red|green|blue)/ # Capture color

Anchors and Assertions#

Zero-Width Assertions#

Assertion       Meaning
---------       -------
(?=pattern)     Positive lookahead
(?!pattern)     Negative lookahead
(?<=pattern)    Positive lookbehind
(?<!pattern)    Negative lookbehind

/foo(?=bar)/    # "foo" followed by "bar" (bar not consumed)
/foo(?!bar)/    # "foo" not followed by "bar"
/(?<=foo)bar/   # "bar" preceded by "foo"
/(?<!foo)bar/   # "bar" not preceded by "foo"
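Because lookarounds consume no characters, they can be combined to match a position rather than text. A common use, sketched here in standard Perl 5, is inserting thousands separators:

```perl
use strict;
use warnings;

my $n = "1234567";

# Match every position that has a digit before it and a whole
# number of three-digit groups between it and the end of the string.
(my $pretty = $n) =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;

print "$pretty\n";   # 1,234,567
```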

Atomic Groups#

/(?>pattern)/   # Atomic group (no backtracking)

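Once the group has matched, the engine never backtracks into it to try a shorter or different match. This is mainly a performance optimization, but it can also make a pattern fail where the non-atomic version would succeed.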

Backreferences#

/(foo)\1/       # Match "foofoo" - \1 references first capture
/(['"]).*?\1/   # Match quoted string (same quote type)

Named Backreferences#

/(?<tag>\w+)...\k<tag>/ # Named backreference
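A typical use of backreferences is spotting accidentally repeated words. The sketch below uses a named group, word, picked for this example, with standard Perl 5 syntax:

```perl
use strict;
use warnings;

my $text = "this is is a test test of the the parser";
my @doubled;

# \k<word> must match exactly the text that (?<word>...) captured.
while ($text =~ /\b(?<word>\w+)\s+\k<word>\b/g) {
    push @doubled, $+{word};
}

print "@doubled\n";   # is test the
```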

Conditionals#

/(?(condition)yes|no)/ # If condition matches, try "yes", else try "no"

Conditions can be:

  • Capture group number: (?(1)yes|no) - if group 1 matched

  • Named capture: (?(<name>)yes|no) - if named group matched

  • Lookahead: (?(?=test)yes|no) - if lookahead succeeds
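As a sketch of the capture-group form, the pattern below accepts a number with or without surrounding parentheses, but only if they are balanced. It uses standard Perl 5 syntax; the test strings are made up for illustration:

```perl
use strict;
use warnings;

# Group 1 is the optional "(". If it matched, a ")" is required;
# the "no" branch is omitted, so otherwise nothing more is needed.
my $re = qr/^(\()?\d+(?(1)\))\z/;

my %ok;
for my $s ("123", "(123)", "(123", "123)") {
    $ok{$s} = $s =~ $re ? 1 : 0;
    printf "%-6s %s\n", $s, $ok{$s} ? "match" : "no match";
}
```

Only "123" and "(123)" match; the two unbalanced forms are rejected.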

Modifiers#

Modifiers change regex behavior. They are written after the closing delimiter:

Modifier    Meaning
--------    -------
i           Case-insensitive
m           Multiline (^/$ match line boundaries)
s           Single-line (. matches newline)
x           Extended (ignore whitespace, allow comments)
g           Global (find all matches)
c           Keep the current match position after a failed match (used with g)
o           Compile once (legacy, not needed in PetaPerl)
e           Evaluate replacement as code (in s///)

/pattern/i      # Case-insensitive
/pattern/ms     # Multiline + single-line
/pattern/x      # Extended (readable)
/pattern/g      # Global matching

Examples#

# Case-insensitive
if ($str =~ /hello/i) { ... }

# Multiline: ^ and $ match line starts/ends
while ($text =~ /^Line: (.+)$/mg) {
    print "Found: $1\n";
}

# Extended: whitespace and comments ignored
my $email_re = qr{
    (\w+)           # Username
    @               # At sign
    ([\w.]+)        # Domain
}x;

# Global: find all matches
my @words = $text =~ /\w+/g;
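The g and c modifiers combine with the \G anchor to build simple tokenizers: each attempt starts exactly where the previous match ended, and a failed attempt leaves that position untouched. A minimal sketch in standard Perl 5 follows; the token names NUM, ID and OP are invented for this example:

```perl
use strict;
use warnings;

my $src = "x = 42 + y";
my @tokens;

pos($src) = 0;                       # start scanning at offset 0
while (pos($src) < length $src) {
    if    ($src =~ /\G\s+/gc)             { next }                     # skip whitespace
    elsif ($src =~ /\G(\d+)/gc)           { push @tokens, "NUM($1)" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "ID($1)" }
    elsif ($src =~ /\G([=+])/gc)          { push @tokens, "OP($1)" }
    else  { die "Unexpected character at offset " . pos($src) . "\n" }
}

print "@tokens\n";   # ID(x) OP(=) NUM(42) OP(+) ID(y)
```

Without /c, the first failed alternative would reset pos($src) and the loop would restart from the beginning of the string.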

Matching and Substitution#

Match Operator#

$str =~ /pattern/       # True if matches
$str =~ /pattern/g      # Global: returns all matches in list context

In list context with captures:

my ($user, $domain) = $email =~ /(\w+)@([\w.]+)/;

In list context with global:

my @numbers = $text =~ /\d+/g;  # All numbers

Substitution#

$str =~ s/old/new/      # Replace first occurrence
$str =~ s/old/new/g     # Replace all occurrences
$str =~ s/old/new/i     # Case-insensitive replace
$str =~ s/old/new/gi    # Global + case-insensitive

Replacement with captures:

$str =~ s/(\w+)@(\w+)/$2\@$1/;  # Reverse user@domain

Evaluated replacement:

$str =~ s/(\d+)/$1 * 2/ge;      # Double all numbers (/g for every match, /e to evaluate)

Transliteration#

$str =~ tr/abc/xyz/     # Replace a→x, b→y, c→z
$str =~ y/abc/xyz/      # Same as tr
$str =~ tr/a-z/A-Z/     # Uppercase
$str =~ tr/ //d         # Delete spaces
$str =~ tr/a-z//c       # Count non-lowercase
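tr returns the number of characters it matched, and with an empty replacement list and no /d it leaves the string unchanged, which makes it a convenient counter. A short sketch in standard Perl 5 (the variable names are illustrative):

```perl
use strict;
use warnings;

my $dna = "AACGTTTA";

# Empty replacement, no /d: counts the characters and modifies nothing.
my $a_count = ($dna =~ tr/A//);

# Copy first, then transliterate the copy, to keep the original intact.
(my $rna = $dna) =~ tr/T/U/;

print "$a_count $rna $dna\n";   # 3 AACGUUUA AACGTTTA
```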

Special Variables#

After a successful match:

Variable    Contains
--------    --------
$&          Entire matched string
$`          String before match
$'          String after match
$1, $2, …   Capture groups
$+          Last matched capture
@+          End positions of captures
@-          Start positions of captures
%+          Named captures

if ($str =~ /(foo)(bar)/) {
    print "Full match: $&\n";     # "foobar"
    print "Group 1: $1\n";        # "foo"
    print "Group 2: $2\n";        # "bar"
    print "Before: $`\n";
    print "After: $'\n";
}
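The offset arrays are useful when the position of a match matters more than its text. The sketch below, in standard Perl 5, rebuilds a capture from @- and @+ with substr; the sample string is made up:

```perl
use strict;
use warnings;

my $str = "say foobar now";
my ($start, $end, $group2);

if ($str =~ /(foo)(bar)/) {
    # Index 0 describes the whole match; index n describes group n.
    ($start, $end) = ($-[0], $+[0]);
    $group2 = substr($str, $-[2], $+[2] - $-[2]);
    printf "match spans offsets %d..%d, group 2 is '%s'\n", $start, $end, $group2;
}
```

Here the match spans offsets 4 to 10, and group 2 is recovered as "bar".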

Regex Compilation#

qr// Operator#

Compile regex for reuse:

my $word = qr/\w+/;
my $email = qr/\w+@\w+\.\w+/;

if ($str =~ $word) { ... }
if ($str =~ /$word\@$word/) { ... }  # Interpolate; \@ keeps @$word from being parsed as an array

Benefits:

  • Compile once, use many times

  • Readable regex composition

  • Performance optimization
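To illustrate composition, a larger pattern can be assembled from a small compiled piece. This is a sketch in standard Perl 5; $octet and $ipv4 are names chosen for the example:

```perl
use strict;
use warnings;

# One decimal octet, 0..255.
my $octet = qr/(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)/;

# Four octets separated by dots, anchored at both ends.
my $ipv4 = qr/^$octet(?:\.$octet){3}\z/;

for my $addr ("192.168.0.1", "256.1.1.1", "1.2.3") {
    printf "%-12s %s\n", $addr, $addr =~ $ipv4 ? "valid" : "invalid";
}
```

Only the first address is reported as valid; 256 is out of range and 1.2.3 has too few octets.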

Performance Considerations#

Anchored Patterns#

Patterns anchored with ^ or \A are typically faster, because the engine only has to attempt a match at the start of the string:

/^pattern/      # Fast: only checks start
/pattern/       # Slower: scans entire string

Atomic Groups#

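Use atomic groups (?>...) to prevent backtracking. The two patterns below are not interchangeable: the atomic version also rejects inputs, such as an all-digit string, that the plain version accepts by backtracking: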

# Slow: backtracks on failure
/\d+\w+/

# Fast: no backtracking in \d+
/(?>\d+)\w+/
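An atomic group changes what a pattern can match, not only how quickly it fails. A small sketch in standard Perl 5, using an all-digit input chosen to expose the difference:

```perl
use strict;
use warnings;

my $s = "12345";

# Plain: \d+ first takes all five digits, then gives one back for \w+.
my $plain  = $s =~ /^\d+\w+\z/     ? 1 : 0;

# Atomic: the group keeps all five digits, so \w+ has nothing left.
my $atomic = $s =~ /^(?>\d+)\w+\z/ ? 1 : 0;

print "plain=$plain atomic=$atomic\n";   # plain=1 atomic=0
```

Switch to the atomic form only where the stricter behavior is acceptable.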

Non-Capturing Groups#

Use (?:...) when captures aren’t needed:

/(?:foo|bar)/   # Faster than /(foo|bar)/ when capture not needed

PetaPerl-Specific Features#

Bytecode Compilation#

Regex patterns compile to bytecode for efficient execution. PetaPerl uses:

  • Bitmap character classes for fast ASCII matching

  • Literal prefix extraction to skip impossible positions

  • Anchored detection to avoid unnecessary scanning

Possessive Quantifiers#

Possessive quantifiers never give back what they have matched, so a++ behaves like the atomic group (?>a+). For simple cases they are more efficient than atomic groups:

/a++/           # 1 or more 'a', no backtracking
/a*+/           # 0 or more 'a', no backtracking
/a?+/           # 0 or 1 'a', no backtracking
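The effect is easiest to see on a pattern that needs backtracking to succeed. A sketch in standard Perl 5 (the test strings are illustrative):

```perl
use strict;
use warnings;

# Greedy: a+ takes "aaa", then gives one "a" back for the final a.
my $greedy     = 'aaa' =~ /^a+a\z/  ? 1 : 0;

# Possessive: a++ takes "aaa" and never gives anything back.
my $possessive = 'aaa' =~ /^a++a\z/ ? 1 : 0;

# Typical use: a quoted string whose body cannot contain the quote.
my $quoted     = '"abc"' =~ /^"[^"]*+"\z/ ? 1 : 0;

print "greedy=$greedy possessive=$possessive quoted=$quoted\n";   # greedy=1 possessive=0 quoted=1
```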

Embedded Code#

/pattern(?{ code })/    # Execute code during match
/(??{ code })/          # Postponed regex (code returns pattern)

(?{code}) executes Perl code at the point in the pattern where it appears. The code can access $1, $2, etc. from captures made so far.
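As a minimal sketch of (?{ code }), the block below counts loop iterations during a match. It uses a package variable, $count, chosen for this example, and a pattern that matches without backtracking so the count is predictable; in general, code blocks also run on paths the engine later abandons.

```perl
use strict;
use warnings;

our $count = 0;

# The code block runs each time the engine passes it,
# here once per "a" consumed by the loop.
my $matched = "aaa" =~ /^(?:a(?{ $count++ }))*\z/ ? 1 : 0;

print "matched=$matched count=$count\n";   # matched=1 count=3
```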

Current Limitations#

PetaPerl’s regex engine passes 99.3% of perl5’s re_tests suite (1959/1972 tests). Remaining gaps:

  • Self-referential captures — patterns like (a\1) (3 tests)

  • local in code blocks, e.g. (?{ local $x = ... }) (2 tests)

  • Multi-character case folding — Unicode chars that fold to multiple chars (2 tests)

  • Branch reset backreferences — complex (?|...) with backrefs (5 tests)

  • String interpolation edge case — 1 test

Unicode property support (\p{Letter}, \p{Digit}, etc.) is fully implemented for standard Unicode categories.

Examples#

Email Validation#

my $email_re = qr/^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$/;
if ($email =~ $email_re) {
    print "Valid email\n";
}

URL Parsing#

my ($protocol, $host, $path) = $url =~
    m{^(https?)://([^/]+)(/.*)$};

Log File Parsing#

while ($line =~ /\[(\d{4}-\d{2}-\d{2})\] (\w+): (.+)/g) {
    my ($date, $level, $msg) = ($1, $2, $3);
    # Process log entry
}

String Cleanup#

# Remove multiple spaces
$text =~ s/\s+/ /g;

# Remove leading/trailing whitespace
$text =~ s/^\s+|\s+$//g;

# Or using two substitutions
$text =~ s/^\s+//;
$text =~ s/\s+$//;

Template Substitution#

my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old.";
$template =~ s/\{\{(\w+)\}\}/$vars{$1}/ge;
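With the substitution above, a placeholder that has no entry in %vars is replaced by an empty string (with a warning under use warnings). One possible variation, sketched in standard Perl 5, makes missing keys visible; the "[missing: ...]" marker is an arbitrary choice for this example:

```perl
use strict;
use warnings;

my %vars     = (name => "John");    # "age" is deliberately missing
my $template = "Hello {{name}}, you are {{age}} years old.";

# /e evaluates the replacement as code, so it can test for the key first.
(my $out = $template) =~
    s/\{\{(\w+)\}\}/exists $vars{$1} ? $vars{$1} : "[missing: $1]"/ge;

print "$out\n";   # Hello John, you are [missing: age] years old.
```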

See Also#

  • perlop - Binding operators =~ and !~

  • perlvar - Special variables like $&, $1, etc.