Regular Expressions#

PetaPerl implements a Perl 5-compatible regex engine that supports nearly the full range of Perl regex features. The known gaps are listed under Current Limitations.

Pattern Syntax#

Literals#

/hello/         # Match literal "hello"
/foo bar/       # Match "foo bar"

Metacharacters#

Char	Meaning
`.`	Any character except newline
`^`	Start of string (start of each line with `/m`)
`$`	End of string or before a final newline (end of each line with `/m`)
`\A`	Start of string (absolute)
`\z`	End of string (absolute)
`\Z`	End of string or before final newline
`\b`	Word boundary
`\B`	Not word boundary
`\G`	Position where the previous `/g` match on the string ended

/^start/        # Must be at start
/end$/          # Must be at end
/\bword\b/      # Whole word match
/\Abegin/       # Absolute start
/finish\z/      # Absolute end
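The difference between `$`, `\Z`, and `\z` only shows up when the string ends in a newline. The following is a minimal sketch assuming standard Perl 5 semantics; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = "finish\n";   # hypothetical input with a trailing newline

# $ and \Z both tolerate one trailing newline after the match.
my $dollar = $line =~ /finish$/  ? 1 : 0;
my $big_z  = $line =~ /finish\Z/ ? 1 : 0;

# \z matches only at the very end, so the trailing newline blocks it.
my $small_z = $line =~ /finish\z/ ? 1 : 0;

print "\$: $dollar, \\Z: $big_z, \\z: $small_z\n";
```

Prefer `\z` when a trailing newline must not be accepted, for example when validating input.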

Character Classes#

[abc]           # Match a, b, or c
[^abc]          # Match anything except a, b, c
[a-z]           # Match lowercase letter
[A-Z0-9]        # Match uppercase or digit
[a-zA-Z_]       # Match a letter or underscore

Predefined Character Classes#

Class	Matches	Negated
`\d`	Digit [0-9]	`\D` (non-digit)
`\w`	Word char [a-zA-Z0-9_]	`\W` (non-word)
`\s`	Whitespace [ \t\n\r\f]	`\S` (non-whitespace)
`\h`	Horizontal whitespace	`\H`
`\v`	Vertical whitespace	`\V`

/\d+/           # One or more digits
/\w+/           # One or more word chars
/\s*/           # Zero or more whitespace characters

POSIX Character Classes#

[:alnum:]       # Alphanumeric [a-zA-Z0-9]
[:alpha:]       # Alphabetic [a-zA-Z]
[:ascii:]       # ASCII characters [0-127]
[:blank:]       # Space and tab
[:cntrl:]       # Control characters
[:digit:]       # Digits [0-9]
[:graph:]       # Visible characters (not space)
[:lower:]       # Lowercase letters
[:print:]       # Printable characters
[:punct:]       # Punctuation
[:space:]       # Whitespace
[:upper:]       # Uppercase letters
[:word:]        # Word characters [a-zA-Z0-9_]
[:xdigit:]      # Hex digits [0-9A-Fa-f]

Usage: POSIX classes are only valid inside a bracketed character class, for example [[:digit:]] or [[:alpha:][:digit:]].
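As an illustration of combining POSIX classes with other class members, here is a small sketch. The helper name `is_identifier` and its rule (letter or underscore first, then letters, digits, or underscores) are assumptions for the example, not part of PetaPerl.

```perl
use strict;
use warnings;

# Hypothetical identifier check built from POSIX classes.
sub is_identifier {
    my ($s) = @_;
    # [[:alpha:]_] : one letter or underscore
    # [[:alnum:]_]*: any number of letters, digits, or underscores
    return $s =~ /\A[[:alpha:]_][[:alnum:]_]*\z/ ? 1 : 0;
}

print is_identifier("foo_1"), "\n";   # accepted
print is_identifier("1foo"),  "\n";   # rejected: starts with a digit
```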

Quantifiers#

Quantifier	Meaning	Greedy	Non-greedy	Possessive
`*`	0 or more	Yes	`*?`	`*+`
`+`	1 or more	Yes	`+?`	`++`
`?`	0 or 1	Yes	`??`	`?+`
`{n}`	Exactly n	Yes	N/A	N/A
`{n,}`	n or more	Yes	`{n,}?`	`{n,}+`
`{n,m}`	n to m	Yes	`{n,m}?`	`{n,m}+`

/a*/            # 0 or more 'a' (greedy)
/a*?/           # 0 or more 'a' (non-greedy)
/a+/            # 1 or more 'a'
/a?/            # 0 or 1 'a'
/a{3}/          # Exactly 3 'a'
/a{3,}/         # 3 or more 'a'
/a{3,5}/        # 3 to 5 'a'

Greedy vs Non-greedy: Greedy quantifiers match as much as possible; non-greedy quantifiers match as little as possible. In both cases the rest of the pattern must still be able to match.

# Given: "foo123bar"
/\d+/           # Matches "123" (greedy)
/\d+?/          # Matches "1" (non-greedy: nothing follows in the pattern, so one digit is enough)
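The difference matters most when something follows the quantifier. A short sketch with a made-up HTML fragment:

```perl
use strict;
use warnings;

my $html = "<b>one</b> and <b>two</b>";   # hypothetical input

# Greedy: .+ runs to the end, then backs up to the LAST </b>.
my ($greedy) = $html =~ /<b>(.+)<\/b>/;

# Non-greedy: .+? grows one character at a time and stops at the FIRST </b>.
my ($lazy) = $html =~ /<b>(.+?)<\/b>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```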

Groups and Captures#

Capturing Groups#

/(foo)/         # Capture "foo" in $1
/(foo)(bar)/    # Capture in $1 and $2

Access captures with $1, $2, etc. The @- and @+ arrays hold the start and end offsets of each capture.
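The offset arrays can be used to locate each group inside the original string. The sketch below assumes standard Perl 5 behavior for `@-` and `@+`; the helper name `group_offsets` is hypothetical.

```perl
use strict;
use warnings;

# Return [start, end] offset pairs for groups 1 and 2, or an empty list.
sub group_offsets {
    my ($str) = @_;
    return unless $str =~ /(foo)(bar)/;
    # Copy the offsets right away; the next successful match overwrites them.
    return ([$-[1], $+[1]], [$-[2], $+[2]]);
}

my ($g1, $g2) = group_offsets("xxfoobar");
print "group 1: $g1->[0] to $g1->[1]\n";
print "group 2: $g2->[0] to $g2->[1]\n";
```

The end offset is exclusive, so `substr($str, $start, $end - $start)` recovers the captured text.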

Non-Capturing Groups#

/(?:foo)/       # Group but don't capture

Use a non-capturing group when you need grouping (for alternation or a quantifier) but not the captured text.

Named Captures#

/(?<name>\w+)/  # Named capture "name"
/(?'name'\w+)/  # Alternative syntax

Access the captured text with $+{name}, an element of the %+ hash.
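A brief sketch of named captures in use. The date format and variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $date = "2024-03-15";   # hypothetical input
my %parts;

if ($date =~ /\A(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})\z/) {
    # Copy %+ before running another match, which would reset it.
    %parts = %+;
}

print "year=$parts{year} month=$parts{month} day=$parts{day}\n";
```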

Alternation#

/foo|bar/       # Match "foo" or "bar"
/(red|green|blue)/ # Capture color

Anchors and Assertions#

Zero-Width Assertions#

Assertion	Meaning
`(?=pattern)`	Positive lookahead
`(?!pattern)`	Negative lookahead
`(?<=pattern)`	Positive lookbehind
`(?<!pattern)`	Negative lookbehind

/foo(?=bar)/    # "foo" followed by "bar" (bar not consumed)
/foo(?!bar)/    # "foo" not followed by "bar"
/(?<=foo)bar/   # "bar" preceded by "foo"
/(?<!foo)bar/   # "bar" not preceded by "foo"
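Because lookarounds consume no characters, they can pick out positions rather than text. One common use is inserting thousands separators; the sketch below is an illustration with a hypothetical helper name, `commify`.

```perl
use strict;
use warnings;

sub commify {
    my ($n) = @_;
    # Match the empty position that has a digit before it and a
    # multiple of three digits after it, up to the end of the string.
    $n =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;
    return $n;
}

print commify("1234567"), "\n";
print commify("123"), "\n";
```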

Atomic Groups#

/(?>pattern)/   # Atomic group (no backtracking)

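Once the group has matched, the engine does not backtrack into it to try a different match for its contents. Atomic groups are mainly a performance optimization, but they can also change which strings match.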
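To make the effect of an atomic group concrete, the sketch below compares an ordinary and an atomic version of the same pattern, assuming standard Perl 5 semantics. The test strings are made up.

```perl
use strict;
use warnings;

# Ordinary group: \d+ first takes "12", then gives back "2" so \w+ can match.
my $plain = "12" =~ /\A\d+\w+\z/ ? 1 : 0;

# Atomic group: \d+ keeps "12", nothing is left for \w+, so the match fails.
my $atomic = "12" =~ /\A(?>\d+)\w+\z/ ? 1 : 0;

# When a non-digit word character follows, both versions match.
my $atomic_ok = "12ab" =~ /\A(?>\d+)\w+\z/ ? 1 : 0;

print "plain=$plain atomic=$atomic atomic_ok=$atomic_ok\n";
```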

Backreferences#

/(foo)\1/       # Match "foofoo" - \1 references first capture
/(['"]).*?\1/   # Match quoted string (same quote type)

Named Backreferences#

/(?<tag>\w+)...\k<tag>/ # Named backreference
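A typical use of a named backreference is requiring a closing tag to repeat the opening tag's name. This is a simplified sketch (it ignores attributes and nesting), and `parse_element` is a hypothetical helper.

```perl
use strict;
use warnings;

# Return (tag, body) for a simple <tag>body</tag> element, or an empty list.
sub parse_element {
    my ($text) = @_;
    # \k<tag> must match exactly what the "tag" group captured.
    return unless $text =~ m{<(?<tag>\w+)>(?<body>.*?)</\k<tag>>};
    return ($+{tag}, $+{body});
}

my ($tag, $body) = parse_element("<em>hi</em>");
print "tag=$tag body=$body\n";
```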

Conditionals#

/(?(condition)yes|no)/ # If condition matches, try "yes", else try "no"

The condition can be one of the following:

Capture group number: (?(1)yes|no) - if group 1 matched
Named capture: (?(<name>)yes|no) - if named group matched
Lookahead: (?(?=test)yes|no) - if lookahead succeeds
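As an illustration of a capture-group condition, the sketch below accepts a number that is either bare or fully parenthesized. The helper name `balanced_number` is an assumption for the example.

```perl
use strict;
use warnings;

sub balanced_number {
    my ($s) = @_;
    # Group 1 optionally captures "(".
    # (?(1)\)) requires ")" only if group 1 took part in the match;
    # there is no "no" branch, so otherwise it matches the empty string.
    return $s =~ /\A(\()?\d+(?(1)\))\z/ ? 1 : 0;
}

print balanced_number("42"),   "\n";
print balanced_number("(42)"), "\n";
print balanced_number("(42"),  "\n";
```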

Modifiers#

Modifiers change regex behavior. They are placed after the closing delimiter:

Modifier	Meaning
`i`	Case-insensitive
`m`	Multiline (^/$ match line boundaries)
`s`	Single-line (. matches newline)
`x`	Extended (ignore whitespace, allow comments)
`g`	Global (find all matches)
`c`	Keep the current match position after a failed match (used with `g`)
`o`	Compile once (legacy, not needed in PetaPerl)
`e`	Evaluate replacement as code (in s///)

/pattern/i      # Case-insensitive
/pattern/ms     # Multiline + single-line
/pattern/x      # Extended (readable)
/pattern/g      # Global matching

Examples#

# Case-insensitive
if ($str =~ /hello/i) { ... }

# Multiline: ^ and $ match line starts/ends
while ($text =~ /^Line: (.+)$/mg) {
    print "Found: $1\n";
}

# Extended: whitespace and comments ignored
my $email_re = qr{
    (\w+)           # Username
    @               # At sign
    ([\w.]+)        # Domain
}x;

# Global: find all matches
my @words = $text =~ /\w+/g;
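The `g` and `c` modifiers combine with `\G` to build simple lexers: each pattern is tried at the current position, and a failed attempt leaves that position untouched. The sketch below assumes standard Perl 5 `pos()` semantics; the token names and the `tokenize` helper are hypothetical.

```perl
use strict;
use warnings;

sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;                      # start scanning at the beginning
    while (pos($src) < length $src) {
        # /c keeps pos() where it was when an alternative fails.
        if    ($src =~ /\G\s+/gc)       { next }
        elsif ($src =~ /\G(\d+)/gc)     { push @tokens, "NUM:$1" }
        elsif ($src =~ /\G([-+*\/])/gc) { push @tokens, "OP:$1" }
        else  { die "Unexpected character at offset " . pos($src) . "\n" }
    }
    return @tokens;
}

print join(" ", tokenize("12 + 3*4")), "\n";
```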

Matching and Substitution#

Match Operator#

$str =~ /pattern/       # True if matches
$str =~ /pattern/g      # Global: all matches in list context, next match in scalar context

In list context with captures:

my ($user, $domain) = $email =~ /(\w+)@([\w.]+)/;

In list context with global:

my @numbers = $text =~ /\d+/g;  # All numbers
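In scalar context, a `/g` match instead finds one match per evaluation and remembers where it stopped, which `pos()` reports. A short sketch contrasting the two contexts (the input string is made up):

```perl
use strict;
use warnings;

my $text = "a1 b22 c333";   # hypothetical input

# List context: every match is returned at once.
my @all = $text =~ /\d+/g;

# Scalar context: one match per iteration; pos() is the end offset.
my @ends;
while ($text =~ /\d+/g) {
    push @ends, pos($text);
}

print "matches: @all\n";
print "end offsets: @ends\n";
```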

Substitution#

$str =~ s/old/new/      # Replace first occurrence
$str =~ s/old/new/g     # Replace all occurrences
$str =~ s/old/new/i     # Case-insensitive replace
$str =~ s/old/new/gi    # Global + case-insensitive

Replacement with captures:

$str =~ s/(\w+)@(\w+)/$2\@$1/;  # Reverse user@domain

Evaluated replacement:

$str =~ s/(\d+)/$1 * 2/ge;      # Double all numbers (without g, only the first)
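With `/e` the replacement is an ordinary Perl expression, so function calls are allowed. The sketch below converts temperatures in a string; the helper name `to_fahrenheit` and the input format are assumptions for illustration.

```perl
use strict;
use warnings;

sub to_fahrenheit {
    my ($text) = @_;
    # The replacement part is code: its return value replaces the match.
    $text =~ s{(\d+)C\b}{ sprintf("%dF", $1 * 9 / 5 + 32) }ge;
    return $text;
}

print to_fahrenheit("20C and 100C"), "\n";
```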

Transliteration#

$str =~ tr/abc/xyz/     # Replace a→x, b→y, c→z
$str =~ y/abc/xyz/      # Same as tr
$str =~ tr/a-z/A-Z/     # Uppercase
$str =~ tr/ //d         # Delete spaces
$str =~ tr/a-z//c       # Count characters other than a-z (string unchanged)
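`tr` returns the number of characters it matched, which is what makes the counting idiom work. A small sketch with a made-up DNA string:

```perl
use strict;
use warnings;

my $dna = "AACGTTTA";   # hypothetical input

# Translate a copy, leaving the original untouched.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

# Empty replacement list without /d: nothing changes, but the
# return value is the number of matching characters.
my $count_a = ($dna =~ tr/A//);

print "complement: $complement\n";
print "A count: $count_a\n";
```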

Special Variables#

After a successful match:

Variable	Contains
`$&`	Entire matched string
`` $` ``	String before match
`$'`	String after match
`$1`, `$2`, …	Capture groups
`$+`	Text of the highest-numbered capture group that matched
`@+`	End offsets of the match and of each capture
`@-`	Start offsets of the match and of each capture
`%+`	Named captures

if ($str =~ /(foo)(bar)/) {
    print "Full match: $&\n";     # "foobar"
    print "Group 1: $1\n";        # "foo"
    print "Group 2: $2\n";        # "bar"
    print "Before: $`\n";
    print "After: $'\n";
}

Regex Compilation#

qr// Operator#

Compile regex for reuse:

my $word = qr/\w+/;
my $email = qr/\w+@\w+\.\w+/;

if ($str =~ $word) { ... }
if ($str =~ /$word\@$word/) { ... }  # Interpolate (escape @ so @$word is not read as an array)

Benefits:

Compile once, use many times
Readable regex composition
Performance optimization
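The composition benefit can be sketched as follows: a small `qr//` building block is interpolated into a larger pattern. The names `$octet`, `$ipv4`, and `is_ipv4` are hypothetical.

```perl
use strict;
use warnings;

# One decimal octet, 0 to 255.
my $octet = qr/(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)/;

# Four octets separated by dots, built from the piece above.
my $ipv4 = qr/\A$octet(?:\.$octet){3}\z/;

sub is_ipv4 {
    my ($s) = @_;
    return $s =~ $ipv4 ? 1 : 0;
}

print is_ipv4("192.168.1.1"), "\n";
print is_ipv4("256.1.1.1"), "\n";
```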

Performance Considerations#

Anchored Patterns#

Patterns anchored with ^ (without the m modifier) or \A are faster because the engine only has to try the match at one position:

/^pattern/      # Fast: only checks start
/pattern/       # Slower: scans entire string

Atomic Groups#

Use atomic groups (?>...) to prevent backtracking:

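# Slow: on failure, \d+ gives digits back one at a time
/\d+\w+/

# Fast: \d+ never gives digits back
# (not equivalent: this cannot match an all-digit string such as "12")
/(?>\d+)\w+/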

Non-Capturing Groups#

Use (?:...) when captures aren’t needed:

/(?:foo|bar)/   # Faster than /(foo|bar)/ when capture not needed

PetaPerl-Specific Features#

Bytecode Compilation#

Regex patterns compile to bytecode for efficient execution. PetaPerl uses:

Bitmap character classes for fast ASCII matching
Literal prefix extraction to skip impossible positions
Anchored detection to avoid unnecessary scanning

Possessive Quantifiers#

Possessive quantifiers never give back what they have matched, so the engine does not backtrack into them (more efficient than atomic groups for simple cases):

/a++/           # 1 or more 'a', no backtracking
/a*+/           # 0 or more 'a', no backtracking
/a?+/           # 0 or 1 'a', no backtracking
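Like atomic groups, possessive quantifiers can change the result, not just the speed. A minimal sketch assuming standard Perl 5 semantics, with made-up inputs:

```perl
use strict;
use warnings;

# Greedy a+ takes "aaa", then gives one "a" back for the final a.
my $greedy = "aaa" =~ /\Aa+a\z/ ? 1 : 0;

# Possessive a++ keeps all three, so the final a can never match.
my $possessive = "aaa" =~ /\Aa++a\z/ ? 1 : 0;

# Typical use: the quantified class cannot match what follows it,
# so giving characters back could never help.
my $quoted       = '"abc"' =~ /\A"[^"]*+"\z/ ? 1 : 0;
my $unterminated = '"abc'  =~ /\A"[^"]*+"\z/ ? 1 : 0;

print "greedy=$greedy possessive=$possessive ",
      "quoted=$quoted unterminated=$unterminated\n";
```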

Embedded Code#

/pattern(?{ code })/    # Execute code during match
/(??{ code })/          # Postponed regex (code returns pattern)

(?{code}) executes Perl code at the point in the pattern where it appears. The code can access $1, $2, etc. from captures made so far.
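As a rough illustration of code running at a fixed point in the pattern, the sketch below counts how many times a loop body matches. It assumes Perl 5.18 or later behavior for lexical variables inside `(?{ ... })`; the helper name `count_as` is hypothetical.

```perl
use strict;
use warnings;

sub count_as {
    my ($s) = @_;
    my $count = 0;
    # The code block runs each time the engine passes it,
    # here once per "a" matched by the loop.
    $s =~ /\A(?:a(?{ $count++ }))*\z/;
    return $count;
}

print count_as("aaa"), "\n";
```

Side effects in a code block are not undone when the engine backtracks, so a counter like this is only reliable for patterns that match without backtracking over the block.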

Current Limitations#

PetaPerl’s regex engine passes 99.3% of perl5’s re_tests suite (1959/1972 tests). Remaining gaps:

Self-referential captures — patterns like (a\1) (3 tests)
local in code blocks — (?{ local $x = ... }) (2 tests)
Multi-character case folding — Unicode chars that fold to multiple chars (2 tests)
Branch reset backreferences — complex (?|...) with backrefs (5 tests)
String interpolation edge case (1 test)

Unicode property support (\p{Letter}, \p{Digit}, etc.) is fully implemented for standard Unicode categories.
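A short sketch of property matching, assuming standard Perl 5 semantics for `\p{...}`. Characters are written as `\x{...}` escapes so the example does not depend on source-file encoding.

```perl
use strict;
use warnings;

my $greek = "\x{3B1}\x{3B2}\x{3B3}";   # Greek alpha, beta, gamma

# \p{Letter} matches letters from any script.
my $letters_only = $greek    =~ /\A\p{Letter}+\z/ ? 1 : 0;
my $mixed        = "abc123"  =~ /\A\p{Letter}+\z/ ? 1 : 0;

# \p{Digit} matches decimal digits from any script:
# here ASCII "1" and ARABIC-INDIC DIGIT THREE (U+0663).
my @digits = "a1\x{663}b" =~ /\p{Digit}/g;

print "letters_only=$letters_only mixed=$mixed digits=", scalar(@digits), "\n";
```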

Examples#

Email Validation#

my $email_re = qr/^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$/;
if ($email =~ $email_re) {
    print "Valid email\n";
}

URL Parsing#

my ($protocol, $host, $path) = $url =~
    m{^(https?)://([^/]+)(/.*)$};

Log File Parsing#

while ($line =~ /\[(\d{4}-\d{2}-\d{2})\] (\w+): (.+)/g) {
    my ($date, $level, $msg) = ($1, $2, $3);
    # Process log entry
}

String Cleanup#

# Collapse each run of whitespace into a single space
$text =~ s/\s+/ /g;

# Remove leading/trailing whitespace
$text =~ s/^\s+|\s+$//g;

# Or using two substitutions
$text =~ s/^\s+//;
$text =~ s/\s+$//;

Template Substitution#

my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old.";
$template =~ s/\{\{(\w+)\}\}/$vars{$1}/ge;
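With the substitution above, a placeholder that has no entry in `%vars` is replaced by an undefined value. One possible way to leave unknown placeholders untouched is sketched below; the `render` helper is hypothetical.

```perl
use strict;
use warnings;

sub render {
    my ($template, %vars) = @_;
    # Known keys are substituted; unknown ones keep their original
    # text, which $& (the whole match) provides.
    $template =~ s/\{\{(\w+)\}\}/ exists $vars{$1} ? $vars{$1} : $& /ge;
    return $template;
}

print render('Hello {{name}}, {{missing}}!', name => "John"), "\n";
```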
