Regular Expressions#
PetaPerl implements a Perl 5-compatible regex engine that supports nearly all Perl regex features; the few known gaps are listed under Current Limitations.
Pattern Syntax#
Literals#
/hello/ # Match literal "hello"
/foo bar/ # Match "foo bar"
Metacharacters#
| Char | Meaning |
|---|---|
| `.` | Any character except newline |
| `^` | Start of string |
| `$` | End of string (or before a final newline) |
| `\A` | Start of string (absolute) |
| `\z` | End of string (absolute) |
| `\Z` | End of string or before final newline |
| `\b` | Word boundary |
| `\B` | Not word boundary |
| `\G` | Position of last match |
/^start/ # Must be at start
/end$/ # Must be at end
/\bword\b/ # Whole word match
/\Abegin/ # Absolute start
/finish\z/ # Absolute end
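The difference between the end anchors only shows up when the string ends in a newline. The following is a minimal sketch, assuming standard Perl 5 semantics; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = "end\n";   # a string with a trailing newline

# `$` matches at the very end or just before a final newline.
my $dollar  = ($line =~ /end$/)  ? 1 : 0;
# `\Z` behaves like `$` when /m is not in effect.
my $big_z   = ($line =~ /end\Z/) ? 1 : 0;
# `\z` matches only at the absolute end, so the newline blocks the match.
my $small_z = ($line =~ /end\z/) ? 1 : 0;

print "dollar=$dollar Z=$big_z z=$small_z\n";   # dollar=1 Z=1 z=0
```

Prefer `\z` when validating input that must not carry a trailing newline.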
Character Classes#
[abc] # Match a, b, or c
[^abc] # Match anything except a, b, c
[a-z] # Match lowercase letter
[A-Z0-9] # Match uppercase or digit
[a-zA-Z_] # Match word character
Predefined Character Classes#
| Class | Matches | Negated |
|---|---|---|
| `\d` | Digit [0-9] | `\D` |
| `\w` | Word char [a-zA-Z0-9_] | `\W` |
| `\s` | Whitespace [ \t\n\r\f] | `\S` |
| `\h` | Horizontal whitespace | `\H` |
| `\v` | Vertical whitespace | `\V` |
/\d+/ # One or more digits
/\w+/ # One or more word chars
/\s*/ # Zero or more spaces
POSIX Character Classes#
[:alnum:] # Alphanumeric [a-zA-Z0-9]
[:alpha:] # Alphabetic [a-zA-Z]
[:ascii:] # ASCII characters [0-127]
[:blank:] # Space and tab
[:cntrl:] # Control characters
[:digit:] # Digits [0-9]
[:graph:] # Visible characters (not space)
[:lower:] # Lowercase letters
[:print:] # Printable characters
[:punct:] # Punctuation
[:space:] # Whitespace
[:upper:] # Uppercase letters
[:word:] # Word characters [a-zA-Z0-9_]
[:xdigit:] # Hex digits [0-9A-Fa-f]
Usage: POSIX classes are only valid inside a bracketed character class, for example [[:digit:]] or [[:alpha:][:digit:]].
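As a short illustration of POSIX classes inside bracketed classes, the sketch below counts digit runs and checks an identifier-like string. The sample strings and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $sample = "a1b22c333";

# Count runs of digits; the empty-list assignment forces list context.
my $digit_runs = () = $sample =~ /[[:digit:]]+/g;

# Identifier check: a letter or underscore, then letters, digits, or underscores.
my $is_ident  = ("foo_42" =~ /\A[[:alpha:]_][[:alnum:]_]*\z/) ? 1 : 0;
my $not_ident = ("42foo"  =~ /\A[[:alpha:]_][[:alnum:]_]*\z/) ? 1 : 0;

print "runs=$digit_runs ident=$is_ident not_ident=$not_ident\n";   # runs=3 ident=1 not_ident=0
```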
Quantifiers#
| Quantifier | Meaning | Greedy | Non-greedy | Possessive |
|---|---|---|---|---|
| `*` | 0 or more | Yes | `*?` | `*+` |
| `+` | 1 or more | Yes | `+?` | `++` |
| `?` | 0 or 1 | Yes | `??` | `?+` |
| `{n}` | Exactly n | Yes | N/A | N/A |
| `{n,}` | n or more | Yes | `{n,}?` | N/A |
| `{n,m}` | n to m | Yes | `{n,m}?` | N/A |
/a*/ # 0 or more 'a' (greedy)
/a*?/ # 0 or more 'a' (non-greedy)
/a+/ # 1 or more 'a'
/a?/ # 0 or 1 'a'
/a{3}/ # Exactly 3 'a'
/a{3,}/ # 3 or more 'a'
/a{3,5}/ # 3 to 5 'a'
Greedy vs Non-greedy: Greedy quantifiers match as much as possible; non-greedy quantifiers match as little as possible while still letting the overall pattern succeed.
# Given: "foo123bar"
/\d+/ # Matches "123" (greedy)
/\d+?/ # Matches "1" (non-greedy, but entire pattern must match)
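The effect is easiest to see when a delimiter appears more than once. This is a small sketch with a made-up HTML fragment; the names `$html`, `$greedy`, and `$lazy` are illustrative.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the end of the string, then backs off to the LAST ">".
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: .+? stops at the FIRST ">" that lets the pattern succeed.
my ($lazy) = $html =~ /<(.+?)>/;

print "greedy: $greedy\n";   # greedy: b>bold</b> and <i>italic</i
print "lazy:   $lazy\n";     # lazy:   b
```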
Groups and Captures#
Capturing Groups#
/(foo)/ # Capture "foo" in $1
/(foo)(bar)/ # Capture in $1 and $2
Access captures with $1, $2, etc. The @- and @+ arrays hold the start and end offsets of each group, not the captured text.
Non-Capturing Groups#
/(?:foo)/ # Group but don't capture
Use this when you need grouping (for alternation or a quantifier) but do not need the captured text or its overhead.
Named Captures#
/(?<name>\w+)/ # Named capture "name"
/(?'name'\w+)/ # Alternative syntax
Access with $+{name} hash.
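A brief sketch of named captures in practice, assuming a date in YYYY-MM-DD form. The group names and the `%parts` hash are hypothetical; `%+` is copied because the next successful match overwrites it.

```perl
use strict;
use warnings;

my $date = "2024-03-15";
my %parts;

if ($date =~ /\A(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})\z/) {
    # Copy the named captures before another match replaces them.
    %parts = %+;
}

print "$parts{day}.$parts{month}.$parts{year}\n";   # 15.03.2024
```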
Alternation#
/foo|bar/ # Match "foo" or "bar"
/(red|green|blue)/ # Capture color
Anchors and Assertions#
Zero-Width Assertions#
| Assertion | Meaning |
|---|---|
| `(?=...)` | Positive lookahead |
| `(?!...)` | Negative lookahead |
| `(?<=...)` | Positive lookbehind |
| `(?<!...)` | Negative lookbehind |
/foo(?=bar)/ # "foo" followed by "bar" (bar not consumed)
/foo(?!bar)/ # "foo" not followed by "bar"
/(?<=foo)bar/ # "bar" preceded by "foo"
/(?<!foo)bar/ # "bar" not preceded by "foo"
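Because lookarounds consume no characters, they can locate positions rather than text. The sketch below inserts thousands separators at such positions and uses a negative lookahead to skip one alternative; the inputs are made up for illustration.

```perl
use strict;
use warnings;

my $n = "1234567";

# Insert a comma at every position that is preceded by a digit and
# followed by one or more complete groups of three digits.
(my $grouped = $n) =~ s/(?<=\d)(?=(?:\d{3})+\b)/,/g;

# Match a word starting with "foo" unless "bar" follows immediately.
my ($not_bar) = "foobar foobaz" =~ /(foo(?!bar)\w+)/;

print "$grouped\n";   # 1,234,567
print "$not_bar\n";   # foobaz
```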
Atomic Groups#
/(?>pattern)/ # Atomic group (no backtracking)
Once the group has matched, the engine will not backtrack into it to try a different match for its contents. Atomic groups are mainly used to avoid costly backtracking.
Backreferences#
/(foo)\1/ # Match "foofoo" - \1 references first capture
/(['"]).*?\1/ # Match quoted string (same quote type)
Named Backreferences#
/(?<tag>\w+)...\k<tag>/ # Named backreference
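Two common uses of backreferences, sketched with hypothetical sample strings: finding a doubled word, and matching a quoted string whose closing quote must equal its opening quote.

```perl
use strict;
use warnings;

# Doubled word: \1 must repeat exactly what group 1 captured.
my $text = "this is is a test";
my ($dup) = $text =~ /\b(\w+)\s+\1\b/;

# Quoted string: \k<q> refers back to whichever quote character opened it.
my $inner;
if (q{say "it's fine" now} =~ /(?<q>['"])(.*?)\k<q>/) {
    $inner = $2;   # the apostrophe inside does not end the match
}

print "dup=$dup inner=$inner\n";   # dup=is inner=it's fine
```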
Conditionals#
/(?(condition)yes|no)/ # If condition matches, try "yes", else try "no"
Conditions can be:
- Capture group number: (?(1)yes|no) - if group 1 matched
- Named capture: (?(<name>)yes|no) - if named group matched
- Lookahead: (?(?=test)yes|no) - if lookahead succeeds
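As a sketch of the capture-group form, the pattern below requires a closing parenthesis only if an opening one was matched. The pattern name `$num` and the test inputs are illustrative, and the "no" branch is simply omitted.

```perl
use strict;
use warnings;

# Group 1 optionally matches "("; the conditional demands ")" only in that case.
my $num = qr/\A(\()?\d+(?(1)\))\z/;

my @results;
for my $candidate ("42", "(42)", "(42", "42)") {
    push @results, ($candidate =~ $num) ? 1 : 0;
}

print "@results\n";   # 1 1 0 0
```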
Modifiers#
Modifiers change regex behavior. They are written after the closing delimiter:
| Modifier | Meaning |
|---|---|
| `i` | Case-insensitive |
| `m` | Multiline (`^` and `$` match at line boundaries) |
| `s` | Single-line (`.` matches newline) |
| `x` | Extended (ignore whitespace, allow comments) |
| `g` | Global (find all matches) |
| `c` | Continue searching after failed match |
| `o` | Compile once (legacy, not needed in PetaPerl) |
| `e` | Evaluate replacement as code (in s///) |
/pattern/i # Case-insensitive
/pattern/ms # Multiline + single-line
/pattern/x # Extended (readable)
/pattern/g # Global matching
Examples#
# Case-insensitive
if ($str =~ /hello/i) { ... }
# Multiline: ^ and $ match line starts/ends
while ($text =~ /^Line: (.+)$/mg) {
print "Found: $1\n";
}
# Extended: whitespace and comments ignored
my $email_re = qr{
(\w+) # Username
@ # At sign
([\w.]+) # Domain
}x;
# Global: find all matches
my @words = $text =~ /\w+/g;
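The g and c modifiers combine with \G to walk a string token by token: /c keeps the match position when an attempt fails, so the next alternative can try at the same place. The following is a minimal tokenizer sketch; the token names and input are hypothetical.

```perl
use strict;
use warnings;

my $input = "x = 42 + y";
my @tokens;

for ($input) {            # alias $_ to $input so pos() is tracked on it
    while (1) {
        /\G\s+/gc;        # skip whitespace; on failure /c leaves pos() alone
        if    (/\G([A-Za-z_]\w*)/gc) { push @tokens, "IDENT:$1" }
        elsif (/\G(\d+)/gc)          { push @tokens, "NUM:$1"   }
        elsif (/\G([=+])/gc)         { push @tokens, "OP:$1"    }
        else                         { last }   # end of input or unknown char
    }
}

print join(" ", @tokens), "\n";   # IDENT:x OP:= NUM:42 OP:+ IDENT:y
```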
Matching and Substitution#
Match Operator#
$str =~ /pattern/ # True if matches
$str =~ /pattern/g # Global, returns all matches
In list context with captures:
my ($user, $domain) = $email =~ /(\w+)@([\w.]+)/;
In list context with global:
my @numbers = $text =~ /\d+/g; # All numbers
Substitution#
$str =~ s/old/new/ # Replace first occurrence
$str =~ s/old/new/g # Replace all occurrences
$str =~ s/old/new/i # Case-insensitive replace
$str =~ s/old/new/gi # Global + case-insensitive
Replacement with captures:
$str =~ s/(\w+)@(\w+)/$2\@$1/; # Reverse user@domain
Evaluated replacement:
$str =~ s/(\d+)/$1 * 2/ge; # Double all numbers (without /g, only the first)
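To make the interaction of the e and g modifiers concrete, here is a small sketch on a made-up string. Each substitution works on a copy so the original stays available for comparison.

```perl
use strict;
use warnings;

my $str = "3 apples and 10 pears";

# /e alone: the replacement is evaluated as code, but only for the first match.
(my $first_only = $str) =~ s/(\d+)/$1 * 2/e;

# /ge: evaluated for every match.
(my $all = $str) =~ s/(\d+)/$1 * 2/ge;

print "$first_only\n";   # 6 apples and 10 pears
print "$all\n";          # 6 apples and 20 pears
```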
Transliteration#
$str =~ tr/abc/xyz/ # Replace a→x, b→y, c→z
$str =~ y/abc/xyz/ # Same as tr
$str =~ tr/a-z/A-Z/ # Uppercase
$str =~ tr/ //d # Delete spaces
my $count = ($str =~ tr/a-z//c); # Count characters that are not a-z (string unchanged)
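tr returns the number of characters it matched, which makes it useful for counting as well as replacing. A short sketch with an illustrative string:

```perl
use strict;
use warnings;

my $s = "Hello, World";

# Transliterate a copy, leaving $s untouched.
(my $upper = $s) =~ tr/a-z/A-Z/;

# An empty replacement list without /d changes nothing and just counts.
my $vowels    = ($s =~ tr/aeiou//);
# /c complements the search list: count everything that is NOT a-z.
my $non_lower = ($s =~ tr/a-z//c);

print "$upper vowels=$vowels non_lower=$non_lower\n";   # HELLO, WORLD vowels=3 non_lower=4
```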
Special Variables#
After a successful match:
| Variable | Contains |
|---|---|
| `$&` | Entire matched string |
| `` $` `` | String before match |
| `$'` | String after match |
| `$1`, `$2`, ... | Capture groups |
| `$+` | Last matched capture |
| `@+` | End positions of captures |
| `@-` | Start positions of captures |
| `%+` | Named captures |
if ($str =~ /(foo)(bar)/) {
print "Full match: $&\n"; # "foobar"
print "Group 1: $1\n"; # "foo"
print "Group 2: $2\n"; # "bar"
print "Before: $`\n";
print "After: $'\n";
}
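The offset arrays can be used to recover positions as well as text. In this sketch (sample string and variable names are illustrative), index 0 of each array describes the whole match and index N describes group N.

```perl
use strict;
use warnings;

my $str = "xx foobar yy";
my (@starts, @ends, $whole);

if ($str =~ /(foo)(bar)/) {
    @starts = @-;   # start offsets: whole match, group 1, group 2
    @ends   = @+;   # end offsets, one past the last matched character
    $whole  = substr($str, $-[0], $+[0] - $-[0]);   # same text as $&
}

print "starts=@starts ends=@ends whole=$whole\n";   # starts=3 3 6 ends=9 6 9 whole=foobar
```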
Regex Compilation#
qr// Operator#
Compile regex for reuse:
my $word = qr/\w+/;
my $email = qr/\w+@\w+\.\w+/;
if ($str =~ $word) { ... }
if ($str =~ /$word\@$word/) { ... } # Interpolate; escape @ so @$word is not read as an array dereference
Benefits:
- Compile once, use many times
- Readable regex composition
- Performance optimization
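A sketch of composing larger patterns from compiled pieces. The pattern is deliberately simplistic and the names are illustrative; note the escaped `\@`, since an unescaped `@$word` would be parsed as an array dereference.

```perl
use strict;
use warnings;

my $word  = qr/\w+/;
# Build a larger pattern from the smaller one.
my $email = qr/\A$word\@$word\.$word\z/;

my $good = ("user\@example.com" =~ $email) ? 1 : 0;
my $bad  = ("no-at-sign"        =~ $email) ? 1 : 0;

print "good=$good bad=$bad\n";   # good=1 bad=0
```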
Performance Considerations#
Anchored Patterns#
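Patterns anchored with ^ (without /m) or \A can fail fast, because the engine only needs to attempt the match at the start of the string:
/^pattern/ # Fast: attempted only at the start
/pattern/ # Slower: may be attempted at every position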
Atomic Groups#
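Use atomic groups (?>...) to prevent backtracking into a subpattern. This can change which strings match, not only the speed, so use it only where giving characters back could never help:
# Slow: on failure, \d+ gives back digits one at a time
/\d+\w+/
# Fast: \d+ keeps everything it matched (but "123" alone no longer matches)
/(?>\d+)\w+/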
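Because \w also matches digits, the two patterns above are not interchangeable. The sketch below shows the difference on illustrative inputs, anchoring both ends so the outcome is unambiguous.

```perl
use strict;
use warnings;

# Plain: \d+ first takes "123", then gives back "3" so \w+ can match.
my $plain  = ("123" =~ /\A\d+\w+\z/)     ? 1 : 0;

# Atomic: \d+ keeps all of "123", leaving nothing for \w+.
my $atomic = ("123" =~ /\A(?>\d+)\w+\z/) ? 1 : 0;

# With trailing letters both forms match, and the atomic one never backtracks.
my $mixed  = ("123abc" =~ /\A(?>\d+)\w+\z/) ? 1 : 0;

print "plain=$plain atomic=$atomic mixed=$mixed\n";   # plain=1 atomic=0 mixed=1
```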
Non-Capturing Groups#
Use (?:...) when captures aren’t needed:
/(?:foo|bar)/ # Faster than /(foo|bar)/ when capture not needed
PetaPerl-Specific Features#
Bytecode Compilation#
Regex patterns compile to bytecode for efficient execution. PetaPerl uses:
- Bitmap character classes for fast ASCII matching
- Literal prefix extraction to skip impossible positions
- Anchored detection to avoid unnecessary scanning
Possessive Quantifiers#
Possessive quantifiers prevent backtracking entirely (more efficient than atomic groups for simple cases):
/a++/ # 1 or more 'a', no backtracking
/a*+/ # 0 or more 'a', no backtracking
/a?+/ # 0 or 1 'a', no backtracking
Embedded Code#
/pattern(?{ code })/ # Execute code during match
/(??{ code })/ # Postponed regex (code returns pattern)
(?{code}) executes Perl code at the point in the pattern where it appears. The code can access $1, $2, etc. from captures made so far.
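A minimal sketch of a code block that counts loop iterations. It uses a package variable (`our $count`, a hypothetical name) and a pattern that needs no backtracking; in patterns that do backtrack, a block may run more often than the final match suggests.

```perl
use strict;
use warnings;

our $count = 0;

# The code block runs each time the engine matches one more "a".
my $ok = ("aaa" =~ /\A(?:a(?{ $count++ }))*\z/) ? 1 : 0;

print "ok=$ok count=$count\n";   # ok=1 count=3
```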
Current Limitations#
PetaPerl’s regex engine passes 99.3% of perl5’s re_tests suite (1959/1972 tests). Remaining gaps:
- Self-referential captures: patterns like (a\1) (3 tests)
- local in code blocks: (?{ local $x = ... }) (2 tests)
- Multi-character case folding: Unicode chars that fold to multiple chars (2 tests)
- Branch reset backreferences: complex (?|...) with backrefs (5 tests)
- String interpolation edge case: 1 test
Unicode property support (\p{Letter}, \p{Digit}, etc.) is fully implemented for standard Unicode categories.
Examples#
Email Validation#
my $email_re = qr/^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$/;
if ($email =~ $email_re) {
print "Valid email\n";
}
URL Parsing#
my ($protocol, $host, $path) = $url =~
m{^(https?)://([^/]+)(/.*)$};
Log File Parsing#
while ($line =~ /\[(\d{4}-\d{2}-\d{2})\] (\w+): (.+)/g) {
my ($date, $level, $msg) = ($1, $2, $3);
# Process log entry
}
String Cleanup#
# Remove multiple spaces
$text =~ s/\s+/ /g;
# Remove leading/trailing whitespace
$text =~ s/^\s+|\s+$//g;
# Or using two substitutions
$text =~ s/^\s+//;
$text =~ s/\s+$//;
Template Substitution#
my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old.";
$template =~ s/\{\{(\w+)\}\}/$vars{$1}/ge;
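The substitution above replaces an unknown placeholder with an undefined value (and a warning under `use warnings`). One possible variation, sketched here with a hypothetical `{{missing}}` placeholder, leaves unknown keys untouched by falling back to the matched text.

```perl
use strict;
use warnings;

my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old. {{missing}}";

# Replace known keys; for unknown keys, put back the original text ($&).
(my $out = $template) =~ s/\{\{(\w+)\}\}/exists $vars{$1} ? $vars{$1} : $&/ge;

print "$out\n";   # Hello John, you are 30 years old. {{missing}}
```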