Regular Expressions
PetaPerl implements a Perl 5-compatible regex engine that supports nearly all Perl regex features; the few remaining gaps are listed under Current Limitations.
Pattern Syntax
Literals
/hello/ # Match literal "hello"
/foo bar/ # Match "foo bar"
Metacharacters
| Char | Meaning |
|---|---|
| . | Any character except newline |
| ^ | Start of string (start of each line with /m) |
| $ | End of string or before a final newline (end of each line with /m) |
| \A | Start of string (absolute) |
| \z | End of string (absolute) |
| \Z | End of string or before final newline |
| \b | Word boundary |
| \B | Not word boundary |
| \G | Position where the previous /g match ended |
/^start/ # Must be at start
/end$/ # Must be at end
/\bword\b/ # Whole word match
/\Abegin/ # Absolute start
/finish\z/ # Absolute end
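The difference between $, \Z and \z only shows up when the string ends in a newline. A minimal sketch, assuming standard Perl 5 semantics (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $line = "done\n";    # a typical line read from a file, newline still attached

my $dollar  = $line =~ /done$/  ? 1 : 0;   # $ tolerates one trailing newline
my $big_z   = $line =~ /done\Z/ ? 1 : 0;   # \Z behaves the same way
my $small_z = $line =~ /done\z/ ? 1 : 0;   # \z insists on the true end of string

printf "dollar=%d Z=%d z=%d\n", $dollar, $big_z, $small_z;   # dollar=1 Z=1 z=0
```

For input validation, \A and \z are usually the safer pair because a trailing newline cannot slip through.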
Character Classes
[abc] # Match a, b, or c
[^abc] # Match anything except a, b, c
[a-z] # Match lowercase letter
[A-Z0-9] # Match uppercase or digit
[a-zA-Z_] # Match letter or underscore
Predefined Character Classes
| Class | Matches | Negated |
|---|---|---|
| \d | Digit [0-9] | \D (non-digit) |
| \w | Word char [a-zA-Z0-9_] | \W (non-word) |
| \s | Whitespace [ \t\n\r\f] | \S (non-whitespace) |
| \h | Horizontal whitespace | \H |
| \v | Vertical whitespace | \V |
/\d+/ # One or more digits
/\w+/ # One or more word chars
/\s*/ # Zero or more whitespace characters
POSIX Character Classes
[:alnum:] # Alphanumeric [a-zA-Z0-9]
[:alpha:] # Alphabetic [a-zA-Z]
[:ascii:] # ASCII characters [0-127]
[:blank:] # Space and tab
[:cntrl:] # Control characters
[:digit:] # Digits [0-9]
[:graph:] # Visible characters (not space)
[:lower:] # Lowercase letters
[:print:] # Printable characters
[:punct:] # Punctuation
[:space:] # Whitespace
[:upper:] # Uppercase letters
[:word:] # Word characters [a-zA-Z0-9_]
[:xdigit:] # Hex digits [0-9A-Fa-f]
Usage: POSIX classes are only valid inside a bracketed character class, e.g. [[:digit:]] or [[:alpha:][:digit:]]
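The following sketch shows POSIX classes combined with other members of a bracketed class. It assumes standard Perl 5 behavior; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# An identifier: a letter or underscore, then letters, digits or underscores.
my $id            = "user_42";
my $is_identifier = $id =~ /\A[[:alpha:]_][[:alnum:]_]*\z/ ? 1 : 0;

# A string made only of hex digits.
my $hex    = "DEADbeef";
my $is_hex = $hex =~ /\A[[:xdigit:]]+\z/ ? 1 : 0;

# Strip punctuation from a copy of the string.
(my $stripped = "a,b;c") =~ s/[[:punct:]]+//g;

print "$is_identifier $is_hex $stripped\n";   # 1 1 abc
```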
Quantifiers
| Quantifier | Meaning | Greedy | Non-greedy | Possessive |
|---|---|---|---|---|
| * | 0 or more | Yes | *? | *+ |
| + | 1 or more | Yes | +? | ++ |
| ? | 0 or 1 | Yes | ?? | ?+ |
| {n} | Exactly n | Yes | N/A | N/A |
| {n,} | n or more | Yes | {n,}? | N/A |
| {n,m} | n to m | Yes | {n,m}? | N/A |
/a*/ # 0 or more 'a' (greedy)
/a*?/ # 0 or more 'a' (non-greedy)
/a+/ # 1 or more 'a'
/a?/ # 0 or 1 'a'
/a{3}/ # Exactly 3 'a'
/a{3,}/ # 3 or more 'a'
/a{3,5}/ # 3 to 5 'a'
Greedy vs Non-greedy: Greedy quantifiers match as much as possible; non-greedy quantifiers match as little as possible while still letting the overall pattern succeed.
# Given: "foo123bar"
/\d+/ # Matches "123" (greedy)
/\d+?/ # Matches "1" (non-greedy: stops as soon as the overall pattern can succeed)
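The effect is easier to see when the quantifier sits between two delimiters. A small sketch, assuming standard Perl 5 semantics (the HTML fragment is an invented sample):

```perl
use strict;
use warnings;

my $html = "<b>one</b> and <b>two</b>";

# Greedy: .+ runs to the last </b> it can reach.
my ($greedy) = $html =~ /<b>(.+)<\/b>/;

# Non-greedy: .+? stops at the first </b>.
my ($lazy) = $html =~ /<b>(.+?)<\/b>/;

print "greedy: $greedy\n";   # greedy: one</b> and <b>two
print "lazy:   $lazy\n";     # lazy:   one
```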
Groups and Captures
Capturing Groups
/(foo)/ # Capture "foo" in $1
/(foo)(bar)/ # Capture in $1 and $2
Access captures with $1, $2, etc. The @- and @+ arrays hold the start and end offsets of each group, not the captured text.
Non-Capturing Groups
/(?:foo)/ # Group but don't capture
Use this when you need grouping (for alternation or a quantifier) but do not need the captured text.
Named Captures
/(?<name>\w+)/ # Named capture "name"
/(?'name'\w+)/ # Alternative syntax
Access named captures through the %+ hash, e.g. $+{name}.
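A short sketch of named captures in use, assuming standard Perl 5 behavior. The group names year, month and day are chosen for this example only.

```perl
use strict;
use warnings;

my $date = "2024-03-15";
my %parts;

if ($date =~ /\A(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})\z/) {
    # Copy %+ right away: the next successful match will overwrite it.
    %parts = %+;
}

print "$parts{day}.$parts{month}.$parts{year}\n";   # 15.03.2024
```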
Alternation
/foo|bar/ # Match "foo" or "bar"
/(red|green|blue)/ # Capture color
Anchors and Assertions
Zero-Width Assertions
| Assertion | Meaning |
|---|---|
| (?=pattern) | Positive lookahead |
| (?!pattern) | Negative lookahead |
| (?<=pattern) | Positive lookbehind |
| (?<!pattern) | Negative lookbehind |
/foo(?=bar)/ # "foo" followed by "bar" (bar not consumed)
/foo(?!bar)/ # "foo" not followed by "bar"
/(?<=foo)bar/ # "bar" preceded by "foo"
/(?<!foo)bar/ # "bar" not preceded by "foo"
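Because assertions consume no characters, they can pick out positions rather than text. The sketch below assumes standard Perl 5 semantics; the sample strings are invented.

```perl
use strict;
use warnings;

# Insert thousands separators: match the empty position that has a digit
# behind it and a multiple of three digits ahead of it up to the end.
(my $grouped = "1234567") =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;

# Lookbehind: capture only the number that follows "USD ".
my ($usd) = "EUR 42 USD 17" =~ /(?<=USD )(\d+)/;

# Negative lookahead: numbers not followed by another digit or by "px".
my @not_px = "10px 20em 30px" =~ /\b(\d+)(?!\d|px)/g;

print "$grouped $usd @not_px\n";   # 1,234,567 17 20
```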
Atomic Groups
/(?>pattern)/ # Atomic group (no backtracking)
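Once the group has matched, the engine will not backtrack into it to try other ways of matching its contents. This is mainly a performance tool, but it can also change which strings match.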
Backreferences
/(foo)\1/ # Match "foofoo" - \1 references first capture
/(['"]).*?\1/ # Match quoted string (same quote type)
Named Backreferences
/(?<tag>\w+)...\k<tag>/ # Named backreference
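A common use of backreferences is finding accidentally repeated words. A minimal sketch under standard Perl 5 semantics; the group name w and the sample text are illustrative.

```perl
use strict;
use warnings;

my $text = "This is is a test of the the parser";

# Numbered backreference: \1 must repeat whatever group 1 captured.
my @doubled;
while ($text =~ /\b(\w+)\s+\1\b/g) {
    push @doubled, $1;
}

# The same check with a named group and \k<w>.
my @doubled_named;
while ($text =~ /\b(?<w>\w+)\s+\k<w>\b/g) {
    push @doubled_named, $+{w};
}

print "@doubled\n";         # is the
print "@doubled_named\n";   # is the
```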
Conditionals
/(?(condition)yes|no)/ # If condition matches, try "yes", else try "no"
Conditions can be:
- Capture group number: (?(1)yes|no), if group 1 matched
- Named capture: (?(<name>)yes|no), if named group matched
- Lookahead: (?(?=test)yes|no), if lookahead succeeds
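As an illustration of the capture-group condition, the sketch below accepts a number with or without surrounding parentheses, but rejects unbalanced ones. It assumes standard Perl 5 semantics; the pattern and inputs are examples only.

```perl
use strict;
use warnings;

# Group 1 optionally matches "(". The conditional then demands ")"
# only if group 1 took part in the match (the "no" branch is empty).
my $re = qr/\A(\()?\d+(?(1)\))\z/;

my @inputs  = ("123", "(123)", "(123", "123)");
my @results = map { $_ =~ $re ? 1 : 0 } @inputs;

print "@results\n";   # 1 1 0 0
```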
Modifiers
Modifiers change regex behavior. They are written after the closing delimiter:
| Modifier | Meaning |
|---|---|
| i | Case-insensitive |
| m | Multiline (^/$ match line boundaries) |
| s | Single-line (. matches newline) |
| x | Extended (ignore whitespace, allow comments) |
| g | Global (find all matches) |
| c | With g: keep the current match position after a failed match |
| o | Compile once (legacy, not needed in PetaPerl) |
| e | Evaluate replacement as code (in s///) |
/pattern/i # Case-insensitive
/pattern/ms # Multiline + single-line
/pattern/x # Extended (readable)
/pattern/g # Global matching
Examples
# Case-insensitive
if ($str =~ /hello/i) { ... }
# Multiline: ^ and $ match line starts/ends
while ($text =~ /^Line: (.+)$/mg) {
    print "Found: $1\n";
}
# Extended: whitespace and comments ignored
my $email_re = qr{
    (\w+)     # Username
    @         # At sign
    ([\w.]+)  # Domain
}x;
# Global: find all matches
my @words = $text =~ /\w+/g;
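The g and c modifiers are often combined with the \G anchor to write small tokenizers: each pattern is tried at the current position, and a failed attempt leaves that position untouched. The following is a sketch under standard Perl 5 semantics; the token names and input string are hypothetical.

```perl
use strict;
use warnings;

my $src = "x = 42 + y";
my @tokens;

pos($src) = 0;    # start scanning at the beginning
while (pos($src) < length($src)) {
    # \G anchors each attempt at pos($src); /c keeps pos on failure.
    if    ($src =~ /\G\s+/gc)              { next }
    elsif ($src =~ /\G(\d+)/gc)            { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc)   { push @tokens, "ID:$1" }
    elsif ($src =~ /\G([=+])/gc)           { push @tokens, "OP:$1" }
    else  { die "Unexpected character at offset " . pos($src) . "\n" }
}

print "@tokens\n";   # ID:x OP:= NUM:42 OP:+ ID:y
```

Without c, the first failed attempt would reset the position and the later branches would start again from the beginning of the string.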
Matching and Substitution
Match Operator
$str =~ /pattern/ # True if matches
$str =~ /pattern/g # Global: all matches in list context, next match in scalar context
In list context with captures:
my ($user, $domain) = $email =~ /(\w+)@([\w.]+)/;
In list context with global:
my @numbers = $text =~ /\d+/g; # All numbers
Substitution
$str =~ s/old/new/ # Replace first occurrence
$str =~ s/old/new/g # Replace all occurrences
$str =~ s/old/new/i # Case-insensitive replace
$str =~ s/old/new/gi # Global + case-insensitive
Replacement with captures:
$str =~ s/(\w+)@(\w+)/$2\@$1/; # Reverse user@domain
Evaluated replacement:
$str =~ s/(\d+)/$1 * 2/ge; # Double all numbers (g is needed to reach every number)
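The e modifier controls how the replacement is produced, while g controls how many matches are replaced. A short sketch contrasting the two, assuming standard Perl 5 semantics (sample data invented):

```perl
use strict;
use warnings;

my $order = "3 apples and 10 pears";

# /e alone: the expression is evaluated, but only for the first match.
(my $first_only = $order) =~ s/(\d+)/$1 * 2/e;

# /ge: the expression is evaluated for every match.
(my $all = $order) =~ s/(\d+)/$1 * 2/ge;

# Any Perl expression works, for example a function call.
(my $padded = $order) =~ s/(\d+)/sprintf("%03d", $1)/ge;

print "$first_only\n";   # 6 apples and 10 pears
print "$all\n";          # 6 apples and 20 pears
print "$padded\n";       # 003 apples and 010 pears
```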
Transliteration
$str =~ tr/abc/xyz/ # Replace a→x, b→y, c→z
$str =~ y/abc/xyz/ # Same as tr
$str =~ tr/a-z/A-Z/ # Uppercase
$str =~ tr/ //d # Delete spaces
$str =~ tr/a-z//c # Count non-lowercase
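tr returns the number of characters it matched, which makes it a cheap counter when the replacement list is empty. A sketch under standard Perl 5 semantics, with an invented sample string:

```perl
use strict;
use warnings;

my $text = "Hello, World";

# Empty replacement list and no /d: the string is left unchanged,
# and the return value is the number of matching characters.
my $lower     = ($text =~ tr/a-z//);    # lowercase letters
my $not_lower = ($text =~ tr/a-z//c);   # everything else

# Character-for-character mapping: ROT13 on a copy.
(my $rot13 = $text) =~ tr/A-Za-z/N-ZA-Mn-za-m/;

print "$lower $not_lower $rot13\n";   # 8 4 Uryyb, Jbeyq
```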
Special Variables
After a successful match:
| Variable | Contains |
|---|---|
| $& | Entire matched string |
| $` | String before match |
| $' | String after match |
| $1, $2, … | Capture groups |
| $+ | Highest-numbered capture group that took part in the match |
| @+ | End offsets of the match ($+[0]) and of each capture |
| @- | Start offsets of the match ($-[0]) and of each capture |
| %+ | Named captures |
if ($str =~ /(foo)(bar)/) {
    print "Full match: $&\n"; # "foobar"
    print "Group 1: $1\n"; # "foo"
    print "Group 2: $2\n"; # "bar"
    print "Before: $`\n";
    print "After: $'\n";
}
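The offset arrays are useful when you need positions rather than text. A minimal sketch, assuming standard Perl 5 semantics; the input string and variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "xx foobar yy";
my (@starts, @ends, $last_group, $by_offset);

if ($str =~ /(foo)(bar)/) {
    @starts     = @-;   # index 0 is the whole match, 1.. are the groups
    @ends       = @+;
    $last_group = $+;   # text of the highest-numbered group that matched
    # Rebuild group 2 from its offsets.
    $by_offset  = substr($str, $-[2], $+[2] - $-[2]);
}

print "match spans $starts[0]..$ends[0]\n";     # match spans 3..9
print "group 2 spans $starts[2]..$ends[2]\n";   # group 2 spans 6..9
print "$last_group $by_offset\n";               # bar bar
```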
Regex Compilation
qr// Operator
Compile regex for reuse:
my $word = qr/\w+/;
my $email = qr/\w+@\w+\.\w+/;
if ($str =~ $word) { ... }
if ($str =~ /$word\@$word/) { ... } # Interpolate (escape @ so that @$word is not read as an array dereference)
Benefits:
- Compile once, use many times
- Readable regex composition
- Performance optimization
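Composition pays off when a sub-pattern repeats. The sketch below builds an IPv4 matcher from a single octet pattern; it assumes standard Perl 5 semantics, and the names $octet and $ipv4 are chosen for this example.

```perl
use strict;
use warnings;

# One octet: 0-255 (leading zeros are tolerated in this simple version).
my $octet = qr/(?:25[0-5]|2[0-4]\d|1?\d?\d)/;

# Four octets separated by dots, anchored at both ends.
my $ipv4 = qr/\A$octet(?:\.$octet){3}\z/;

my @inputs = ("192.168.1.1", "256.1.1.1", "10.0.0");
my @valid  = map { $_ =~ $ipv4 ? 1 : 0 } @inputs;

print "@valid\n";   # 1 0 0
```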
Performance Considerations
Anchored Patterns
Patterns anchored with ^ or \A are faster:
/^pattern/ # Fast: only checks start
/pattern/ # Slower: scans entire string
Atomic Groups
Use atomic groups (?>...) to prevent backtracking:
# Slow: backtracks on failure
/\d+\w+/
# Fast: no backtracking in \d+
/(?>\d+)\w+/
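Note that the two patterns above are not interchangeable: an atomic group can change which strings match, not just how fast. A small sketch under standard Perl 5 semantics:

```perl
use strict;
use warnings;

my @subjects = ("123abc", "123");

# Plain version: on "123", \d+ gives back one digit so \w+ can match it.
my @plain = map { $_ =~ /\d+\w+/ ? 1 : 0 } @subjects;

# Atomic version: \d+ keeps every digit, so "123" leaves nothing for \w+.
my @atomic = map { $_ =~ /(?>\d+)\w+/ ? 1 : 0 } @subjects;

print "plain:  @plain\n";    # plain:  1 1
print "atomic: @atomic\n";   # atomic: 1 0
```

Use the atomic form only where giving characters back could never produce a match you want.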
Non-Capturing Groups
Use (?:...) when captures aren’t needed:
/(?:foo|bar)/ # Faster than /(foo|bar)/ when capture not needed
PetaPerl-Specific Features
Bytecode Compilation
Regex patterns compile to bytecode for efficient execution. PetaPerl uses:
- Bitmap character classes for fast ASCII matching
- Literal prefix extraction to skip impossible positions
- Anchored detection to avoid unnecessary scanning
Possessive Quantifiers
Possessive quantifiers prevent backtracking entirely (more efficient than atomic groups for simple cases):
/a++/ # 1 or more 'a', no backtracking
/a*+/ # 0 or more 'a', no backtracking
/a?+/ # 0 or 1 'a', no backtracking
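A typical use is a delimited token whose body can never contain the delimiter, so giving characters back is pointless. The following sketch assumes standard Perl 5 semantics; the sample strings are invented.

```perl
use strict;
use warnings;

# [^"]*+ never gives characters back, so an unterminated
# quote fails without retrying shorter bodies.
my $quoted = qr/("[^"]*+")/;

my ($found)   = q{say "hello" now} =~ $quoted;
my ($missing) = q{say "hello now}  =~ $quoted;

# Possessive matching can also reject input the greedy form accepts.
my $greedy     = "aaa" =~ /a+a/  ? 1 : 0;   # a+ gives back one 'a'
my $possessive = "aaa" =~ /a++a/ ? 1 : 0;   # a++ keeps all three

print "$found $greedy $possessive\n";   # "hello" 1 0
```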
Embedded Code
/pattern(?{ code })/ # Execute code during match
/(??{ code })/ # Postponed regex (code returns pattern)
(?{code}) executes Perl code at the point in the pattern where it appears. The code can access $1, $2, etc. from captures made so far.
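As a minimal illustration, a code block can count how often part of a pattern matched. This sketch assumes Perl 5.18 or later semantics for lexical variables inside code blocks; $count is a name chosen for the example.

```perl
use strict;
use warnings;

my $count = 0;

# The block runs each time the engine passes it, i.e. once per 'a' here,
# because this match succeeds without backtracking.
my $ok = "aaaa" =~ /\A(?:a(?{ $count++ }))*\z/;

print "matched with $count iterations\n";   # matched with 4 iterations
```

If the engine backtracks over a code block, the block may run more often than the final match suggests, so side effects like this counter are only reliable for simple patterns.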
Current Limitations
PetaPerl’s regex engine passes 99.3% of perl5’s re_tests suite (1959/1972 tests). Remaining gaps:
- Self-referential captures: patterns like (a\1) (3 tests)
- local in code blocks: (?{ local $x = ... }) (2 tests)
- Multi-character case folding: Unicode chars that fold to multiple chars (2 tests)
- Branch reset backreferences: complex (?|...) with backrefs (5 tests)
- String interpolation edge case: 1 test
Unicode property support (\p{Letter}, \p{Digit}, etc.) is fully implemented for standard Unicode categories.
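A short sketch of property matching on non-ASCII input, assuming standard Perl 5 Unicode semantics. The string is written with \x{...} escapes (Greek alpha, beta, gamma) so the example does not depend on source file encoding.

```perl
use strict;
use warnings;

# Three Greek letters followed by two ASCII digits.
my $s = "\x{3b1}\x{3b2}\x{3b3}42";

my ($letters, $digits) = $s =~ /\A(\p{Letter}+)(\p{Digit}+)\z/;

# \w stays ASCII-only under /a, so it no longer matches the Greek letters.
my $ascii_word = $s =~ /\A\w+\z/a ? 1 : 0;

printf "%d letters, digits=%s, ascii_word=%d\n",
    length($letters), $digits, $ascii_word;   # 3 letters, digits=42, ascii_word=0
```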
Examples
Email Validation
my $email_re = qr/^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$/;
if ($email =~ $email_re) {
    print "Valid email\n";
}
URL Parsing
my ($protocol, $host, $path) = $url =~
    m{^(https?)://([^/]+)(/.*)?$}; # Path is optional; $path is undef if absent
Log File Parsing
while ($line =~ /\[(\d{4}-\d{2}-\d{2})\] (\w+): (.+)/g) {
    my ($date, $level, $msg) = ($1, $2, $3);
    # Process log entry
}
String Cleanup
# Remove multiple spaces
$text =~ s/\s+/ /g;
# Remove leading/trailing whitespace
$text =~ s/^\s+|\s+$//g;
# Or using two substitutions
$text =~ s/^\s+//;
$text =~ s/\s+$//;
Template Substitution
my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old.";
$template =~ s/\{\{(\w+)\}\}/$vars{$1}/ge;