Regular Expressions

PetaPerl implements a Perl 5-compatible regex engine that supports nearly the full range of Perl regex features; the known gaps are listed under Current Limitations.

Pattern Syntax

Literals

/hello/         # Match literal "hello"
/foo bar/       # Match "foo bar"

Metacharacters

Char	Meaning
`.`	Any character except newline
`^`	Start of string (start of each line with `/m`)
`$`	End of string or before a final newline (end of each line with `/m`)
`\A`	Start of string (absolute)
`\z`	End of string (absolute)
`\Z`	End of string or before final newline
`\b`	Word boundary
`\B`	Not word boundary
`\G`	Position where the previous `/g` match ended

/^start/        # Must be at start
/end$/          # Must be at end
/\bword\b/      # Whole word match
/\Abegin/       # Absolute start
/finish\z/      # Absolute end
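
The difference between `$`, `\Z`, and `\z` only shows up when the string ends in a newline. A minimal sketch, using a made-up sample string:

```perl
use strict;
use warnings;

my $line = "done\n";   # sample string with a trailing newline

# $ and \Z both match before a final newline; \z matches only at the very end.
my $dollar  = $line =~ /done$/  ? 1 : 0;
my $big_z   = $line =~ /done\Z/ ? 1 : 0;
my $small_z = $line =~ /done\z/ ? 1 : 0;

print "dollar=$dollar Z=$big_z z=$small_z\n";   # dollar=1 Z=1 z=0
```

Prefer `\z` when validating input that must not carry a trailing newline.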

Character Classes

[abc]           # Match a, b, or c
[^abc]          # Match anything except a, b, c
[a-z]           # Match lowercase letter
[A-Z0-9]        # Match uppercase or digit
[a-zA-Z_]       # Match letter or underscore

Predefined Character Classes

Class	Matches	Negated
`\d`	Digit [0-9]	`\D` (non-digit)
`\w`	Word char [a-zA-Z0-9_]	`\W` (non-word)
`\s`	Whitespace [ \t\n\r\f\x0B]	`\S` (non-whitespace)
`\h`	Horizontal whitespace	`\H`
`\v`	Vertical whitespace	`\V`

/\d+/           # One or more digits
/\w+/           # One or more word chars
/\s*/           # Zero or more whitespace characters

POSIX Character Classes

[:alnum:]       # Alphanumeric [a-zA-Z0-9]
[:alpha:]       # Alphabetic [a-zA-Z]
[:ascii:]       # ASCII characters [0-127]
[:blank:]       # Space and tab
[:cntrl:]       # Control characters
[:digit:]       # Digits [0-9]
[:graph:]       # Visible characters (not space)
[:lower:]       # Lowercase letters
[:print:]       # Printable characters
[:punct:]       # Punctuation
[:space:]       # Whitespace
[:upper:]       # Uppercase letters
[:word:]        # Word characters [a-zA-Z0-9_]
[:xdigit:]      # Hex digits [0-9A-Fa-f]

Usage: [[:digit:]] or [[:alpha:][:digit:]]

Quantifiers

Quantifier	Meaning	Greedy	Non-greedy	Possessive
`*`	0 or more	Yes	`*?`	`*+`
`+`	1 or more	Yes	`+?`	`++`
`?`	0 or 1	Yes	`??`	`?+`
`{n}`	Exactly n	Yes	N/A	N/A
`{n,}`	n or more	Yes	`{n,}?`	`{n,}+`
`{n,m}`	n to m	Yes	`{n,m}?`	`{n,m}+`

/a*/            # 0 or more 'a' (greedy)
/a*?/           # 0 or more 'a' (non-greedy)
/a+/            # 1 or more 'a'
/a?/            # 0 or 1 'a'
/a{3}/          # Exactly 3 'a'
/a{3,}/         # 3 or more 'a'
/a{3,5}/        # 3 to 5 'a'

Greedy vs Non-greedy: Greedy quantifiers match as much as possible; non-greedy quantifiers match as little as possible while still letting the overall pattern succeed.

# Given: "foo123bar"
/\d+/           # Matches "123" (greedy)
/\d+?/          # Matches "1" (non-greedy: stops as soon as the overall pattern can succeed)
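
The difference matters most when something must match after the quantifier. A small sketch, using an invented HTML-like string:

```perl
use strict;
use warnings;

my $html = "<b>one</b> and <b>two</b>";   # sample input

# Greedy .* runs to the last </b>; non-greedy .*? stops at the first one.
my ($greedy) = $html =~ m{<b>(.*)</b>};
my ($lazy)   = $html =~ m{<b>(.*?)</b>};

print "greedy: $greedy\n";   # greedy: one</b> and <b>two
print "lazy:   $lazy\n";     # lazy:   one
```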

Groups and Captures

Capturing Groups

/(foo)/         # Capture "foo" in $1
/(foo)(bar)/    # Capture in $1 and $2

Access captures with $1, $2, etc. The @- and @+ arrays hold the start and end offsets of each group.
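
As a sketch of how the offset arrays relate to the numbered captures (the input string is made up for illustration):

```perl
use strict;
use warnings;

my $str = "key=value";   # sample input
my ($key, $value, @span);

if ($str =~ /(\w+)=(\w+)/) {
    # $-[N] is the start offset of group N, $+[N] is the offset just past its end.
    $key   = substr($str, $-[1], $+[1] - $-[1]);
    $value = substr($str, $-[2], $+[2] - $-[2]);
    @span  = ($-[2], $+[2]);
}

print "$key / $value / @span\n";   # key / value / 4 9
```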

Non-Capturing Groups

/(?:foo)/       # Group but don't capture

Use this when you need grouping but not the captured text; it avoids the overhead of recording a capture.

Named Captures

/(?<name>\w+)/  # Named capture "name"
/(?'name'\w+)/  # Alternative syntax

Access the captured text with $+{name} (an element of the %+ hash).
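
A short sketch of reading named captures, assuming an ISO-style date string as input:

```perl
use strict;
use warnings;

my $date = "2024-03-15";   # sample input
my %parts;

if ($date =~ /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/) {
    # Copy %+ right away: it is reset by the next successful match.
    %parts = %+;
}

print "$parts{day}/$parts{month}/$parts{year}\n";   # 15/03/2024
```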

Alternation

/foo|bar/       # Match "foo" or "bar"
/(red|green|blue)/ # Capture color

Anchors and Assertions

Zero-Width Assertions

Assertion	Meaning
`(?=pattern)`	Positive lookahead
`(?!pattern)`	Negative lookahead
`(?<=pattern)`	Positive lookbehind
`(?<!pattern)`	Negative lookbehind

/foo(?=bar)/    # "foo" followed by "bar" (bar not consumed)
/foo(?!bar)/    # "foo" not followed by "bar"
/(?<=foo)bar/   # "bar" preceded by "foo"
/(?<!foo)bar/   # "bar" not preceded by "foo"
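
Because lookarounds consume no characters, they can pick out positions rather than text. One possible illustration is inserting thousands separators into a digit string:

```perl
use strict;
use warnings;

my $n = "1234567";   # sample input

# Insert a comma at every position that has a digit before it and a
# multiple of three digits after it, up to the end of the string.
(my $with_commas = $n) =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;

print "$with_commas\n";   # 1,234,567
```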

Atomic Groups

/(?>pattern)/   # Atomic group (no backtracking)

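Once the group has matched, the engine will not backtrack into it to try other ways of matching its contents. This is mainly a performance optimization, but it can also change what the pattern matches.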
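
A minimal sketch of the behavioral difference, using a three-character sample string:

```perl
use strict;
use warnings;

my $s = "aaa";   # sample input

# Plain a+ gives back one 'a' so the trailing 'a' can match.
my $normal = $s =~ /a+a/     ? 1 : 0;
# (?>a+) keeps all three, so nothing is left for the trailing 'a'.
my $atomic = $s =~ /(?>a+)a/ ? 1 : 0;

print "normal=$normal atomic=$atomic\n";   # normal=1 atomic=0
```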

Backreferences

/(foo)\1/       # Match "foofoo" - \1 references first capture
/(['"]).*?\1/   # Match quoted string (same quote type)

Named Backreferences

/(?<tag>\w+)...\k<tag>/ # Named backreference
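
As an illustration, a named backreference can require that a closing tag repeat the opening tag's name. The markup string here is invented for the example:

```perl
use strict;
use warnings;

my $xml = "<em>text</em>";   # sample input
my ($tag, $body);

# \k<tag> must match exactly what the (?<tag>...) group captured.
if ($xml =~ m{<(?<tag>\w+)>(.*?)</\k<tag>>}) {
    ($tag, $body) = ($+{tag}, $2);
}

my $mismatch = "<em>text</b>" =~ m{<(?<tag>\w+)>(.*?)</\k<tag>>} ? 1 : 0;

print "$tag: $body (mismatch=$mismatch)\n";   # em: text (mismatch=0)
```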

Conditionals

/(?(condition)yes|no)/ # If condition matches, try "yes", else try "no"

Conditions can be:

Capture group number: (?(1)yes|no) - if group 1 matched
Named capture: (?(<name>)yes|no) - if named group matched
Lookahead: (?(?=test)yes|no) - if lookahead succeeds
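
A sketch of the capture-group form: a closing parenthesis is required only if an opening one was matched. The pattern and test strings are illustrative:

```perl
use strict;
use warnings;

# Group 1 optionally matches "("; (?(1)\)) demands ")" only when it did.
my $re = qr/\A(\()?\d+(?(1)\))\z/;

my @inputs  = ("42", "(42)", "(42", "42)");
my @results = map { $_ =~ $re ? 1 : 0 } @inputs;

print "@results\n";   # 1 1 0 0
```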

Modifiers

Modifiers change regex behavior. They are written after the closing delimiter:

Modifier	Meaning
`i`	Case-insensitive
`m`	Multiline (^/$ match line boundaries)
`s`	Single-line (. matches newline)
`x`	Extended (ignore whitespace, allow comments)
`g`	Global (find all matches)
`c`	Keep the current position after a failed match (used with `g`)
`o`	Compile once (legacy, not needed in PetaPerl)
`e`	Evaluate replacement as code (in s///)

/pattern/i      # Case-insensitive
/pattern/ms     # Multiline + single-line
/pattern/x      # Extended (readable)
/pattern/g      # Global matching

Examples

# Case-insensitive
if ($str =~ /hello/i) { ... }

# Multiline: ^ and $ match line starts/ends
while ($text =~ /^Line: (.+)$/mg) {
    print "Found: $1\n";
}

# Extended: whitespace and comments ignored
my $email_re = qr{
    (\w+)           # Username
    @               # At sign
    ([\w.]+)        # Domain
}x;

# Global: find all matches
my @words = $text =~ /\w+/g;
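
The `g` and `c` modifiers combine with `\G` to scan a string token by token. The following is a minimal sketch of that idiom; the token names and the input string are assumptions made for illustration:

```perl
use strict;
use warnings;

my $input = "x = 42";   # sample input
my @tokens;

pos($input) = 0;        # start scanning at offset 0
while (pos($input) < length $input) {
    # \G anchors each attempt at the current position. With /c, a failed
    # attempt leaves pos() unchanged so the next branch can try there.
    if    ($input =~ /\G\s+/gc)            { next }
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "IDENT:$1" }
    elsif ($input =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G(=)/gc)            { push @tokens, "OP:$1" }
    else  { die "Unexpected character at offset " . pos($input) . "\n" }
}

print join(" ", @tokens), "\n";   # IDENT:x OP:= NUM:42
```

Without `c`, the first failed branch would reset the position and the later branches would start again from the beginning of the string.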

Matching and Substitution

Match Operator

$str =~ /pattern/       # True if matches
$str =~ /pattern/g      # Global: all matches in list context, next match in scalar context

In list context with captures:

my ($user, $domain) = $email =~ /(\w+)@([\w.]+)/;

In list context with global:

my @numbers = $text =~ /\d+/g;  # All numbers

Substitution

$str =~ s/old/new/      # Replace first occurrence
$str =~ s/old/new/g     # Replace all occurrences
$str =~ s/old/new/i     # Case-insensitive replace
$str =~ s/old/new/gi    # Global + case-insensitive

Replacement with captures:

$str =~ s/(\w+)@(\w+)/$2\@$1/;  # Reverse user@domain

Evaluated replacement:

$str =~ s/(\d+)/$1 * 2/ge;      # Double all numbers

Transliteration

$str =~ tr/abc/xyz/     # Replace a→x, b→y, c→z
$str =~ y/abc/xyz/      # Same as tr
$str =~ tr/a-z/A-Z/     # Uppercase
$str =~ tr/ //d         # Delete spaces
$str =~ tr/a-z//c       # Count non-lowercase
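
`tr` returns the number of characters it matched, so with an empty replacement list it counts without modifying the string. A small sketch with a made-up sequence:

```perl
use strict;
use warnings;

my $dna = "AACGTTTA";   # sample input

# Empty replacement and no /d: the string is unchanged, the count is returned.
my $a_count = ($dna =~ tr/A//);

# Transliterate a copy, leaving the original intact.
(my $rna = $dna) =~ tr/T/U/;

print "$a_count $rna $dna\n";   # 3 AACGUUUA AACGTTTA
```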

Special Variables

After a successful match:

Variable	Contains
`$&`	Entire matched string
`` $` ``	String before match
`$'`	String after match
`$1`, `$2`, …	Capture groups
`$+`	Text of the highest-numbered capture group that matched
`@+`	End positions of captures
`@-`	Start positions of captures
`%+`	Named captures

if ($str =~ /(foo)(bar)/) {
    print "Full match: $&\n";     # "foobar"
    print "Group 1: $1\n";        # "foo"
    print "Group 2: $2\n";        # "bar"
    print "Before: $`\n";
    print "After: $'\n";
}

Regex Compilation

qr// Operator

Compile regex for reuse:

my $word = qr/\w+/;
my $email = qr/\w+@\w+\.\w+/;

if ($str =~ $word) { ... }
if ($str =~ /$word\@$word/) { ... }  # Interpolate (escape @ so @$word is not read as an array dereference)

Benefits:

Compile once, use many times
Readable regex composition
Performance optimization
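
As an illustration of composition, a larger pattern can be assembled from a small compiled piece. The names `$octet` and `$ipv4` are hypothetical and chosen for this sketch:

```perl
use strict;
use warnings;

# One decimal octet, 0 to 255.
my $octet = qr/(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)/;

# Reuse the compiled piece four times inside a larger pattern.
my $ipv4 = qr/\A$octet(?:\.$octet){3}\z/;

my @inputs  = ("192.168.0.1", "256.1.1.1", "1.2.3");
my @results = map { $_ =~ $ipv4 ? 1 : 0 } @inputs;

print "@results\n";   # 1 0 0
```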

Performance Considerations

Anchored Patterns

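Patterns anchored with \A (or ^ without the m modifier) are attempted only at the start of the string, so a failed match is detected without scanning:

/^pattern/      # Fast: only tried at the start
/pattern/       # Slower: may be tried at every position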

Atomic Groups

Use atomic groups (?>...) to prevent backtracking:

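# Slow on failure: \d+ gives back digits one at a time before giving up
/\d+\w+/

# Fast: (?>\d+) never gives digits back. Unlike the pattern above,
# it cannot match an all-digit string such as "123".
/(?>\d+)\w+/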

Non-Capturing Groups

Use (?:...) when captures aren’t needed:

/(?:foo|bar)/   # Faster than /(foo|bar)/ when capture not needed

PetaPerl-Specific Features

Bytecode Compilation

Regex patterns compile to bytecode for efficient execution. PetaPerl uses:

Bitmap character classes for fast ASCII matching
Literal prefix extraction to skip impossible positions
Anchored detection to avoid unnecessary scanning

Possessive Quantifiers

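Possessive quantifiers never give back what they have matched, so the engine cannot backtrack into them. They behave like an atomic group around the quantified item (and are more efficient than atomic groups for simple cases):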

/a++/           # 1 or more 'a', no backtracking
/a*+/           # 0 or more 'a', no backtracking
/a?+/           # 0 or 1 'a', no backtracking

Embedded Code

/pattern(?{ code })/    # Execute code during match
/(??{ code })/          # Postponed regex (code returns pattern)

(?{code}) executes Perl code at the point in the pattern where it appears. The code can access $1, $2, etc. from captures made so far.
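
A minimal sketch of a code block that records each capture as it is made. The `@seen` array and the input string are assumptions for illustration:

```perl
use strict;
use warnings;

my @seen;

# The code block runs each time the group matches; $1 is the capture so far.
my $ok = "a1b2" =~ /(?:(\w)(?{ push @seen, $1 }))+/ ? 1 : 0;

print "ok=$ok seen=@seen\n";   # ok=1 seen=a 1 b 2
```

Code blocks also run on paths that are later abandoned by backtracking, so side effects like this are easiest to reason about in patterns that do not backtrack.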

Current Limitations

PetaPerl’s regex engine passes 99.3% of perl5’s re_tests suite (1959/1972 tests). Remaining gaps:

Self-referential captures — patterns like (a\1) (3 tests)
local in code blocks — (?{ local $x = ... }) (2 tests)
Multi-character case folding — Unicode chars that fold to multiple chars (2 tests)
Branch reset backreferences — complex (?|...) with backrefs (5 tests)
String interpolation edge case — 1 test

Unicode property support (\p{Letter}, \p{Digit}, etc.) is fully implemented for standard Unicode categories.
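
A short sketch of property matching with standard Perl 5 syntax. The sample strings are invented, and the non-ASCII letter is written as an escape to keep the source ASCII-only:

```perl
use strict;
use warnings;

my @names = ("Zo\x{EB}", "R2D2");   # "Zoe" with e-diaeresis, and a name containing digits

# \p{Letter} matches any Unicode letter, including U+00EB.
my @all_letters = map { /\A\p{Letter}+\z/ ? 1 : 0 } @names;

# \p{Digit} matches decimal digits.
my ($digits) = "room 42" =~ /(\p{Digit}+)/;

print "@all_letters $digits\n";   # 1 0 42
```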

Examples

Email Validation

my $email_re = qr/^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$/;
if ($email =~ $email_re) {
    print "Valid email\n";
}

URL Parsing

my ($protocol, $host, $path) = $url =~
    m{^(https?)://([^/]+)(/.*)$};

Log File Parsing

while ($line =~ /\[(\d{4}-\d{2}-\d{2})\] (\w+): (.+)/g) {
    my ($date, $level, $msg) = ($1, $2, $3);
    # Process log entry
}

String Cleanup

# Remove multiple spaces
$text =~ s/\s+/ /g;

# Remove leading/trailing whitespace
$text =~ s/^\s+|\s+$//g;

# Or using two substitutions
$text =~ s/^\s+//;
$text =~ s/\s+$//;

Template Substitution

my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old.";
$template =~ s/\{\{(\w+)\}\}/$vars{$1}/ge;
