# Regular Expressions

PetaPerl implements a Perl 5-compatible regex engine that supports nearly the full range of Perl regex features; the known gaps are listed under Current Limitations.

## Pattern Syntax

### Literals

```perl
/hello/          # Match literal "hello"
/foo bar/        # Match "foo bar"
```

### Metacharacters

| Char | Meaning |
|------|---------|
| `.` | Any character except newline |
| `^` | Start of string |
| `$` | End of string (or before a final newline) |
| `\A` | Start of string (absolute) |
| `\z` | End of string (absolute) |
| `\Z` | End of string or before final newline |
| `\b` | Word boundary |
| `\B` | Not word boundary |
| `\G` | Position of last match |

```perl
/^start/         # Must be at start
/end$/           # Must be at end
/\bword\b/       # Whole word match
/\Abegin/        # Absolute start
/finish\z/       # Absolute end
```

### Character Classes

```perl
[abc]            # Match a, b, or c
[^abc]           # Match anything except a, b, c
[a-z]            # Match lowercase letter
[A-Z0-9]         # Match uppercase or digit
[a-zA-Z_]        # Match letter or underscore
```

#### Predefined Character Classes

| Class | Matches | Negated |
|-------|---------|---------|
| `\d` | Digit [0-9] | `\D` (non-digit) |
| `\w` | Word char [a-zA-Z0-9_] | `\W` (non-word) |
| `\s` | Whitespace [ \t\n\r\f] | `\S` (non-whitespace) |
| `\h` | Horizontal whitespace | `\H` |
| `\v` | Vertical whitespace | `\V` |

```perl
/\d+/            # One or more digits
/\w+/            # One or more word chars
/\s*/            # Zero or more whitespace characters
```

#### POSIX Character Classes

```perl
[:alnum:]        # Alphanumeric [a-zA-Z0-9]
[:alpha:]        # Alphabetic [a-zA-Z]
[:ascii:]        # ASCII characters [0-127]
[:blank:]        # Space and tab
[:cntrl:]        # Control characters
[:digit:]        # Digits [0-9]
[:graph:]        # Visible characters (not space)
[:lower:]        # Lowercase letters
[:print:]        # Printable characters
[:punct:]        # Punctuation
[:space:]        # Whitespace
[:upper:]        # Uppercase letters
[:word:]         # Word characters [a-zA-Z0-9_]
[:xdigit:]       # Hex digits [0-9A-Fa-f]
```

POSIX classes are only valid inside a bracketed character class: `[[:digit:]]` or `[[:alpha:][:digit:]]`.

### Quantifiers

| Quantifier | Meaning | Greedy | Non-greedy | Possessive |
|------------|---------|--------|------------|------------|
| `*` | 0 or more | Yes | `*?` | `*+` |
| `+` | 1 or more | Yes | `+?` | `++` |
| `?` | 0 or 1 | Yes | `??` | `?+` |
| `{n}` | Exactly n | Yes | N/A | N/A |
| `{n,}` | n or more | Yes | `{n,}?` | N/A |
| `{n,m}` | n to m | Yes | `{n,m}?` | N/A |

```perl
/a*/             # 0 or more 'a' (greedy)
/a*?/            # 0 or more 'a' (non-greedy)
/a+/             # 1 or more 'a'
/a?/             # 0 or 1 'a'
/a{3}/           # Exactly 3 'a'
/a{3,}/          # 3 or more 'a'
/a{3,5}/         # 3 to 5 'a'
```

**Greedy vs Non-greedy**: Greedy quantifiers match as much as possible; non-greedy quantifiers match as little as possible.

```perl
# Given: "foo123bar"
/\d+/            # Matches "123" (greedy)
/\d+?/           # Matches "1" (non-greedy, but entire pattern must match)
```

### Groups and Captures

#### Capturing Groups

```perl
/(foo)/          # Capture "foo" in $1
/(foo)(bar)/     # Capture in $1 and $2
```

Access captures with `$1`, `$2`, etc. The `@-` and `@+` arrays hold the start and end offsets of each capture.

#### Non-Capturing Groups

```perl
/(?:foo)/        # Group but don't capture
```

Use a non-capturing group when you need grouping but not the capture or its overhead.

#### Named Captures

```perl
/(?<name>\w+)/   # Named capture "name"
/(?'name'\w+)/   # Alternative syntax
```

Access named captures through the `%+` hash, for example `$+{name}`.

### Alternation

```perl
/foo|bar/            # Match "foo" or "bar"
/(red|green|blue)/   # Capture color
```

### Anchors and Assertions

#### Zero-Width Assertions

| Assertion | Meaning |
|-----------|---------|
| `(?=pattern)` | Positive lookahead |
| `(?!pattern)` | Negative lookahead |
| `(?<=pattern)` | Positive lookbehind |
| `(?<!pattern)` | Negative lookbehind |

#### Atomic Groups

```perl
/(?>pattern)/    # Atomic group (no backtracking)
```

Once an atomic group has matched, the engine will not backtrack into it to try a different match for its contents. This is mainly used as a performance optimization.
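To make the zero-width assertions and atomic groups above more concrete, here is a minimal sketch. It uses only standard Perl 5 syntax and assumes PetaPerl behaves like perl5 for these constructs; the sample string and all variable names are hypothetical and chosen purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for this illustration.
my $prices = "cost: 100USD, 200EUR";

# Positive lookahead: digits followed by "USD" (the suffix is not consumed).
my ($usd) = $prices =~ /(\d+)(?=USD)/;

# Negative lookahead: digits not followed by another digit or by "USD".
my ($not_usd) = $prices =~ /(\d+)(?!\d|USD)/;

# Positive lookbehind: digits preceded by ", ".
my ($after_comma) = $prices =~ /(?<=, )(\d+)/;

# Negative lookbehind: digits not preceded by a digit or by ": ".
my ($not_after_colon) = $prices =~ /(?<!\d)(?<!: )(\d+)/;

# Atomic group: \d+ may not give back characters once it has matched,
# so the atomic variant fails on an all-digit string.
my $plain  = "123" =~ /^\d+\w$/     ? 1 : 0;
my $atomic = "123" =~ /^(?>\d+)\w$/ ? 1 : 0;

print "usd=$usd not_usd=$not_usd\n";                              # usd=100 not_usd=200
print "after_comma=$after_comma not_after_colon=$not_after_colon\n";  # both 200
print "plain=$plain atomic=$atomic\n";                            # plain=1 atomic=0
```

The last two matches show that an atomic group is not just an optimization: it can change whether a pattern matches, so it is only a drop-in replacement when the following sub-pattern never needs characters that the group consumed.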
### Backreferences

```perl
/(foo)\1/        # Match "foofoo" - \1 references first capture
/(['"]).*?\1/    # Match quoted string (same quote type)
```

#### Named Backreferences

```perl
/(?<name>\w+)...\k<name>/   # Named backreference
```

### Conditionals

```perl
/(?(condition)yes|no)/   # If condition matches, try "yes", else try "no"
```

Conditions can be:

- Capture group number: `(?(1)yes|no)` - if group 1 matched
- Named capture: `(?(<name>)yes|no)` - if named group matched
- Lookahead: `(?(?=test)yes|no)` - if lookahead succeeds

## Modifiers

Modifiers change regex behavior. They are appended after the closing delimiter:

| Modifier | Meaning |
|----------|---------|
| `i` | Case-insensitive |
| `m` | Multiline (^/$ match line boundaries) |
| `s` | Single-line (. matches newline) |
| `x` | Extended (ignore whitespace, allow comments) |
| `g` | Global (find all matches) |
| `c` | Keep the current match position after a failed `g` match |
| `o` | Compile once (legacy, not needed in PetaPerl) |
| `e` | Evaluate replacement as code (in s///) |

```perl
/pattern/i       # Case-insensitive
/pattern/ms      # Multiline + single-line
/pattern/x       # Extended (readable)
/pattern/g       # Global matching
```

### Examples

```perl
# Case-insensitive
if ($str =~ /hello/i) {
    ...
}

# Multiline: ^ and $ match line starts/ends
while ($text =~ /^Line: (.+)$/mg) {
    print "Found: $1\n";
}

# Extended: whitespace and comments ignored
my $email_re = qr{
    (\w+)      # Username
    @          # At sign
    ([\w.]+)   # Domain
}x;

# Global: find all matches
my @words = $text =~ /\w+/g;
```

## Matching and Substitution

### Match Operator

```perl
$str =~ /pattern/      # True if matches
$str =~ /pattern/g     # Global; in list context, returns all matches
```

In list context with captures:

```perl
my ($user, $domain) = $email =~ /(\w+)@([\w.]+)/;
```

In list context with global:

```perl
my @numbers = $text =~ /\d+/g;   # All numbers
```

### Substitution

```perl
$str =~ s/old/new/       # Replace first occurrence
$str =~ s/old/new/g      # Replace all occurrences
$str =~ s/old/new/i      # Case-insensitive replace
$str =~ s/old/new/gi     # Global + case-insensitive
```

**Replacement with captures:**

```perl
$str =~ s/(\w+)@(\w+)/$2\@$1/;   # Swap the two sides of user@domain
```

**Evaluated replacement:**

```perl
$str =~ s/(\d+)/$1 * 2/ge;       # Double all numbers (/g is needed for "all")
```

### Transliteration

```perl
$str =~ tr/abc/xyz/      # Replace a→x, b→y, c→z
$str =~ y/abc/xyz/       # Same as tr
$str =~ tr/a-z/A-Z/      # Uppercase ASCII letters
$str =~ tr/ //d          # Delete spaces
$str =~ tr/a-z//c        # Count non-lowercase characters (returns the count)
```

## Special Variables

After a successful match:

| Variable | Contains |
|----------|----------|
| `$&` | Entire matched string |
| `` $` `` | String before match |
| `$'` | String after match |
| `$1`, `$2`, ... | Capture groups |
| `$+` | Last matched capture |
| `@+` | End positions of captures |
| `@-` | Start positions of captures |
| `%+` | Named captures |

```perl
if ($str =~ /(foo)(bar)/) {
    print "Full match: $&\n";    # "foobar"
    print "Group 1: $1\n";       # "foo"
    print "Group 2: $2\n";       # "bar"
    print "Before: $`\n";
    print "After: $'\n";
}
```

## Regex Compilation

### qr// Operator

Compile regex for reuse:

```perl
my $word  = qr/\w+/;
my $email = qr/\w+@\w+\.\w+/;

if ($str =~ $word) { ... }

# Escape the @ so that "@$word" is not interpolated as an array dereference
if ($str =~ /$word\@$word/) {
    ...
}   # Interpolated into a larger pattern
```

Benefits:

- Compile once, use many times
- Readable regex composition
- Performance optimization

## Performance Considerations

### Anchored Patterns

Patterns anchored with `^` or `\A` are faster, because the engine only attempts a match at the start of the string:

```perl
/^pattern/       # Fast: only checks start
/pattern/        # Slower: scans entire string
```

### Atomic Groups

Use atomic groups `(?>...)` to prevent backtracking:

```perl
# Slow: backtracks on failure
/\d+\w+/

# Fast: no backtracking in \d+
/(?>\d+)\w+/
```

The two patterns are not equivalent. Because `\w` also matches digits, the first pattern matches an all-digit string such as `"123"`, while the atomic version does not: `\d+` keeps every digit and leaves nothing for `\w+`.

### Non-Capturing Groups

Use `(?:...)` when captures aren't needed:

```perl
/(?:foo|bar)/    # Faster than /(foo|bar)/ when capture not needed
```

## PetaPerl-Specific Features

### Bytecode Compilation

Regex patterns compile to bytecode for efficient execution. PetaPerl uses:

- **Bitmap character classes** for fast ASCII matching
- **Literal prefix extraction** to skip impossible positions
- **Anchored detection** to avoid unnecessary scanning

### Possessive Quantifiers

Possessive quantifiers prevent backtracking entirely (more efficient than atomic groups for simple cases):

```perl
/a++/            # 1 or more 'a', no backtracking
/a*+/            # 0 or more 'a', no backtracking
/a?+/            # 0 or 1 'a', no backtracking
```

### Embedded Code

```perl
/pattern(?{ code })/     # Execute code during match
/(??{ code })/           # Postponed regex (code returns pattern)
```

`(?{code})` executes Perl code at the point in the pattern where it appears. The code can access `$1`, `$2`, etc. from captures made so far.

## Current Limitations

PetaPerl's regex engine passes 99.3% of perl5's `re_tests` suite (1959/1972 tests). The 13 remaining failures fall into these groups:

- **Self-referential captures**: patterns like `(a\1)` (3 tests)
- **`local` in code blocks**: `(?{ local $x = ... })` (2 tests)
- **Multi-character case folding**: Unicode chars that fold to multiple chars (2 tests)
- **Branch reset backreferences**: complex `(?|...)` with backrefs (5 tests)
- **String interpolation edge case**: 1 test

Unicode property support (`\p{Letter}`, `\p{Digit}`, etc.)
is fully implemented for standard Unicode categories.

## Examples

### Email Validation

```perl
my $email_re = qr/^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$/;

if ($email =~ $email_re) {
    print "Valid email\n";
}
```

### URL Parsing

```perl
my ($protocol, $host, $path) = $url =~ m{^(https?)://([^/]+)(/.*)$};
```

### Log File Parsing

```perl
while ($line =~ /\[(\d{4}-\d{2}-\d{2})\] (\w+): (.+)/g) {
    my ($date, $level, $msg) = ($1, $2, $3);
    # Process log entry
}
```

### String Cleanup

```perl
# Remove multiple spaces
$text =~ s/\s+/ /g;

# Remove leading/trailing whitespace
$text =~ s/^\s+|\s+$//g;

# Or using two substitutions
$text =~ s/^\s+//;
$text =~ s/\s+$//;
```

### Template Substitution

```perl
my %vars = (name => "John", age => 30);
my $template = "Hello {{name}}, you are {{age}} years old.";
$template =~ s/\{\{(\w+)\}\}/$vars{$1}/ge;
```

## See Also

- [perlop](perlop.md) - Binding operators `=~` and `!~`
- [perlvar](perlvar.md) - Special variables like `$&`, `$1`, etc.