Regular expressions and pattern matching

pos

Report or set where the next /g regex match will resume in a string.

pos reads (or, as an lvalue, writes) the offset that the regex engine stores on a scalar after a global match (m//g). The offset counts characters, not bytes, and is the position just after the last successful match, which is where the next /g iteration will start scanning. With no argument, pos operates on $_.
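As a rough illustration of the character-based offset, the sketch below matches three Greek letters and compares pos with the UTF-8 encoded length of the same prefix. The variable names ($text, $char_offset, $byte_length) are illustrative only and do not come from the original page.

```perl
use strict;
use warnings;

# Three Greek letters (two bytes each in UTF-8) followed by ASCII text.
my $text = "\x{3b1}\x{3b2}\x{3b3} abc";

# Scalar-context /g match: consumes the three Greek letters.
$text =~ /\w+/g or die "no match";
my $char_offset = pos $text;        # counted in characters

# For comparison, the UTF-8 encoded length of the matched prefix.
my $prefix = substr $text, 0, $char_offset;
utf8::encode($prefix);
my $byte_length = length $prefix;   # counted in bytes

print "chars=$char_offset bytes=$byte_length\n";   # chars=3 bytes=6
```

The offset returned by pos can be passed straight to substr, which also counts in characters.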

Synopsis

pos SCALAR
pos $str = N
pos

What you get back

An integer offset, or undef when no position is recorded. 0 is a valid offset and means “start of string”; it is not the same as undef, which means “no /g match has run, or the last one failed and reset the position.” Always distinguish the two with defined:

if (defined pos $str) {
    # a /g scan is in progress
}
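A minimal sketch of why the defined test matters: a zero-width match at the start of the string leaves the position at 0, which is defined but false. The names $str, $before and $after are chosen here for illustration.

```perl
use strict;
use warnings;

my $str = "abc";

# Before any /g match, no position is recorded.
my $before = defined pos($str) ? "defined" : "undef";

# A zero-width match at the start leaves the position at 0.
$str =~ /^/g or die "no match";
my $after = pos $str;

print "before: $before\n";                  # before: undef
print "after:  $after\n";                   # after:  0
print "a plain truth test would miss it\n" unless $after;
```

Testing pos with a bare if ($after) would treat this valid offset as "no position", which is the mistake defined avoids.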

Used as an lvalue, pos SCALAR returns an assignable location:

pos($str) = 5;          # next /g match starts at char offset 5

Global state it touches

pos reads and writes the per-scalar regex position attached to its operand. With no argument it targets $_. The stored offset is what \G anchors against in the next match, so every assignment to pos changes where \G binds; merely reading pos leaves the position untouched.
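The interaction between an assigned position and \G can be sketched as follows. The string and the names $line, $peek, $word and $end are assumptions made for this example.

```perl
use strict;
use warnings;

my $line = "key=value";

# Assigning to pos moves the point where \G will anchor.
pos($line) = 4;

# Reading the position does not disturb it.
my $peek = pos $line;               # 4

# \G forces the match to begin exactly at offset 4 (the "v").
my $word;
$word = $1 if $line =~ /\G(\w+)/g;
my $end = pos $line;                # offset just past "value"

print "peek=$peek word=$word end=$end\n";   # peek=4 word=value end=9
```

Had the position been set to 3 instead, the same pattern would fail, because the character at that offset is "=" and \G does not allow the engine to scan forward.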

Examples

Walk every word in a string with /g in scalar context, using pos to report progress:

my $s = "one two three";
while ($s =~ /(\w+)/g) {
    printf "%-5s ends at %d\n", $1, pos $s;
}
# one   ends at 3
# two   ends at 7
# three ends at 13

Skip ahead before starting the scan. The first match begins at offset 4, not 0:

my $s = "AAA BBB CCC";
pos($s) = 4;
$s =~ /(\w+)/g;
print $1, "\n";             # BBB

Anchor a follow-up match to the previous one with \G. Without \G the engine would scan forward past any gap; with it, the match must start exactly where the last one ended:

my $s = "12ab34cd";
while ($s =~ /\G(\d+)(\w+?)(?=\d|\z)/g) {
    print "num=$1 tail=$2\n";
}
# num=12 tail=ab
# num=34 tail=cd

Restart a scan by clearing the position:

pos($s) = undef;            # next /g starts from offset 0 again
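A fuller, self-contained version of the restart, as a sketch: the scan is abandoned partway through, the position is cleared, and the next /g match begins again at the first word. The names @seen, $mid_scan and $restarted are illustrative.

```perl
use strict;
use warnings;

my $s = "a b c";

# Consume the first two words; the position now sits just after "b".
my @seen;
while ($s =~ /(\w)/g) {
    push @seen, $1;
    last if @seen == 2;
}
my $mid_scan = pos $s;

# Clearing the position makes the next /g match start from offset 0.
pos($s) = undef;
$s =~ /(\w)/g or die "no match";
my $restarted = $1;

print "mid=$mid_scan restarted=$restarted\n";   # mid=3 restarted=a
```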

Edge cases

  • Bare pos targets $_. pos inside while (<>) { ... } therefore reports the position on the current input line.

  • Characters, not bytes. For a string containing multi-byte characters, pos returns the character offset. The (strongly discouraged) use bytes pragma switches to byte offsets; new code should not rely on it.

  • Failed /g match resets the position to undef — the next /g starts over at offset 0. Add the /c modifier (m//gc) to preserve the position on failure, which is the usual idiom when composing several alternative \G-anchored patterns against the same string.

  • Writes during a match are deferred. Assigning to pos changes the stored offset, and with it where \G binds, only for the next match. Expressions like (?{ pos() = 5 }) or s//pos() = 5/e therefore influence the next match, not the one currently running.

  • Zero-length match flag. Setting pos also clears the internal “matched with zero-length” flag, so a subsequent zero-width match at the same position is allowed again. See perlre under Repeated Patterns Matching a Zero-length Substring.

  • Non-lvalue operand. pos requires a real scalar variable for the lvalue form; pos("literal") = 3 is a compile-time error.

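  • Offset 0 vs undef. A pos of 0 means the next /g match starts at the beginning of the string; undef means no position is recorded, in which case the next /g match also starts at the beginning. The two differ in what defined reports, and they can differ when a zero-length match at offset 0 left the position at 0, because the zero-length flag then blocks another empty match there.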
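The failure-reset and /c behaviour described in the list above can be sketched with a tiny tokenizer. This is an illustrative example, not code from the original page: the input string, the token labels (IDENT, NUM, PUNCT) and all variable names are assumptions.

```perl
use strict;
use warnings;

my $input = "x = 42";

# Part 1: a failed /g match without /c discards the position.
$input =~ /\w+/g or die "no match";
my $pos_after_match = pos $input;                  # 1, just past "x"
my $matched_digit   = $input =~ /\G\d/g ? 1 : 0;   # fails: a space follows "x"
my $pos_after_fail  = pos $input;                  # undef again

# Part 2: with /gc a failed alternative leaves the position alone,
# so the next pattern can be tried at the same offset.
my @tokens;
pos($input) = 0;
while (pos($input) < length $input) {
    if    ($input =~ /\G\s+/gc)   { next }                        # skip whitespace
    elsif ($input =~ /\G(\d+)/gc) { push @tokens, "NUM($1)" }
    elsif ($input =~ /\G(\w+)/gc) { push @tokens, "IDENT($1)" }
    elsif ($input =~ /\G(.)/gcs)  { push @tokens, "PUNCT($1)" }
}

print join(" ", @tokens), "\n";   # IDENT(x) PUNCT(=) NUM(42)
```

Without /c on each branch, the first alternative that failed would reset the position to undef and the loop condition would no longer have a valid offset to compare.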

Differences from upstream

Fully compatible with upstream Perl 5.42.

See also

  • m — the /g modifier is what creates and advances the position pos reads; the /c modifier preserves that position when a match fails

  • qr — precompile a pattern once, then reuse it in \G-anchored /g loops without reparsing

  • split — the other everyday way to walk a string in pieces; no pos involved, but often a cleaner choice when the delimiters are simple

  • \G — zero-width anchor that binds to the current pos; the main reason to touch pos in the first place

  • $_ — default target of bare pos