Regular expressions and pattern matching

split

Cut a string into a list of fields using a regex separator.

split scans EXPR for matches of PATTERN and returns the pieces between the matches as a list. The matched text itself is the separator and is not included in the result — unless the pattern contains capture groups, in which case each capture becomes an extra element in the output. In scalar context the list size is returned.

If PATTERN is omitted, split behaves like the awk default: runs of whitespace are the separator and any leading whitespace in EXPR is stripped. If EXPR is omitted, $_ is split.

Synopsis

split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split

What you get back

A list of substrings (fields) in list context; the count of fields in scalar context. The separator text is never part of a field.

my @fields = split /,/, "a,b,c";       # ("a", "b", "c")
my $n      = split /,/, "a,b,c";       # 3

Splitting an EXPR that is the empty string always yields zero fields, regardless of LIMIT:

my @x = split /,/, "", -1;             # ()

Prior to Perl 5.11, split in void or scalar context overwrote @_. Modern Perl does not; never rely on that old side effect.
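
The scalar-context count follows the same trailing-field rules as the list form. The sketch below is illustrative only; the variable names are made up for the example, and it assumes a perl recent enough that scalar-context split no longer writes to @_.

```perl
use strict;
use warnings;

# Scalar context yields the number of fields, not the fields themselves.
my $count      = split /,/, "a,b,c,,,";        # trailing empty fields dropped
my $count_all  = split /,/, "a,b,c,,,", -1;    # trailing empty fields kept
my $count_none = split /,/, "";                # empty input has no fields

print "$count $count_all $count_none\n";       # prints "3 6 0"
```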

The LIMIT argument

LIMIT controls how many fields are produced and, crucially, whether trailing empty fields are kept.

  • Positive LIMIT — upper bound on the number of fields. EXPR is split at most LIMIT - 1 times; the final field holds the rest of the string verbatim. LIMIT = 1 means no splits at all — you get the whole string back as a single element.

    my @x = split /,/, "a,b,c", 1;       # ("a,b,c")
    my @x = split /,/, "a,b,c", 2;       # ("a", "b,c")
    my @x = split /,/, "a,b,c", 3;       # ("a", "b", "c")
    my @x = split /,/, "a,b,c", 4;       # ("a", "b", "c")
    
  • Negative LIMIT — treated as arbitrarily large. As many fields as possible are produced, including all trailing empty fields.

    my @x = split /,/, "a,b,c,,,", -1;   # ("a", "b", "c", "", "", "")
    
  • Omitted or zero LIMIT — like negative, except trailing empty fields are stripped. Leading empty fields are always kept.

    my @x = split /,/, "a,b,c,,,";       # ("a", "b", "c")
    my @x = split /,/, ",,a,b";          # ("", "", "a", "b")
    
  • Implicit LIMIT in list assignment. When assigning to a fixed list, Perl sets LIMIT to one more than the number of targets, so the trailing fields do not need to be scanned. The following gets LIMIT = 3 automatically:

    my ($login, $passwd) = split /:/;
    

    In time-critical code, pass an explicit LIMIT rather than letting the whole string be scanned.
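
To make the implicit LIMIT concrete, here is a small sketch that contrasts it with an explicit LIMIT of 2. The sample record and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $line = "root:x:0:0:root:/root:/bin/sh";   # sample passwd-style record

# Two targets: Perl supplies LIMIT = 3, so everything after the second
# field stays in one unsplit remainder that is never assigned.
my ($login, $passwd) = split /:/, $line;

# Explicit LIMIT = 2: the second target receives the unsplit remainder.
my ($name, $rest) = split /:/, $line, 2;

print "$login $passwd\n";   # prints "root x"
print "$rest\n";            # prints "x:0:0:root:/root:/bin/sh"
```

Passing LIMIT explicitly is the way to keep the remainder when you only want to peel off the first field or two.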

The awk-style whitespace case: split " "

A literal single space as the pattern — split " ", not split / / — triggers a special case: PATTERN is treated as /\s+/ and any leading whitespace in EXPR is stripped before splitting. This matches classic awk behaviour.

my @x = split " ", "  Quick brown fox\n";
# ("Quick", "brown", "fox")

my @x = split " ", "RED\tGREEN\tBLUE";
# ("RED", "GREEN", "BLUE")

To split on a single literal space only, use the regex form / / — it is not special-cased:

my @x = split / /, " abc";             # ("", "abc")

Since Perl 5.18, the trigger is any expression whose value is the single-character string " " (not just a literal), and since Perl 5.28 the rule works correctly under use feature 'unicode_strings'. Under Perl 5.39.9+ the /x default modifier does not affect split STRING, so split " " still means awk-emulation even inside use re "/x". If you want to split on one literal space under use re "/x", write split /(?-x: )/ or split /\x{20}/.
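
As a rough illustration of the run-time trigger, the sketch below keeps the separator in a variable. It assumes Perl 5.18 or later; $sep and $text are names invented for the example.

```perl
use strict;
use warnings;

my $text = "  alpha \t beta\n";

my $sep    = " ";                  # run-time value, not a literal pattern
my @awk    = split $sep, $text;    # awk-style: ("alpha", "beta")
my @single = split / /,  $text;    # regex form: one literal space per split

print scalar(@awk), " ", scalar(@single), "\n";   # prints "2 5"
```

The regex form yields ("", "", "alpha", "\t", "beta\n"): each space is its own separator, and the tab and newline stay inside the fields.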

If PATTERN is omitted entirely, it defaults to " ", so bare split is the awk-style form.

The empty-pattern case: splitting into characters

If PATTERN matches the empty string, the split happens at every match position — i.e. between characters.

my @x = split //, "abc";               # ("a", "b", "c")

As a split-specific rule, the bare match operator // is not the “repeat the last successful match” form here; it is the literal empty pattern. A zero-width match at the very start of EXPR never produces an empty leading field, so a leading space simply becomes the first field:

my @x = split //, " abc";              # (" ", "a", "b", "c")   — 4, not 5
my @x = split //, " abc", -1;          # (" ", "a", "b", "c", "") — trailing empty with -1

A positive-width match at the start does produce an empty leading field:

my @x = split / /, " abc";             # ("", "abc")
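
The same start-of-string rule applies to any zero-width pattern, not only //. A minimal sketch, using a lookahead as the zero-width separator (the input strings are made up for the example):

```perl
use strict;
use warnings;

# Zero-width separator: split before every capital letter. The match at
# position 0 (before "H") does not create an empty leading field.
my @words = split /(?=[A-Z])/, "HelloBigWorld";

# Positive-width separator at position 0: an empty leading field appears.
my @cells = split /,/, ",a,b";

print scalar(@words), " ", scalar(@cells), "\n";   # prints "3 3"
```

Here @words is ("Hello", "Big", "World") and @cells is ("", "a", "b").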

Capture groups become extra elements

If PATTERN contains capturing groups, each capture is inserted into the output list for every separator match, in the order the groups are declared. A group that does not participate in the match contributes undef. These extras do not count toward LIMIT.

my @x = split /-|,/    , "1-10,20", 3;
# ("1", "10", "20")

my @x = split /(-|,)/  , "1-10,20", 3;
# ("1", "-", "10", ",", "20")

my @x = split /-|(,)/  , "1-10,20", 3;
# ("1", undef, "10", ",", "20")

my @x = split /(-)|,/  , "1-10,20", 3;
# ("1", "-", "10", undef, "20")

my @x = split /(-)|(,)/, "1-10,20", 3;
# ("1", "-", undef, "10", undef, ",", "20")

Use this when you want the separators preserved alongside the fields — typical for tokenisers where the delimiters are themselves part of the output stream.
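
A tokeniser along those lines might look like the sketch below. It is one possible approach, not a full lexer: the operator set and the sample expression are assumptions, and grep is used to drop the empty field that appears between two adjacent operators.

```perl
use strict;
use warnings;

my $expr = "12+3*(4-1)";

# Capture each operator or parenthesis so it is returned alongside the
# operands. Adjacent separators ("*" then "(") leave an empty field
# between them, which grep { length } removes.
my @tokens = grep { length } split m{([-+*/()])}, $expr;

print join(" ", @tokens), "\n";   # prints "12 + 3 * ( 4 - 1 )"
```

Because nothing except empty strings is discarded, join "", @tokens reproduces the original expression.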

Global state it touches

  • $_ — EXPR defaults to it when no second argument is given.

  • @_ — modern Perl does not touch it. Only relevant if you support pre-5.11 perls.

  • The regex engine’s match variables ($1, $2, …, $&, $', $`) are not set as a side effect of split, even when PATTERN captures; the captures go into the returned list.

Examples

Classic CSV-style line, no embedded commas:

my @cells = split /,/, "alice,bob,carol";    # ("alice", "bob", "carol")

Parse an /etc/passwd line into known fields; the seven targets give an implicit LIMIT of 8, so splitting stops after at most seven separators:

my ($user, $pw, $uid, $gid, $gecos, $home, $shell)
    = split /:/, $line;

Awk-style whitespace tokenising, leading whitespace discarded:

my @words = split " ", "   one\ttwo  three\n";
# ("one", "two", "three")

Characters of a string — the empty pattern:

my @chars = split //, "hello";               # ("h", "e", "l", "l", "o")

Keep every trailing empty field for a strict column-count reader:

my @cols = split /\t/, $line, -1;

Preserve separators by capturing — round-trip reconstructible:

my @parts = split /([,;])/, "a,b;c";         # ("a", ",", "b", ";", "c")
my $same  = join "", @parts;                 # "a,b;c"

Default form reads $_ and does awk-style whitespace splitting:

for (@lines) {
    my @f = split;                           # split " ", $_
    ...
}

Edge cases

  • Empty input — split /X/, "" returns the empty list for every LIMIT, including -1.

  • LIMIT = 1 — no splitting occurs; the whole string is the one and only field. Useful when LIMIT is computed at run time and a value of 1 should pass the input through unsplit.

  • /^/ pattern — treated as if /^/m so it matches at every line start, which is the only useful behaviour. Splits a string into lines keeping the newlines.

  • Zero-width match at position 0 — never produces an empty leading field (see split //, " abc" above).

  • Positive-width match at position 0 — produces an empty leading field, always.

  • Match at end of string — produces an empty trailing field, which is stripped only when LIMIT is omitted or zero.

  • Patterns with modifiers — the usual qr// modifiers (/i, /m, /s, /x, /u, /a, /l, /n) apply. The pattern need not be a literal; any expression producing a regex or string works, and qr// objects are accepted.

  • Single-space string that is not a literal — from Perl 5.18 on, any expression evaluating to " " triggers the awk-style special case, not just the literal " ".

  • split " " vs split / / — the first is awk-style whitespace (\s+ with leading-ws stripping), the second is a single literal space. The two are not interchangeable.

  • Unicode whitespace — under use feature 'unicode_strings', split " " treats Unicode whitespace as a separator too (this works correctly for the awk-style case since Perl 5.28). Outside that scope, and in earlier versions, the behaviour is affected by the “Unicode Bug”.
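
Two of the cases above, the /^/ pattern and a pre-compiled qr// separator, can be sketched as follows. The sample strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# /^/ is treated as /^/m: split at every line start, newlines kept.
my $text  = "one\ntwo\nthree\n";
my @lines = split /^/, $text;        # ("one\n", "two\n", "three\n")

# A qr// object works as PATTERN; here it also absorbs the whitespace
# around each semicolon.
my $sep   = qr/\s*;\s*/;
my @parts = split $sep, "a ; b;c ;d";

print scalar(@lines), " ", join("|", @parts), "\n";   # prints "3 a|b|c|d"
```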

Differences from upstream

Fully compatible with upstream Perl 5.42.

See also

  • join — inverse operation; stitches a list back into a string with a chosen glue

  • m — the match operator; when you want to find something rather than cut the string around it

  • qr — pre-compile a pattern once for repeated split calls in a hot loop

  • index — locate a fixed substring without building the full list of fields; cheaper when you only need the first split

  • substr — extract by byte/character offset when the field boundaries are positional, not delimited