Identifier length cap (Identifier too long)

Upstream wording

perl5-upstream/pod/perlvar.pod says, in the Syntax of Variable Names section:

Variable names in Perl can have several formats. Usually, they must begin with a letter or underscore, in which case they can be arbitrarily long (up to an internal limit of 251 characters) and may contain letters, digits, underscores, or the special sequence :: or '.

This wording is misleading on two counts.

Correction

The correct statement, as of the Perl 5.42 stable series, is:

Identifiers may be up to 251 bytes long. Names exceeding this length cause a parse-time error: Identifier too long.

Two things to take from this:

  1. Bytes, not characters. The cap is measured in bytes of the source representation, not in Unicode code points. UTF-8 identifiers therefore reach the limit at fewer characters:

    | Char width | Last accepted | First rejected |
    |---|---|---|
    | 1 byte (ASCII) | 251 chars | 252 chars |
    | 2 bytes (e.g. é) | 125 chars | 126 chars |
    | 3 bytes | 83 chars | 84 chars |
    | 4 bytes (e.g. 𐌰, Gothic) | 62 chars | 63 chars |

    In every case the rejection occurs when the byte length first exceeds 251. The “251 characters” in the upstream text is correct only for the ASCII case.

  2. Hard error, not “internal limit”. The phrasing “up to an internal limit” suggests something soft. It is not — perl5’s parser raises a fatal Identifier too long exception. The message is the only thing the user sees; there is no truncation, warning, or fallback.
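
The byte arithmetic in point 1 can be checked with a short sketch. It assumes only the 251-byte cap stated above and UTF-8 encoding. The function name `boundary` is made up for this illustration, and the 3-byte sample character (U+3042) is an assumed example rather than one taken from the table.

```python
# Byte cap for an identifier name in Perl 5.42, as stated in this entry.
CAP_BYTES = 251

def boundary(char: str) -> tuple:
    """Return (last accepted, first rejected) character counts for a name
    built only from `char`, under the 251-byte cap."""
    width = len(char.encode("utf-8"))  # bytes per character in UTF-8
    last_ok = CAP_BYTES // width       # longest name that still fits
    return last_ok, last_ok + 1

# U+00E9 (é) and U+10330 (Gothic ahsa) appear in the table; U+3042 is an
# assumed 3-byte example chosen for this sketch.
for ch in ("a", "\u00e9", "\u3042", "\U00010330"):
    ok, bad = boundary(ch)
    rejected_bytes = len((ch * bad).encode("utf-8"))
    print(len(ch.encode("utf-8")), ok, bad, rejected_bytes)
```

For all four widths the first rejected name is exactly 252 bytes long, which is why the character counts in the table shrink as the per-character width grows.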

The proximate cause in 5.42 is the tokenbuf[256] buffer in the parser state, combined with the bound e = dest_end - 3 (sigil byte + trailing \0 + safety margin) applied inside S_parse_ident. Together these leave 252 bytes in total; subtracting the sigil byte leaves 251 bytes for the identifier name proper.

Upstream blead has already raised the cap

Independently of this erratum, perl5 development (post-5.42) has already raised the cap. Commit 8785c114b5 (“parser.h Allow up to 256 characters in a token”), which first appears in the v5.43.4 development tag, replaces the 256-byte tokenbuf with a 1024-byte buffer. The next stable series (Perl 5.44) is therefore expected to permit identifiers up to roughly 1020 bytes by default. The 251-byte cap is a 5.42-and-earlier characteristic.

p5 runtime

PetaPerl’s p5 runtime is the c2rust transpilation of perl5’s C interpreter. Its identifier-length behavior is:

| Length | perl5 5.42.2 | pperl --p5 |
|---|---|---|
| 251 bytes | accept | accept |
| 252 bytes | reject | accept (divergence) |
| 1019 bytes | reject | accept (divergence) |
| 1020 bytes | reject | reject |
| 1024 bytes | reject | reject |

The divergence is real and visible to user code. The cause is not in the c2rust transpilation itself. The c2rust output is a faithful, line-for-line mirror of the C source it is given. The cause is upstream-checkout drift: perl5-upstream/ is currently pinned to a blead snapshot (post-5.43.4) rather than to a 5.42.x maintenance tag. Blead has the raised tokenbuf — the c2rust mirror inherits it.
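
The accept/reject pattern in the table above can be modelled as a function of the tokenbuf size alone. This is a sketch under one loud assumption: the 5-byte margin is inferred from the two data points in this entry (256 - 251 and 1024 - 1019), not read from the perl5 source. The names `MARGIN` and `accepts` are hypothetical.

```python
# ASSUMPTION: a fixed 5-byte overhead, inferred from this entry's figures
# (cap 251 with a 256-byte buffer, cap 1019 with a 1024-byte buffer).
MARGIN = 5

TOKENBUF_5_42 = 256    # buffer size in the 5.42 stable series
TOKENBUF_BLEAD = 1024  # buffer size after the blead change

def accepts(name_bytes: int, tokenbuf_size: int) -> bool:
    """Model: a name is accepted if it fits in the buffer minus the margin."""
    return name_bytes <= tokenbuf_size - MARGIN

for n in (251, 252, 1019, 1020, 1024):
    stable = "accept" if accepts(n, TOKENBUF_5_42) else "reject"
    p5 = "accept" if accepts(n, TOKENBUF_BLEAD) else "reject"
    note = " (divergence)" if stable != p5 else ""
    print(f"{n} bytes: perl5 5.42.2={stable}, pperl --p5={p5}{note}")
```

If the model holds, every length from 252 to 1019 bytes diverges, not only the two rows sampled in the table.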

This is an instance of a broader category of latent divergences: upstream blead has accumulated 140+ commits to toke.c alone since the 5.42.2 tag, and the same rule (whatever blead does is what pperl --p5 does) applies to every one of them. Other likely-affected areas (parser-visible only):

  • Enhanced qr/.../xx (PPC 0026) lands in blead; the whitespace/comment handling inside character classes differs from 5.42.

  • intuit_more bug fixes (multiple commits): edge cases in $foo[bar] vs $foo{bar} ambiguity resolution behave better in pperl --p5 than in 5.42.

  • S_parse_ident API rewrite: the error position reported by identifier-related parse errors moved between 5.42 and blead.

The remediation is not to patch the c2rust output; it is to pin perl5-upstream/ to v5.42.2 and re-run the transpile pipeline.
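
A guard of the following kind could catch the drift before the transpile pipeline runs. It is a sketch only: it assumes perl5-upstream/ is a git checkout and that the tag is named v5.42.2 as written above. The helpers `is_pinned` and `current_tag` are hypothetical and not part of the project.

```python
import subprocess

EXPECTED_TAG = "v5.42.2"  # maintenance tag named in this entry

def is_pinned(describe_output: str, expected: str = EXPECTED_TAG) -> bool:
    """True if `git describe` output names exactly the expected tag."""
    return describe_output.strip() == expected

def current_tag(checkout: str = "perl5-upstream"):
    """Return the tag that HEAD sits on exactly, or None if there is none.

    ASSUMPTION: `checkout` is a git working tree and git is installed.
    """
    try:
        result = subprocess.run(
            ["git", "-C", checkout, "describe", "--tags", "--exact-match"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # git is not available on this machine
        return None
    return result.stdout.strip() if result.returncode == 0 else None
```

A build step could call `current_tag()` and refuse to transpile unless `is_pinned` returns true for the result. On a blead snapshot `git describe --exact-match` exits non-zero, so `current_tag()` returns None.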

Until that is done, the p5 runtime is — strictly speaking — “perl5 next” rather than “perl5 5.42 stable”. For most user code this is invisible; for the corner cases above, behavior matches blead, not 5.42.

pp runtime

The pp runtime (pperl --pp) uses its own native parser (src/parser/lexer.rs). It does not enforce any cap at all — the identifier scanner loops over identifier characters without a length bound. Names of arbitrary length (10 000 bytes and beyond) are silently accepted.

This is a deliberate divergence. The pp runtime is not bound to upstream perl5’s byte-for-byte conformance; where pp behaves better than upstream — and “no arbitrary 251-byte ceiling” is arguably better — the project keeps that improvement. Note that upstream blead is itself moving in the same direction (the 256→1024 buffer raise), which suggests this concern is widely shared.
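
The difference between a capped and an uncapped scanner can be illustrated in a few lines. This is not the code in src/parser/lexer.rs. It is a hypothetical model, and its character test (underscore or alphanumeric) is a simplification that ignores the :: and ' separators.

```python
def scan_identifier(source: str, start: int = 0, max_bytes=None) -> str:
    """Scan an identifier name starting at `start`.

    With max_bytes=None the loop has no length bound, which models the
    pp behavior described above. With max_bytes=251 it models the
    5.42 cap, measured in UTF-8 bytes.
    """
    end = start
    while end < len(source) and (source[end] == "_" or source[end].isalnum()):
        end += 1
    name = source[start:end]
    if max_bytes is not None and len(name.encode("utf-8")) > max_bytes:
        raise SyntaxError("Identifier too long")
    return name

# Uncapped: a 10 000-byte name is returned whole.
long_name = scan_identifier("a" * 10_000 + " = 1;")
print(len(long_name))
```

Passing `max_bytes=251` to the same call raises `SyntaxError`, which corresponds to the fatal error perl5 5.42 reports.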

Tests

The conformance tests covering this erratum live in:

  • t/01-parsing/090-identifier-length-ascii.t

  • t/01-parsing/091-identifier-length-utf8.t

They probe the boundary at 251 bytes (ASCII and 2/3/4-byte UTF-8) and at the upstream-blead secondary cutoff near 1024 bytes, plus pp-runtime-relevant lengths up to 10 000 bytes. Each subtest encodes a perl5 5.42 expectation; pperl currently fails subsets of the suite as documented above. The failing subtests are the operational tracking record for this errata entry.
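
The kind of probe these subtests exercise can be sketched as a generator of Perl snippets with their perl5 5.42 expectations. The actual contents of the two test files are not shown here, so the `probe` helper, the snippet shape, and the case list are all assumptions made for illustration.

```python
CAP_5_42 = 251  # byte cap for the name proper, per this entry

def probe(char: str, n_chars: int):
    """Build a Perl snippet declaring a variable whose name repeats `char`.

    Returns (perl_source, name_byte_length, expected_accept_under_5_42).
    `use utf8` is needed for perl to read non-ASCII identifiers.
    """
    name = char * n_chars
    n_bytes = len(name.encode("utf-8"))
    source = "use utf8;\nmy $" + name + " = 1;\n"
    return source, n_bytes, n_bytes <= CAP_5_42

# Hypothetical case list: boundaries per width, then the longer lengths
# mentioned above. U+00E9 is é, U+10330 is the Gothic letter.
CASES = [
    ("a", 251), ("a", 252),
    ("\u00e9", 125), ("\u00e9", 126),
    ("\U00010330", 62), ("\U00010330", 63),
    ("a", 1024), ("a", 10_000),
]

for char, n in CASES:
    _, n_bytes, ok = probe(char, n)
    print(n, "chars,", n_bytes, "bytes:", "accept" if ok else "reject")
```

Each generated snippet could be fed to the runtime under test and its outcome compared with the third tuple element.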