---
status: documented
name: Identifier length cap
---

# Identifier length cap (`Identifier too long`)

## Upstream wording

`perl5-upstream/pod/perlvar.pod` says, in the **Syntax of Variable Names** section:

> Variable names in Perl can have several formats. Usually, they
> must begin with a letter or underscore, in which case they can be
> arbitrarily long (up to an internal limit of 251 characters) and
> may contain letters, digits, underscores, or the special sequence
> `::` or `'`.

This wording is misleading on two counts.

## Correction

The correct statement, as of the **Perl 5.42** stable series, is:

> Identifiers may be up to **251 bytes** long. Names exceeding this
> length cause a parse-time error: `Identifier too long`.

Two things to take from this:

1. **Bytes, not characters.** The cap is measured in bytes of the source representation, not in Unicode code points. UTF-8 identifiers therefore reach the limit at fewer characters:

   | Char width                  | Last accepted | First rejected |
   |-----------------------------|---------------|----------------|
   | 1 byte (ASCII)              | 251 chars     | 252 chars      |
   | 2 bytes (e.g. `é`)          | 125 chars     | 126 chars      |
   | 3 bytes (e.g. `ㄚ`)         | 83 chars      | 84 chars       |
   | 4 bytes (e.g. `𐌰`, Gothic) | 62 chars      | 63 chars       |

   In every case the rejection occurs when the byte length first exceeds 251. The "251 characters" in the upstream text is correct only for the ASCII case.

2. **Hard error, not "internal limit".** The phrasing "up to an internal limit" suggests something soft. It is not: perl5's parser raises a fatal `Identifier too long` exception. The message is the only thing the user sees; there is no truncation, warning, or fallback.

The proximate cause in 5.42 is the `tokenbuf[256]` buffer in the parser state, combined with `e = dest_end - 3` (sigil byte + trailing `\0` + safety margin) accounted for inside `S_parse_ident`. The combination yields 252 bytes in total; subtracting the sigil leaves 251 bytes for the identifier name proper.
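The byte-versus-character boundaries in the table above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python (chosen only for the arithmetic, not because the project uses it); `CAP_BYTES`, `fits`, and `boundary` are hypothetical names introduced for illustration, and the sketch assumes the identifier is stored as UTF-8 and that the cap is exactly the 251 bytes stated above. It models the length check only, not perl5's parser.

```python
# Assumption: the cap applies to the UTF-8 byte length of the identifier
# name (without sigil), as described in the Correction section.
CAP_BYTES = 251


def fits(name: str, cap: int = CAP_BYTES) -> bool:
    """Return True if the UTF-8 encoding of `name` is within the byte cap."""
    return len(name.encode("utf-8")) <= cap


def boundary(ch: str, cap: int = CAP_BYTES) -> tuple[int, int]:
    """Return (last accepted, first rejected) repeat counts for character `ch`."""
    width = len(ch.encode("utf-8"))
    last_ok = cap // width
    return last_ok, last_ok + 1


# One sample character per UTF-8 width: 'a', U+00E9, U+311A, U+10330 (Gothic).
for ch in ("a", "\u00e9", "\u311a", "\U00010330"):
    width = len(ch.encode("utf-8"))
    last_ok, first_bad = boundary(ch)
    print(f"{width} byte(s)/char: {last_ok} chars accepted, {first_bad} rejected")
```

For every width the first rejected name is exactly 252 bytes long, which is why the character counts shrink as the per-character width grows.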
### Upstream blead has already raised the cap

Independent of this errata, perl5 development (post-5.42) has already raised the cap. Commit [`8785c114b5`](https://github.com/Perl/perl5/commit/8785c114b5) ("parser.h Allow up to 256 characters in a token"), which first appears in the v5.43.4 development tag, replaces the 256-byte tokenbuf with a 1024-byte buffer. The next stable series (Perl 5.44) is therefore expected to permit identifiers up to roughly 1020 bytes by default. The 251-byte cap is a 5.42-and-earlier characteristic.

## p5 runtime

PetaPerl's p5 runtime is the c2rust transpilation of perl5's C interpreter. Its identifier-length behavior is:

| Length        | perl5 5.42.2 | pperl `--p5`            |
|---------------|--------------|-------------------------|
| 251 bytes     | accept       | accept                  |
| **252 bytes** | **reject**   | **accept** (divergence) |
| 1019 bytes    | reject       | accept (divergence)     |
| 1020 bytes    | reject       | reject                  |
| 1024 bytes    | reject       | reject                  |

The divergence is real and visible to user code.

**The cause is not in the c2rust transpilation itself.** The c2rust output is a faithful, line-for-line mirror of the C source it is given. The cause is upstream-checkout drift: `perl5-upstream/` is currently pinned to a blead snapshot (post-5.43.4) rather than to a 5.42.x maintenance tag. Blead has the raised tokenbuf, and the c2rust mirror inherits it.

This is an instance of a broader category of latent divergences: upstream blead has accumulated 140+ commits to `toke.c` alone since the 5.42.2 tag, and the same arithmetic (*whatever blead does is what pperl `--p5` does*) applies to every one of them. Other likely-affected areas (parser-visible only):

- **Enhanced `qr/.../xx`** ([PPC 0026](https://github.com/Perl/PPCs)) lands in blead; the whitespace/comment handling inside character classes differs from 5.42.
- **`intuit_more` bug fixes** (multiple commits): edge cases in `$foo[bar]` vs `$foo{bar}` ambiguity resolution behave better in pperl `--p5` than in 5.42.
- **`S_parse_ident` API rewrite**: the error position reported by identifier-related parse errors moved between 5.42 and blead.

The remediation is not to patch the c2rust output; it is to pin `perl5-upstream/` to `v5.42.2` and re-run the transpile pipeline. Until that is done, the p5 runtime is, strictly speaking, "perl5 next" rather than "perl5 5.42 stable". For most user code this is invisible; for the corner cases above, behavior matches blead, not 5.42.

## pp runtime

The pp runtime (`pperl --pp`) uses its own native parser (`src/parser/lexer.rs`). It does **not** enforce any cap at all: the identifier scanner loops over identifier characters without a length bound. Names of arbitrary length (10 000 bytes and beyond) are silently accepted.

This is a deliberate divergence. The pp runtime is not bound to upstream perl5's byte-for-byte conformance; where pp behaves *better* than upstream (and "no arbitrary 251-byte ceiling" is arguably better) the project keeps that improvement. Note that upstream blead is itself moving in the same direction (the 256→1024 buffer raise), which suggests this concern is widely shared.

## Tests

The conformance tests covering this errata live in:

- `t/01-parsing/090-identifier-length-ascii.t`
- `t/01-parsing/091-identifier-length-utf8.t`

They probe the boundary at 251 bytes (ASCII and 2/3/4-byte UTF-8) and at the upstream-blead secondary cutoff near 1024 bytes, plus pp-runtime-relevant lengths up to 10 000 bytes. Each subtest encodes a perl5 5.42 expectation; pperl currently fails subsets of the suite as documented above. The failing subtests are the operational tracking record for this errata entry.
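
Which probe lengths are expected to fail under which runtime follows directly from the caps described in the sections above. The sketch below models that in Python as a plain lookup; it is not part of the test suite. The `CAPS` labels and the `accepts` and `diverges` helpers are hypothetical names, and the caps are assumptions read off this page: 251 bytes for perl5 5.42, 1019 bytes as currently observed for `pperl --p5`, and no cap for `pperl --pp`.

```python
from typing import Optional

# Byte caps per implementation, taken from the tables and prose on this page.
# None means "no cap enforced". The keys are illustrative labels only.
CAPS: dict[str, Optional[int]] = {
    "perl5 5.42": 251,
    "pperl --p5": 1019,  # observed behavior of the current blead-based build
    "pperl --pp": None,
}


def accepts(impl: str, length: int) -> bool:
    """Return True if `impl` accepts an identifier of `length` bytes."""
    cap = CAPS[impl]
    return cap is None or length <= cap


def diverges(impl: str, length: int) -> bool:
    """Return True if `impl` disagrees with the perl5 5.42 expectation."""
    return accepts(impl, length) != accepts("perl5 5.42", length)


# Probe lengths mentioned on this page.
for length in (251, 252, 1019, 1020, 1024, 10_000):
    failing = [impl for impl in CAPS if diverges(impl, length)]
    print(f"{length:>6} bytes: diverging runtimes = {failing}")
```

Under these assumptions the p5 runtime diverges only in the 252 to 1019 byte window, while the pp runtime diverges at every length above 251 bytes.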