---
name: Unicode pitfalls
---

# Pitfalls

Most Unicode bugs in Perl programs show up as one of a handful of familiar symptoms. This chapter names each one, shows what produces it, and points to the fix.

## "Wide character in print"

The warning:

```
Wide character in print at script.pl line N.
```

means you sent a text string to a filehandle that is not configured for text. The output gets written anyway — Perl encodes it as UTF-8 on the fly — but the warning is your notice that the configuration is wrong. The fix is to attach an encoding layer to the handle:

```perl
binmode STDOUT, ":encoding(UTF-8)";
```

or, globally for the standard streams:

```perl
use open ":std", ":encoding(UTF-8)";
```

Do not silence the warning with `no warnings 'utf8'`. The warning is correct; it is telling you about a real missing configuration.

## Double encoding: *café* becomes *cafÃ©*

Symptom: text looks almost right, but every non-ASCII character has been replaced by a pair of funny-looking bytes. `café` becomes `cafÃ©`, `Straße` becomes `StraÃŸe`.

Cause: you encoded to UTF-8, then encoded again. The second stage read the already-encoded bytes as if they were text, treated each byte as a Latin-1 code point, and encoded those to UTF-8. Two passes through UTF-8 encoding is the classic double-encode.

Common producers:

- A filehandle with `:encoding(UTF-8)` that receives an already encoded byte string. Fix: send text to the handle, or drop the layer.
- An HTTP response whose body is encoded in the application, and encoded again by a middleware that assumed the body was text.
- A database client that decodes on read, in a codebase that was written before the client learned to decode.

Diagnosis: look at the bytes on the wire. If a single non-ASCII character takes four or more bytes when it should take two, you have been through UTF-8 twice.
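The diagnosis can be reproduced in a few lines. This is a minimal sketch of the double-encode using only the core `Encode` module; the `decode("ISO-8859-1", ...)` call plays the role of the faulty stage that mistook bytes for Latin-1 text:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "caf\x{e9}";              # "café" as a four-character text string
my $once = encode("UTF-8", $text);   # correct: 5 bytes, é is c3 a9

# The double-encode: a later stage reads the bytes as Latin-1 text
# and encodes them to UTF-8 a second time.
my $twice = encode("UTF-8", decode("ISO-8859-1", $once));

printf "once:  %d bytes, %s\n", length $once,  unpack "H*", $once;
printf "twice: %d bytes, %s\n", length $twice, unpack "H*", $twice;
# é now occupies four bytes (c3 83 c2 a9) instead of two, the telltale sign
```

Running this prints `once: 5 bytes` and `twice: 7 bytes`: the single `é` has grown from two bytes to four, exactly the pattern the diagnosis above describes.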
## Missing decode: *café* stays *café* but indexes are wrong

Symptom: prose looks correct on the terminal, but [`length`](../../p5/core/perlfunc/length) gives too many, [`substr`](../../p5/core/perlfunc/substr) cuts in the middle of a character, and a regex like `/^.{5}$/` unexpectedly matches a four-character word.

Cause: you never decoded. The scalar holds UTF-8 bytes, the terminal happens to also render them as UTF-8, and the visual output hides the fact that Perl sees them as bytes.

Fix: decode on the way in, either with [`binmode`](../../p5/core/perlfunc/binmode) on the handle or `Encode::decode` on the string:

```perl
use Encode qw(decode);

my $text = decode("UTF-8", $raw_bytes);
```

Heuristic: if `length $s` is unexpectedly larger than the number of visible characters, you are looking at bytes.

## Mojibake: unknown encoding

Symptom: non-ASCII looks like random line-noise — `café` as `cafÃ©`, as `caf�`, as `caf?`, as a box character. Different visual, different cause each time.

Cause: you decoded with the wrong encoding, or did not decode at all and the terminal guessed wrong. There is no universal fix — you must know what encoding the source actually used. A few reliable anchors:

- Web: the `Content-Type` header's `charset=` parameter, or the HTML `<meta charset>` element for files served without one.
- Files from legacy Windows applications: usually `CP1252` (Western European) or a regional `CP1250`/`CP1251`/`CP1253`.
- Files from classic Mac: `MacRoman`. Rare but survives in old archives.
- Email bodies: the `Content-Type` charset of the MIME part.

If nothing tells you, try UTF-8 first; if that fails with a decode error, try `ISO-8859-1` (which cannot fail but may produce nonsense).

## Mixing text and bytes in concatenation

Symptom: a string that concatenates a decoded text fragment with a raw byte fragment produces garbage after the boundary.
```perl
use utf8;

my $label = "café";          # text
my $raw   = "\xc3\xa9_ok";   # bytes that also look like text
print $label, $raw;          # "Wide character" or mojibake
```

Cause: the text half goes through whatever encoding layer is on the handle; the byte half is reinterpreted as code points by that same layer, with predictable confusion.

Fix: pick a side. Either decode the byte fragment first, or encode the text fragment to match. Never concatenate the two kinds of string without a conscious choice about which side you are on.

## `sprintf "%s"` is not a conversion

[`sprintf`](../../p5/core/perlfunc/sprintf) does not decode or encode; it formats. Passing a byte string through `%s` yields a byte string; passing a text string yields a text string. If you need to change encoding, call `encode` or `decode` explicitly.

## Locale-dependent case folding

Symptom: `lc` or `uc` produces different results on different machines, or fails to round-trip through a regex.

Cause: the program is running under the `/l` locale-mode regex modifier, or [`uc`](../../p5/core/perlfunc/uc) / `lc` have been redirected through a POSIX locale with `use locale`.

Fix: do not use `use locale` for text processing. Unicode case folding is the right answer for text; POSIX locales are a remnant of byte-oriented C I/O. The only modern use of `use locale` is interacting with C libraries that consult `LC_CTYPE`.

## Identifier vs data

Source code that contains non-ASCII characters needs `use utf8`. A program that processes non-ASCII data needs [`binmode`](../../p5/core/perlfunc/binmode) or `:encoding(UTF-8)` layers on its handles. These are separate configurations, and both may be needed.

A short checklist for a new Unicode-aware script:

```perl
use strict;
use warnings;
use utf8;                              # source is UTF-8
use open ":std", ":encoding(UTF-8)";   # stdio + default I/O
```

With those four lines at the top, string literals, standard I/O, and any later [`open`](../../p5/core/perlfunc/open) call already speak text.
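The effect of the checklist on string literals is easy to observe. A minimal sketch, assuming a UTF-8 terminal; with `use utf8` in force, the literal is text, so the string operators count characters:

```perl
use strict;
use warnings;
use utf8;                              # the literal below is text
use open ":std", ":encoding(UTF-8)";   # stdio carries UTF-8

my $word = "café";
print length($word), "\n";   # 4: characters, not bytes
print uc($word), "\n";       # CAFÉ: Unicode-aware case mapping
```

Without `use utf8`, the same literal would be five bytes, `length` would report 5, and `uc` would leave the `é` untouched.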
Remaining Unicode work is narrow and local — decoding a byte buffer from a socket, or attaching `:raw` to a handle that must carry binary data.
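Both of those remaining cases fit in a few lines. A sketch, assuming the buffer `$buf` stands in for bytes read from a socket and an in-memory handle stands in for a binary file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they might arrive from a socket (UTF-8 for "café").
my $buf = "caf\xc3\xa9";

# Decode exactly where the bytes enter the program.
my $text = decode("UTF-8", $buf);
print length($text), "\n";            # 4 characters

# A handle that must carry binary data gets :raw, not :encoding.
open my $fh, ">:raw", \my $out or die "open: $!";
print {$fh} $buf;                     # bytes pass through untouched
close $fh;
```

The decode happens once, at the boundary; everything downstream of it works with text, and the `:raw` handle guarantees the byte buffer is never reinterpreted on the way out.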