Pitfalls#
Most Unicode bugs in Perl programs show up as one of a handful of familiar symptoms. This chapter names each one, shows what produces it, and points to the fix.
“Wide character in print”#
The warning:
Wide character in print at script.pl line N.
means you sent a text string to a filehandle that is not configured for text. The output gets written anyway — Perl encodes it as UTF-8 on the fly — but the warning is your notice that the configuration is wrong.
The fix is to attach an encoding layer to the handle:
binmode STDOUT, ":encoding(UTF-8)";
or, globally for the standard streams:
use open ":std", ":encoding(UTF-8)";
Do not silence the warning with no warnings 'utf8'. The warning
is correct; it is telling you about a real missing configuration.
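A concrete before/after, assuming a UTF-8-capable terminal (the euro sign stands in for any character above U+00FF):

```perl
use strict;
use warnings;

my $text = "caf\x{e9} \x{20ac}5";   # text containing U+00E9 and U+20AC

# Without a layer, printing $text warns "Wide character in print"
# because U+20AC cannot be written as a single byte.
binmode STDOUT, ":encoding(UTF-8)";
print $text, "\n";                  # no warning: the layer encodes on write
```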
Double encoding: café becomes cafÃ©#
Symptom: text looks almost right, but every non-ASCII character has been replaced by a pair of funny-looking characters. café becomes cafÃ©, Straße becomes StraÃŸe.
Cause: you encoded to UTF-8, then encoded again. The second stage read the already-encoded bytes as if they were text, treated each byte as a Latin-1 code point, and encoded those to UTF-8. Two passes through UTF-8 encoding is the classic double-encode.
Common producers:
- A filehandle with :encoding(UTF-8) that receives an already encoded byte string. Fix: send text to the handle, or drop the layer.
- An HTTP response whose body is encoded in the application, and encoded again by a middleware that assumed the body was text.
- A database client that decodes on read, in a codebase that was written before the client learned to decode.
Diagnosis: look at the bytes on the wire. If a single non-ASCII character takes four or more bytes when it should take two, you have been through UTF-8 twice.
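The byte arithmetic can be verified directly with Encode; the second encode treats each byte of the first pass as a Latin-1 code point, exactly as described above:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "caf\x{e9}";             # 4 characters
my $once  = encode("UTF-8", $text);  # correct: 5 bytes, e9 -> c3 a9
my $twice = encode("UTF-8", $once);  # the bug: each byte re-encoded as Latin-1
printf "%d %d\n", length($once), length($twice);   # 5 7
```

The single character é took two bytes after one pass and four after two, which is the wire-level signature the diagnosis paragraph describes.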
Missing decode: café stays café but indexes are wrong#
Symptom: prose looks correct on the terminal, but length reports too many characters, substr cuts in the middle of a character, and a regex like /^.{5}$/ unexpectedly matches a four-character word.
Cause: you never decoded. The scalar holds UTF-8 bytes, the terminal happens to also render them as UTF-8, and the visual output hides the fact that Perl sees them as bytes.
Fix: decode on the way in, either with
binmode on the handle or
Encode::decode on the string:
use Encode qw(decode);
my $text = decode("UTF-8", $raw_bytes);
Heuristic: if length $s is unexpectedly larger than the number of
visible characters, you are looking at bytes.
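The heuristic in miniature: the same five bytes, counted before and after decoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw  = "caf\xc3\xa9";            # UTF-8 bytes, as read from a socket or file
my $text = decode("UTF-8", $raw);

print length($raw),  "\n";   # 5 -- Perl is counting bytes
print length($text), "\n";   # 4 -- after decoding, characters
```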
Mojibake: unknown encoding#
Symptom: non-ASCII looks like random line-noise — café as cafÃ©, as caf?, or as caf followed by a box character. Different visual, different cause each time.
Cause: you decoded with the wrong encoding, or did not decode at all and the terminal guessed wrong.
There is no universal fix — you must know what encoding the source actually used. A few reliable anchors:
- Web: the Content-Type header's charset= parameter, or the HTML <meta charset> element for files served without one.
- Files from legacy Windows applications: usually CP1252 (Western European) or a regional CP1250/CP1251/CP1253.
- Files from classic Mac: MacRoman. Rare, but it survives in old archives.
- Email bodies: the Content-Type charset of the MIME part.
If nothing tells you, try UTF-8 first, decoding with Encode::FB_CROAK so that an invalid sequence raises an error instead of being silently replaced; if it fails, try ISO-8859-1 (which cannot fail but may produce nonsense).
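That fallback chain can be written as a small helper. The name guess_decode is invented here; FB_CROAK makes the strict UTF-8 attempt die on the first bad sequence, and the copy protects the caller's string, which decode may otherwise modify:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict UTF-8 first, Latin-1 as the last resort.
sub guess_decode {
    my ($bytes) = @_;
    my $copy = $bytes;               # decode with a CHECK may modify its argument
    my $text = eval { decode("UTF-8", $copy, Encode::FB_CROAK) };
    return $text if defined $text;
    return decode("ISO-8859-1", $bytes);   # maps every byte; cannot fail
}
```

guess_decode("caf\xc3\xa9") takes the UTF-8 branch; guess_decode("caf\xe9") croaks on the lone \xe9 and falls back to Latin-1, and both return the same four-character string.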
Mixing text and bytes in concatenation#
Symptom: a string that concatenates a decoded text fragment with a raw byte fragment produces garbage after the boundary.
use utf8;
my $label = "café"; # text
my $raw = "\xc3\xa9_ok"; # bytes that also look like text
print $label, $raw; # "Wide character" or mojibake
Cause: the text half goes through whatever encoding layer is on the handle; the byte half is reinterpreted as code points by that same layer, with predictable confusion.
Fix: pick a side. Either decode the byte fragment first, or encode the text fragment to match. Never concatenate the two kinds of string without a conscious choice about which side you are on.
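Both sides of that choice, sketched with the same fragments (the literal is spelled with an escape here so the example does not depend on source encoding):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $label = "caf\x{e9}";       # text: 4 characters ("café")
my $raw   = "\xc3\xa9_ok";     # bytes: UTF-8 for "é_ok"

# Option 1: bring the bytes up to text, keep working in characters.
my $all_text  = $label . decode("UTF-8", $raw);          # 8 characters
# Option 2: push the text down to bytes, keep working in bytes.
my $all_bytes = encode("UTF-8", $label) . $raw;          # 9 bytes
```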
sprintf "%s" is not a conversion#
sprintf does not decode or
encode; it formats. Passing a byte string through %s yields a
byte string; passing a text string yields a text string. If you
need to change encoding, call encode or decode explicitly.
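A quick demonstration that %s preserves the kind of string it was given:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "caf\x{e9}";               # 4 characters
my $bytes = encode("UTF-8", $text);    # 5 bytes

printf "%d\n", length(sprintf "%s!", $text);    # 5 -- still text
printf "%d\n", length(sprintf "%s!", $bytes);   # 6 -- still bytes
```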
Locale-dependent case folding#
Symptom: lc or uc produces different results on different
machines, or fails to round-trip through a regex.
Cause: the program is running under the /l locale-mode regex
modifier, or uc / lc have been
redirected through a POSIX locale with use locale.
Fix: do not use use locale for text processing. Unicode case
folding is the right answer for text; POSIX locales are a remnant
of byte-oriented C I/O. The only modern use of use locale is
interacting with C libraries that consult LC_CTYPE.
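Unicode case handling without use locale, sketched with fc (Perl 5.16+); unicode_strings is enabled so that a string holding only Latin-1-range characters still gets Unicode semantics:

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);

my $word = "stra\x{df}e";          # "straße"
my $up   = uc $word;               # "STRASSE": full Unicode case mapping of ß
print fc($up) eq fc($word) ? "same word\n" : "different\n";
```

Note that uc is not reversible here (ß became SS), which is exactly why case-insensitive comparison should go through fc rather than uc on both sides.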
Identifier vs data#
Source code that contains non-ASCII characters needs use utf8. A
program that processes non-ASCII data needs
binmode or
:encoding(UTF-8) layers on its handles. These are separate
configurations, and both may be needed.
A short checklist for a new Unicode-aware script:
use strict;
use warnings;
use utf8; # source is UTF-8
use open ":std", ":encoding(UTF-8)"; # stdio + default I/O
With those four lines at the top, string literals, standard I/O,
and any later open call already
speak text. Remaining Unicode work is narrow and local — decoding a
byte buffer from a socket, or attaching :raw to a handle that
must carry binary data.
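A round-trip under that four-line preamble, using a scratch file whose name is invented for the demo; the point is that a later plain open inherits the :encoding(UTF-8) default from use open:

```perl
use strict;
use warnings;
use open ":std", ":encoding(UTF-8)";

my $path = "unicode-demo.txt";               # hypothetical scratch file
open my $out, ">", $path or die "write $path: $!";
print {$out} "caf\x{e9}\n";                  # text out, encoded by the layer
close $out;

open my $in, "<", $path or die "read $path: $!";
chomp(my $line = <$in>);
close $in;
unlink $path;

print length($line), "\n";   # 4 -- characters came back intact, not 5 bytes
```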