Pitfalls#
Most Unicode bugs in Perl programs show up as one of a handful of familiar symptoms. This chapter names each one, shows what produces it, and points to the fix.
“Wide character in print”#
The warning:
Wide character in print at script.pl line N.
means you sent a text string to a filehandle that is not configured for text. The output gets written anyway — Perl encodes it as UTF-8 on the fly — but the warning is your notice that the configuration is wrong.
The fix is to attach an encoding layer to the handle:
binmode STDOUT, ":encoding(UTF-8)";
or, globally for the standard streams:
use open ":std", ":encoding(UTF-8)";
Do not silence the warning with no warnings 'utf8'. The warning
is correct; it is telling you about a real missing configuration.
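A concrete before/after, assuming a UTF-8-capable terminal (the euro sign stands in for any character above U+00FF):

```perl
use strict;
use warnings;

my $text = "caf\x{e9} \x{20ac}5";   # text containing U+00E9 and U+20AC

# Without a layer, printing $text warns "Wide character in print"
# because U+20AC cannot be written as a single byte.
binmode STDOUT, ":encoding(UTF-8)";
print $text, "\n";                  # no warning: the layer encodes on write
```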
Double encoding: café becomes cafÃ©#
Symptom: text looks almost right, but every non-ASCII character has been replaced by a pair of funny-looking characters. café becomes cafÃ©, Straße becomes StraÃŸe.
Cause: you encoded to UTF-8, then encoded again. The second stage read the already-encoded bytes as if they were text, treated each byte as a Latin-1 code point, and encoded those to UTF-8. Two passes through UTF-8 encoding is the classic double-encode.
Common producers:
- A filehandle with :encoding(UTF-8) that receives an already encoded byte string. Fix: send text to the handle, or drop the layer.
- An HTTP response whose body is encoded in the application, and encoded again by a middleware that assumed the body was text.
- A database client that decodes on read, in a codebase that was written before the client learned to decode.
Diagnosis: look at the bytes on the wire. If a single non-ASCII character takes four or more bytes when it should take two, you have been through UTF-8 twice.
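The byte arithmetic can be verified directly with Encode; the second encode treats each byte of the first pass as a Latin-1 code point, exactly as described above:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "caf\x{e9}";             # 4 characters
my $once  = encode("UTF-8", $text);  # correct: 5 bytes, e9 -> c3 a9
my $twice = encode("UTF-8", $once);  # the bug: each byte re-encoded as Latin-1
printf "%d %d\n", length($once), length($twice);   # 5 7
```

The single character é took two bytes after one pass and four after two, which is the wire-level signature the diagnosis paragraph describes.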
Missing decode: café stays café but indexes are wrong#
Symptom: prose looks correct on the terminal, but length reports too many characters, substr cuts in the middle of a character, and a regex like /^.{5}$/ unexpectedly matches a four-character word.
Cause: you never decoded. The scalar holds UTF-8 bytes, the terminal happens to also render them as UTF-8, and the visual output hides the fact that Perl sees them as bytes.
Fix: decode on the way in, either with
binmode on the handle or
Encode::decode on the string:
use Encode qw(decode);
my $text = decode("UTF-8", $raw_bytes);
Heuristic: if length $s is unexpectedly larger than the number of
visible characters, you are looking at bytes.
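The heuristic in miniature: the same five bytes, counted before and after decoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw  = "caf\xc3\xa9";            # UTF-8 bytes, as read from a socket or file
my $text = decode("UTF-8", $raw);

print length($raw),  "\n";   # 5 -- Perl is counting bytes
print length($text), "\n";   # 4 -- after decoding, characters
```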
Mojibake: unknown encoding#
Symptom: non-ASCII looks like random line-noise — café as cafÃ©, as caf?, or as caf followed by a box character. Different visual, different cause each time.
Cause: you decoded with the wrong encoding, or did not decode at all and the terminal guessed wrong.
There is no universal fix — you must know what encoding the source actually used. A few reliable anchors:
- Web: the Content-Type header's charset= parameter, or the HTML <meta charset> element for files served without one.
- Files from legacy Windows applications: usually CP1252 (Western European) or a regional CP1250/CP1251/CP1253.
- Files from classic Mac: MacRoman. Rare, but it survives in old archives.
- Email bodies: the Content-Type charset of the MIME part.
If nothing tells you, try UTF-8 first, decoding with Encode::FB_CROAK so that an invalid sequence raises an error instead of being silently replaced; if it fails, try ISO-8859-1 (which cannot fail but may produce nonsense).
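That fallback chain can be written as a small helper. The name guess_decode is invented here; FB_CROAK makes the strict UTF-8 attempt die on the first bad sequence, and the copy protects the caller's string, which decode may otherwise modify:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict UTF-8 first, Latin-1 as the last resort.
sub guess_decode {
    my ($bytes) = @_;
    my $copy = $bytes;               # decode with a CHECK may modify its argument
    my $text = eval { decode("UTF-8", $copy, Encode::FB_CROAK) };
    return $text if defined $text;
    return decode("ISO-8859-1", $bytes);   # maps every byte; cannot fail
}
```

guess_decode("caf\xc3\xa9") takes the UTF-8 branch; guess_decode("caf\xe9") croaks on the lone \xe9 and falls back to Latin-1, and both return the same four-character string.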
Mixing text and bytes in concatenation#
Symptom: a string that concatenates a decoded text fragment with a raw byte fragment produces garbage after the boundary.
use utf8;
my $label = "café"; # text
my $raw = "\xc3\xa9_ok"; # bytes that also look like text
print $label, $raw; # "Wide character" or mojibake
Cause: the text half goes through whatever encoding layer is on the handle; the byte half is reinterpreted as code points by that same layer, with predictable confusion.
Fix: pick a side. Either decode the byte fragment first, or encode the text fragment to match. Never concatenate the two kinds of string without a conscious choice about which side you are on.
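Both sides of that choice, sketched with the same fragments (the literal is spelled with an escape here so the example does not depend on source encoding):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $label = "caf\x{e9}";       # text: 4 characters ("café")
my $raw   = "\xc3\xa9_ok";     # bytes: UTF-8 for "é_ok"

# Option 1: bring the bytes up to text, keep working in characters.
my $all_text  = $label . decode("UTF-8", $raw);          # 8 characters
# Option 2: push the text down to bytes, keep working in bytes.
my $all_bytes = encode("UTF-8", $label) . $raw;          # 9 bytes
```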
sprintf "%s" is not a conversion#
sprintf does not decode or
encode; it formats. Passing a byte string through %s yields a
byte string; passing a text string yields a text string. If you
need to change encoding, call encode or decode explicitly.
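A quick demonstration that %s preserves the kind of string it was given:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "caf\x{e9}";               # 4 characters
my $bytes = encode("UTF-8", $text);    # 5 bytes

printf "%d\n", length(sprintf "%s!", $text);    # 5 -- still text
printf "%d\n", length(sprintf "%s!", $bytes);   # 6 -- still bytes
```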
Locale-dependent case folding#
Symptom: lc or uc produces different results on different
machines, or fails to round-trip through a regex.
Cause: the program is running under the /l locale-mode regex
modifier, or uc / lc have been
redirected through a POSIX locale with use locale.
Fix: do not use use locale for text processing. Unicode case
folding is the right answer for text; POSIX locales are a remnant
of byte-oriented C I/O. The only modern use of use locale is
interacting with C libraries that consult LC_CTYPE.
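Unicode case handling without use locale, sketched with fc (Perl 5.16+); unicode_strings is enabled so that a string holding only Latin-1-range characters still gets Unicode semantics:

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);

my $word = "stra\x{df}e";          # "straße"
my $up   = uc $word;               # "STRASSE": full Unicode case mapping of ß
print fc($up) eq fc($word) ? "same word\n" : "different\n";
```

Note that uc is not reversible here (ß became SS), which is exactly why case-insensitive comparison should go through fc rather than uc on both sides.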
Identifier vs data#
Source code that contains non-ASCII characters needs use utf8. A
program that processes non-ASCII data needs
binmode or
:encoding(UTF-8) layers on its handles. These are separate
configurations, and both may be needed.
A short checklist for a new Unicode-aware script:
use strict;
use warnings;
use utf8; # source is UTF-8
use open ":std", ":encoding(UTF-8)"; # stdio + default I/O
With those four lines at the top, string literals, standard I/O,
and any later open call already
speak text. Remaining Unicode work is narrow and local — decoding a
byte buffer from a socket, or attaching :raw to a handle that
must carry binary data.
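A round-trip under that four-line preamble, using a scratch file whose name is invented for the demo; the point is that a later plain open inherits the :encoding(UTF-8) default from use open:

```perl
use strict;
use warnings;
use open ":std", ":encoding(UTF-8)";

my $path = "unicode-demo.txt";               # hypothetical scratch file
open my $out, ">", $path or die "write $path: $!";
print {$out} "caf\x{e9}\n";                  # text out, encoded by the layer
close $out;

open my $in, "<", $path or die "read $path: $!";
chomp(my $line = <$in>);
close $in;
unlink $path;

print length($line), "\n";   # 4 -- characters came back intact, not 5 bytes
```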