---
name: Unicode pitfalls
---

# Pitfalls

Most Unicode bugs in Perl programs show up as one of a handful of familiar symptoms. This chapter names each one, shows what produces it, and points to the fix.

## "Wide character in print"

The warning:

```
Wide character in print at script.pl line N.
```

means you sent a text string to a filehandle that is not configured for text. The output gets written anyway — Perl encodes it as UTF-8 on the fly — but the warning is your notice that the configuration is wrong. The fix is to attach an encoding layer to the handle:

```perl
binmode STDOUT, ":encoding(UTF-8)";
```

or, globally for the standard streams:

```perl
use open ":std", ":encoding(UTF-8)";
```

Do not silence the warning with `no warnings 'utf8'`. The warning is correct; it is telling you about a real missing configuration.

## Double encoding: *café* becomes *cafÃ©*

Symptom: text looks almost right, but every non-ASCII character has been replaced by a pair of funny-looking bytes. `café` becomes `cafÃ©`, `Straße` becomes `StraÃŸe`.

Cause: you encoded to UTF-8, then encoded again. The second stage read the already-encoded bytes as if they were text, treated each byte as a Latin-1 code point, and encoded those to UTF-8. Two passes through UTF-8 encoding is the classic double-encode.

Common producers:

- A filehandle with `:encoding(UTF-8)` that receives an already encoded byte string. Fix: send text to the handle, or drop the layer.
- An HTTP response whose body is encoded in the application, and encoded again by a middleware that assumed the body was text.
- A database client that decodes on read, in a codebase that was written before the client learned to decode.

Diagnosis: look at the bytes on the wire. If a single non-ASCII character takes four or more bytes when it should take two, you have been through UTF-8 twice.
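The diagnosis can be reproduced in a few lines. This is a minimal sketch of the double-encode using only the core `Encode` module; the `decode("ISO-8859-1", ...)` call plays the role of the faulty stage that mistook bytes for Latin-1 text:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "caf\x{e9}";              # "café" as a four-character text string
my $once = encode("UTF-8", $text);   # correct: 5 bytes, é is c3 a9

# The double-encode: a later stage reads the bytes as Latin-1 text
# and encodes them to UTF-8 a second time.
my $twice = encode("UTF-8", decode("ISO-8859-1", $once));

printf "once:  %d bytes, %s\n", length $once,  unpack "H*", $once;
printf "twice: %d bytes, %s\n", length $twice, unpack "H*", $twice;
# é now occupies four bytes (c3 83 c2 a9) instead of two, the telltale sign
```

Running this prints `once: 5 bytes` and `twice: 7 bytes`: the single `é` has grown from two bytes to four, exactly the pattern the diagnosis above describes.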
## Missing decode: *café* stays *café* but indexes are wrong

Symptom: prose looks correct on the terminal, but [`length`](../../p5/core/perlfunc/length) gives too many, [`substr`](../../p5/core/perlfunc/substr) cuts in the middle of a character, and a regex like `/^.{5}$/` unexpectedly matches a four-character word.

Cause: you never decoded. The scalar holds UTF-8 bytes, the terminal happens to also render them as UTF-8, and the visual output hides the fact that Perl sees them as bytes.

Fix: decode on the way in, either with [`binmode`](../../p5/core/perlfunc/binmode) on the handle or `Encode::decode` on the string:

```perl
use Encode qw(decode);

my $text = decode("UTF-8", $raw_bytes);
```

Heuristic: if `length $s` is unexpectedly larger than the number of visible characters, you are looking at bytes.

## Mojibake: unknown encoding

Symptom: non-ASCII looks like random line-noise — `café` as `cafÃ©`, as `caf�`, as `caf?`, as a box character. Different visual, different cause each time.

Cause: you decoded with the wrong encoding, or did not decode at all and the terminal guessed wrong. There is no universal fix — you must know what encoding the source actually used. A few reliable anchors:

- Web: the `Content-Type` header's `charset=` parameter, or the HTML `<meta charset>` element for files served without one.
- Files from legacy Windows applications: usually `CP1252` (Western European) or a regional `CP1250`/`CP1251`/`CP1253`.
- Files from classic Mac: `MacRoman`. Rare but survives in old archives.
- Email bodies: the `Content-Type` charset of the MIME part.

If nothing tells you, try UTF-8 first; if that fails with a decode error, try `ISO-8859-1` (which cannot fail but may produce nonsense).

## Mixing text and bytes in concatenation

Symptom: a string that concatenates a decoded text fragment with a raw byte fragment produces garbage after the boundary.
```perl
use utf8;

my $label = "café";          # text
my $raw   = "\xc3\xa9_ok";   # bytes that also look like text
print $label, $raw;          # "Wide character" or mojibake
```

Cause: the text half goes through whatever encoding layer is on the handle; the byte half is reinterpreted as code points by that same layer, with predictable confusion.

Fix: pick a side. Either decode the byte fragment first, or encode the text fragment to match. Never concatenate the two kinds of string without a conscious choice about which side you are on.

## `sprintf "%s"` is not a conversion

[`sprintf`](../../p5/core/perlfunc/sprintf) does not decode or encode; it formats. Passing a byte string through `%s` yields a byte string; passing a text string yields a text string. If you need to change encoding, call `encode` or `decode` explicitly.

## Locale-dependent case folding

Symptom: `lc` or `uc` produces different results on different machines, or fails to round-trip through a regex.

Cause: the program is running under the `/l` locale-mode regex modifier, or [`uc`](../../p5/core/perlfunc/uc) / `lc` have been redirected through a POSIX locale with `use locale`.

Fix: do not use `use locale` for text processing. Unicode case folding is the right answer for text; POSIX locales are a remnant of byte-oriented C I/O. The only modern use of `use locale` is interacting with C libraries that consult `LC_CTYPE`.

## Identifier vs data

Source code that contains non-ASCII characters needs `use utf8`. A program that processes non-ASCII data needs [`binmode`](../../p5/core/perlfunc/binmode) or `:encoding(UTF-8)` layers on its handles. These are separate configurations, and both may be needed.

A short checklist for a new Unicode-aware script:

```perl
use strict;
use warnings;
use utf8;                              # source is UTF-8
use open ":std", ":encoding(UTF-8)";   # stdio + default I/O
```

With those four lines at the top, string literals, standard I/O, and any later [`open`](../../p5/core/perlfunc/open) call already speak text.
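The effect of the checklist on string literals is easy to observe. A minimal sketch, assuming a UTF-8 terminal; with `use utf8` in force, the literal is text, so the string operators count characters:

```perl
use strict;
use warnings;
use utf8;                              # the literal below is text
use open ":std", ":encoding(UTF-8)";   # stdio carries UTF-8

my $word = "café";
print length($word), "\n";   # 4: characters, not bytes
print uc($word), "\n";       # CAFÉ: Unicode-aware case mapping
```

Without `use utf8`, the same literal would be five bytes, `length` would report 5, and `uc` would leave the `é` untouched.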
Remaining Unicode work is narrow and local — decoding a byte buffer from a socket, or attaching `:raw` to a handle that must carry binary data.
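Both of those remaining cases fit in a few lines. A sketch, assuming the buffer `$buf` stands in for bytes read from a socket and an in-memory handle stands in for a binary file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they might arrive from a socket (UTF-8 for "café").
my $buf = "caf\xc3\xa9";

# Decode exactly where the bytes enter the program.
my $text = decode("UTF-8", $buf);
print length($text), "\n";            # 4 characters

# A handle that must carry binary data gets :raw, not :encoding.
open my $fh, ">:raw", \my $out or die "open: $!";
print {$fh} $buf;                     # bytes pass through untouched
close $fh;
```

The decode happens once, at the boundary; everything downstream of it works with text, and the `:raw` handle guarantees the byte buffer is never reinterpreted on the way out.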