---
name: Strings and encodings
---
# Strings and encodings

A Perl string holds a sequence of integers. What those integers
*mean* is up to you. In a text string they are Unicode code points;
in a binary string they are bytes. Perl does not stamp "this is
text" on a scalar — you carry that knowledge with the data, and the
boundary between the two worlds is drawn by `encode` and `decode`.

## Text in source code: `use utf8`

By default, Perl reads your source file as a sequence of bytes. In a
file saved as UTF-8, a string literal containing `é` is therefore the
two-byte string `"\xc3\xa9"`, not the one-character string
`"\x{e9}"`. That is almost never what you want when you typed `é` in
an editor set to UTF-8.

`use utf8` tells the parser that the source file is UTF-8 and that
literals should be decoded as they are read:

```perl
use utf8;
my $greeting = "café";
length $greeting;                 # 4  — characters
```

Without `use utf8`, the same line produces:

```perl
my $greeting = "café";
length $greeting;                 # 5  — bytes
```

Rule of thumb: if your source file contains any non-ASCII character,
write `use utf8` at the top and save the file as UTF-8. The pragma
only governs the source file; it has nothing to do with I/O.
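Because the pragma is lexically scoped, both readings of the same literal can be compared in one file. The following is a minimal sketch, assuming the file itself is saved as UTF-8; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Same literal, parsed twice: once with the pragma on, once with it off.
my $decoded = do { use utf8; "é" };   # parser decodes the UTF-8 literal
my $raw     = do { no utf8;  "é" };   # parser keeps the raw source bytes

printf "with utf8:    length %d, first ord 0x%X\n",
    length($decoded), ord($decoded);  # length 1, 0xE9
printf "without utf8: length %d, first ord 0x%X\n",
    length($raw), ord($raw);          # length 2, 0xC3
```

Neither line prints the string itself, only numbers derived from it, so no output encoding is involved.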

## The internal representation

Perl keeps text strings in a representation of its own choosing. You
do not need to know what it is, and you must not depend on it. What
you can rely on is this:

- A text string is a sequence of code points. [`length`](../../p5/core/perlfunc/length)
  counts them, [`substr`](../../p5/core/perlfunc/substr) indexes
  them, [`chr`](../../p5/core/perlfunc/chr) and
  [`ord`](../../p5/core/perlfunc/ord) map between characters and
  their code points.
- A binary string is a sequence of bytes. The same functions now
  count and index bytes.
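The difference shows up as soon as a string holds a character outside ASCII. This sketch, with an arbitrarily chosen sample string, runs the same functions over a text string and over its UTF-8 encoded bytes:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "\x{263a}!";               # WHITE SMILING FACE followed by "!"
my $bytes = encode("UTF-8", $text);    # the bytes E2 98 BA 21

# Text string: units are code points.
printf "text:  length %d, first unit 0x%X\n",
    length($text), ord(substr($text, 0, 1));     # 2, 0x263A

# Binary string: units are bytes.
printf "bytes: length %d, first unit 0x%X\n",
    length($bytes), ord(substr($bytes, 0, 1));   # 4, 0xE2
```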

The `utf8` pragma controls the *source*. There is also a `utf8::`
namespace of functions for probing and forcing the internal flag —
they exist, they are occasionally useful, and in normal application
code you should not need them. Stick to `encode` and `decode`.

## The `Encode` module

The standard way to convert between text and bytes is the `Encode`
module:

```perl
use Encode qw(encode decode);

my $text  = "caf\x{e9}";                  # text: 4 characters
my $bytes = encode("UTF-8", $text);       # text  → bytes (5 of them)
my $again = decode("UTF-8", $bytes);      # bytes → text, equal to $text
```

The first argument names the encoding. UTF-8 is the obvious default
for new code, but `Encode` supports the `ISO-8859-*` family, every
common Windows code page, Shift-JIS, EUC-JP, EUC-KR, GB2312, GBK,
Big5, and dozens more.

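Decoding is only reliable if you know which encoding the bytes are
in. Nothing in the bytes themselves tells you; that information
travels in a `Content-Type` header, a file's BOM, a protocol's
metadata field, or a convention you have agreed with the producer.
If you truly do not know, try a strict UTF-8 decode first. By default
`decode` replaces malformed sequences with U+FFFD instead of failing,
so pass `Encode::FB_CROAK` as the third argument to make it die. If
that fails, fall back on `ISO-8859-1`, which never fails because
every byte maps to a code point in that encoding, though the result
is only meaningful if the data really was Latin-1.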
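One way to sketch the guess-then-fall-back strategy is shown below. `decode_guess` is a hypothetical helper name, not part of `Encode`. The sketch asks for strict checking with `FB_CROAK` so that invalid UTF-8 raises an exception, and adds `LEAVE_SRC` so the input is left untouched for the fallback.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strict UTF-8 first, ISO-8859-1 as a last resort.
sub decode_guess {
    my ($bytes) = @_;

    # FB_CROAK makes decode die on malformed input; eval catches that.
    my $text = eval { decode("UTF-8", $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    # Every byte maps to a code point here, so this cannot fail.
    return decode("ISO-8859-1", $bytes);
}

my $from_utf8   = decode_guess("caf\xc3\xa9");   # valid UTF-8
my $from_latin1 = decode_guess("caf\xe9");       # not valid UTF-8
```

Both calls return the four-character string `"caf\x{e9}"`. The fallback is still a guess: bytes in some other single-byte encoding will decode without error but to the wrong characters.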

### Lossy encodings

Not every Unicode character survives every encoding. Converting
`"caf\x{e9}"` to `ISO-8859-1` works: `é` has code point `0xe9`,
which that encoding can hold. Converting an em-dash `\x{2014}` to
`ISO-8859-1` does not: the character has no place in the target, and
by default `encode` silently replaces it with a substitution
character (`?` for `ISO-8859-1`).

```perl
use Encode qw(encode);
my $s = "em\x{2014}dash";
my $b = encode("ISO-8859-1", $s);          # "em?dash"
```

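Pass a third argument to `encode` to change that behaviour, for
example `Encode::FB_CROAK` to die on the first unmappable character;
see the `Encode` module documentation for the full list of
`Encode::FB_*` constants. For most application code, the default
substitution is the right compromise, but it loses data silently, so
prefer a stricter mode where the bytes must round-trip.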
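As an illustration of the third argument, the sketch below tries three of the documented fallback modes on the same string. The variable names are illustrative, and `LEAVE_SRC` is added so that `encode` does not modify `$s` in place while checking.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK FB_PERLQQ FB_HTMLCREF LEAVE_SRC);

my $s = "em\x{2014}dash";

# Escape the unmappable character instead of replacing it with "?".
my $perlqq = encode("ISO-8859-1", $s, FB_PERLQQ   | LEAVE_SRC);  # em\x{2014}dash
my $html   = encode("ISO-8859-1", $s, FB_HTMLCREF | LEAVE_SRC);  # em&#8212;dash

# FB_CROAK turns the unmappable character into an exception.
my $strict = eval { encode("ISO-8859-1", $s, FB_CROAK | LEAVE_SRC) };
print "strict encode refused: $@" unless defined $strict;
```

The escaped forms keep the information recoverable, whereas the strict form lets the caller decide what to do when the data does not fit.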

## A complete round-trip

Putting decoding, processing, and encoding together:

```perl
use utf8;
use Encode qw(encode decode);

# bytes arrive from somewhere — HTTP, a file, a socket
my $input_bytes = "caf\xc3\xa9\n";

# step 1: decode to text
my $text = decode("UTF-8", $input_bytes);  # "café\n"

# step 2: process as text
chomp $text;
my $shouty = uc $text;                     # "CAFÉ"

# step 3: encode for output
my $output_bytes = encode("UTF-8", $shouty . "\n");
```

The text layer is short (a `chomp` and a `uc`), but that is the layer
where [`length`](../../p5/core/perlfunc/length), regex, case
conversion, and [`chomp`](../../p5/core/perlfunc/chomp) all work in
the units the program means.

## When you want I/O to do this automatically

Doing `decode` on every read and `encode` on every write quickly
becomes tedious. The next chapter — [I/O](io) — shows how to push
the conversion into the filehandle itself with the `:encoding`
layer, so ordinary [`print`](../../p5/core/perlfunc/print) and
[`readline`](../../p5/core/perlfunc/readline) already speak text.
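As a brief preview, here is the same round-trip as above with the conversion moved into `:encoding(UTF-8)` layers. This is a sketch only: it uses in-memory filehandles (opened on scalar references) so that it is self-contained, whereas real code would open files or sockets.

```perl
use strict;
use warnings;

my $input_bytes  = "caf\xc3\xa9\n";
my $output_bytes = '';

# In-memory handles stand in for a real file or socket.
open my $in,  '<:encoding(UTF-8)', \$input_bytes  or die "open in: $!";
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open out: $!";

my $line = readline $in;         # decoded by the layer
chomp $line;                     # 4 characters of text
print {$out} uc($line), "\n";    # encoded by the layer on the way out

close $out or die "close out: $!";
close $in;
```

After `close`, `$output_bytes` holds the UTF-8 bytes for `CAFÉ` followed by a newline, with no explicit `encode` or `decode` call in the program.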