native_decode#

Turn a byte string in the named encoding into a Perl character string.

Synopsis#

my $string = decode($encoding, $bytes);
my $string = decode($encoding, $bytes, $check);

What you get back#

A scalar holding a Perl character string, with the SVf_UTF8 flag set on the result. Each element of that string is one Unicode code point, so substr and length operate on characters, not bytes.
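As a minimal sketch of the difference between bytes and characters, the snippet below compares the input with the decoded result. It assumes decode is imported from Encode as in the synopsis; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes  = "caf\xc3\xa9";              # five bytes of UTF-8
my $string = decode('UTF-8', $bytes);    # four characters

# length counts bytes on the input and characters on the result.
printf "bytes:  length %d\n", length($bytes);
printf "string: length %d, last char U+%04X\n",
    length($string), ord(substr($string, -1));

# utf8::is_utf8 reports whether the SVf_UTF8 flag is set on a scalar.
print "flag set on result\n" if utf8::is_utf8($string);
```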

If $encoding is unknown, decode croaks with Unknown encoding '...'.
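Because an unknown encoding is fatal, callers that accept encoding names from outside may want to trap the error. A hedged sketch using a plain eval block follows; the encoding name is deliberately made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 'no-such-encoding' is a hypothetical name chosen to trigger the croak.
my $string = eval { decode('no-such-encoding', "abc") };
my $error  = $@;

print "decode failed: $error" unless defined $string;
```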

The optional $check argument controls what happens when the bytes are not valid in the source encoding:

  • FB_DEFAULT (0, the default) — substitute invalid sequences with U+FFFD (the Unicode replacement character).

  • FB_CROAK — die on the first invalid byte.

  • FB_QUIET — decode the valid prefix and stop. In the method form, the consumed prefix is also removed from $bytes.

  • FB_WARN — like FB_QUIET, but also emits a warning when decoding stops at the invalid sequence.
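The following sketch contrasts the default substitution with FB_QUIET. It assumes the behaviour matches upstream Encode, and it uses the method form via find_encoding for the FB_QUIET case, since that is where the source buffer is documented to be trimmed.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_QUIET);

# Default check: the invalid byte \xff becomes U+FFFD and decoding continues.
my $lossy = decode('UTF-8', "a\xffb");
printf "default: %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $lossy;

# FB_QUIET, method form: decoding stops at the invalid byte and the
# consumed prefix is removed from the buffer.
my $buffer = "ab\xffcd";
my $prefix = find_encoding('UTF-8')->decode($buffer, FB_QUIET);
printf "quiet:   decoded %d chars, %d bytes left in buffer\n",
    length($prefix), length($buffer);
```

Checking whether anything is left in the buffer after an FB_QUIET call is one way to detect that decoding stopped early.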

Examples#

Decode UTF-8 bytes, such as those read from a file:

use Encode qw(decode);
my $string = decode('UTF-8', "caf\xc3\xa9");

## length($string) == 4, fourth char is U+00E9

Decode CP1252 bytes, turning Windows “smart quotes” into Unicode:

my $string = decode('cp1252', "\x93hi\x94");

## $string is "\x{201c}hi\x{201d}"

Die on malformed UTF-8:

use Encode qw(decode FB_CROAK);
my $string = decode('UTF-8', "\xc3\x28", FB_CROAK);

## dies: utf8 "\xC3" does not map to Unicode

Edge cases#

  • undef input returns an empty character string.

  • Encoding "null" passes input bytes through unchanged.

  • A byte string that is already valid UTF-8 is re-tagged with SVf_UTF8 without reallocating.

Differences from upstream#

Fully compatible with upstream for ASCII, Latin-1, CP1252, the ISO-8859 family, and UTF-8. Covered by t/81-xs-native/Encode/010-basic.t and t/81-xs-native/Encode/090-decode-inplace.t.

See also#

  • encode — the inverse, string to bytes.

  • decode_utf8 — the UTF-8-only fast path.

  • from_to — reencode in place without materialising a character string in between.
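To show how these functions relate to decode, here is a short round-trip sketch. It assumes the upstream Encode signatures for encode, decode_utf8 and from_to; whether the native versions behave identically for every encoding is not confirmed here.

```perl
use strict;
use warnings;
use Encode qw(decode encode decode_utf8 from_to);

my $bytes  = "caf\xc3\xa9";
my $string = decode('UTF-8', $bytes);

# encode is the inverse: the same characters, back to bytes.
my $again  = encode('UTF-8', $string);
my $latin1 = encode('ISO-8859-1', $string);

# decode_utf8 gives the same result as decode('UTF-8', ...) for valid input.
my $same = decode_utf8($bytes) eq $string;

# from_to converts a byte buffer from one encoding to another in place.
my $octets = "caf\xc3\xa9";
from_to($octets, 'UTF-8', 'ISO-8859-1');

printf "round trip ok: %s\n", ($again eq $bytes ? 'yes' : 'no');
printf "latin-1 length: %d bytes\n", length($octets);
```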