native_encode#

Turn a character string into a byte string in the named encoding.

Synopsis#

my $bytes = encode($encoding, $string);
my $bytes = encode($encoding, $string, $check);

What you get back#

A scalar holding raw bytes. The SVf_UTF8 flag on the result is always off: this is the form you write to files, sockets, and pipes. The input $string is treated as a sequence of Unicode codepoints regardless of how it is internally represented.
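As a rough sketch of that last point, the snippet below encodes a string and writes the result to an in-memory filehandle opened with the :raw layer. The buffer is a stand-in for a real file or socket and is purely illustrative; the example assumes encode is imported from an upstream-compatible Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = encode('UTF-8', "caf\x{e9}");

# The result carries no SVf_UTF8 flag, so length() counts bytes.
my $flagged = utf8::is_utf8($bytes) ? 1 : 0;
my $length  = length $bytes;

# Illustrative sink: an in-memory filehandle standing in for a file or socket.
my $buffer = '';
open my $fh, '>:raw', \$buffer or die "cannot open in-memory handle: $!";
print {$fh} $bytes;
close $fh or die "cannot close in-memory handle: $!";
```

Opening the handle with :raw keeps any I/O layer from transforming the bytes a second time.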

If $encoding is unknown, encode croaks with Unknown encoding '...'.
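One way to handle that failure is to trap it with eval and fall back to an encoding known to exist. This is a minimal sketch, not a recommended policy: the name 'no-such-encoding' is made up for illustration, and the match on the error text assumes the message format quoted above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# 'no-such-encoding' is a deliberately bogus name used for illustration.
my $bytes = eval { encode('no-such-encoding', "caf\x{e9}") };
my $error = $@;

if (!defined $bytes && $error =~ /^Unknown encoding/) {
    # Fall back to a known-good encoding instead of propagating the failure.
    $bytes = encode('UTF-8', "caf\x{e9}");
}
```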

The optional $check argument controls what happens when a character cannot be represented in the target encoding:

  • FB_DEFAULT (0, the default) — substitute with ? or the encoding’s replacement character.

  • FB_CROAK — die on the first unencodable character.

  • FB_QUIET — stop at the first unencodable character and return the encoded prefix. In the method form, the consumed prefix is also removed from $string.

  • FB_WARN — warn, then stop at the first unencodable character and return the encoded prefix, as FB_QUIET does.

  • FB_PERLQQ — substitute with \x{HHHH}.

  • FB_HTMLCREF — substitute with &#NNNN;.

  • FB_XMLCREF — substitute with &#xHHHH;.
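The modes above can be compared side by side on one input. The sketch below assumes upstream Encode semantics, which this function is meant to match. encode_ascii is a hypothetical helper, not part of the module: it encodes a private copy so that a check mode which consumes its input cannot alter the original string.

```perl
use strict;
use warnings;
use Encode qw(encode FB_QUIET FB_PERLQQ FB_HTMLCREF FB_XMLCREF);

# Hypothetical helper: encode a scratch copy so the caller's string is
# never modified, whatever the check mode does to its input.
sub encode_ascii {
    my ($string, $check) = @_;
    my $scratch = $string;
    return encode('ascii', $scratch, $check);
}

my $text = "caf\x{e9}!";    # U+00E9 has no ASCII byte

my $default = encode_ascii($text);                 # caf?!
my $quiet   = encode_ascii($text, FB_QUIET);       # caf
my $perlqq  = encode_ascii($text, FB_PERLQQ);      # caf\x{00e9}!
my $htmlref = encode_ascii($text, FB_HTMLCREF);    # caf&#233;!
my $xmlref  = encode_ascii($text, FB_XMLCREF);     # caf&#xe9;!
```

FB_QUIET is the only mode here that drops the trailing "!", because it stops at the first character it cannot encode instead of substituting for it.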

Examples#

Encode a string to UTF-8 bytes for writing to a file:

my $bytes = encode('UTF-8', "caf\x{e9}");

## $bytes is "caf\xc3\xa9": 5 bytes (4 characters), no SVf_UTF8

Encode to Latin-1, losing characters that don’t fit:

my $bytes = encode('iso-8859-1', "\x{20ac}");   # Euro sign

## $bytes is "?" — U+20AC has no Latin-1 byte

Die if anything can’t be encoded:

use Encode qw(encode FB_CROAK);
my $bytes = encode('ascii', "caf\x{e9}", FB_CROAK);

## dies: "\x{e9}" does not map to ascii

Edge cases#

  • undef input returns an empty byte string.

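  • Input without SVf_UTF8 is read one byte per character, each byte standing for the codepoint of the same value (U+0000 to U+00FF, the Latin-1 range), and is then encoded to the target encoding like any other string.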

  • Encoding "null" passes input bytes through unchanged.
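The second edge case can be checked directly: the same characters should encode to the same bytes whether or not the input carries SVf_UTF8. A small sketch, assuming an upstream-compatible Encode; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $downgraded = "caf\xe9";     # no SVf_UTF8: one byte per character
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);       # same characters, SVf_UTF8 now on

my $from_downgraded = encode('UTF-8', $downgraded);
my $from_upgraded   = encode('UTF-8', $upgraded);

# Both results are the 5 bytes "caf\xc3\xa9".
my $same = $from_downgraded eq $from_upgraded ? 1 : 0;
```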

Differences from upstream#

Fully compatible with upstream for ASCII, Latin-1, CP1252, the ISO-8859 family, and UTF-8. Shift_JIS, EUC-JP, and other multi-byte encodings are not yet registered in the static table and fall back to the Latin-1 identity mapping. The relevant tests are t/81-xs-native/Encode/040-encode-utf8-latin1.t and t/81-xs-native/Encode/060-check-parameter.t.

See also#

  • decode — the inverse, bytes to string.

  • encode_utf8 — the UTF-8-only fast path.

  • from_to — reencode in place without materialising a character string in between.
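How the related functions fit together with encode can be sketched as follows. The example assumes decode, encode_utf8, and from_to are exported as in upstream Encode and uses only encodings listed as compatible above.

```perl
use strict;
use warnings;
use Encode qw(encode decode encode_utf8 from_to);

my $text = "caf\x{e9}";

my $utf8  = encode('UTF-8', $text);    # characters to bytes
my $round = decode('UTF-8', $utf8);    # bytes back to characters
my $fast  = encode_utf8($text);        # same bytes as $utf8 for this input

# from_to rewrites its first argument in place.
my $octets = encode('iso-8859-1', $text);    # "caf\xe9", 4 bytes
from_to($octets, 'iso-8859-1', 'UTF-8');     # now "caf\xc3\xa9", 5 bytes
```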