Encode#
Convert between Perl character strings and bytes in a named encoding: UTF-8, UTF-16, Latin-1, CP1252, Shift_JIS, EUC-JP, and the other character sets the encoding registry supports.
Encode works in two directions. decode($name, $bytes) takes raw bytes
in the named encoding and returns a Perl character string (the SVf_UTF8
flag is set on the result). encode($name, $string) goes the other
way: it takes a character string and returns raw bytes in the target
encoding. Keep the two operations mentally distinct — strings are
sequences of Unicode codepoints, bytes are what you read from and
write to files, sockets, and pipes.
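As a minimal sketch of the two directions, assuming the behaviour of the stock Encode module, the snippet below decodes four Latin-1 bytes into characters and then encodes those characters as UTF-8. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Four bytes: "cafe" with e-acute (0xE9) as it appears in Latin-1.
my $latin1_bytes = "caf\xE9";

# Bytes -> characters: the result is a 4-character string.
my $chars = decode('ISO-8859-1', $latin1_bytes);

# Characters -> bytes: e-acute becomes the two-byte sequence C3 A9 in UTF-8.
my $utf8_bytes = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8_bytes);
# prints "4 characters, 5 UTF-8 bytes"
```

The character count stays at four while the byte count grows to five, which is the practical difference between the two kinds of string.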
Because UTF-8 is the internal form Perl uses for character strings,
it has a fast path: encode_utf8 and decode_utf8 skip the full
encoding machinery and just flip or validate the SVf_UTF8 flag.
Three low-level helpers let you poke at that flag directly:
is_utf8 queries it, _utf8_on forces it on, _utf8_off forces
it off. Use those only when you know what you are doing — they
change how Perl interprets the bytes already in the scalar without
touching the bytes themselves.
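The UTF-8 fast path and the flag query can be seen together in a short sketch. It assumes the stock Encode exports encode_utf8, decode_utf8, and is_utf8; the flag-forcing helpers are deliberately left out, since they are rarely what you want.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8 is_utf8);

my $chars = "\x{263A}";            # one character, U+263A WHITE SMILING FACE
my $bytes = encode_utf8($chars);   # three bytes: E2 98 BA

print is_utf8($chars) ? "chars: flagged\n" : "chars: not flagged\n";
print is_utf8($bytes) ? "bytes: flagged\n" : "bytes: not flagged\n";

my $back = decode_utf8($bytes);    # one character again
print "round trip ok\n" if $back eq $chars;
```

The character string carries the flag and the encoded byte string does not, even though both hold the same three bytes internally.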
Every conversion takes an optional $check bitmask that controls
what happens when a character cannot be represented in the target
encoding or, when decoding, when the input bytes are malformed.
The predefined values are FB_DEFAULT (substitute with
? or the encoding’s replacement character), FB_CROAK (die),
FB_QUIET (stop and return the converted prefix), FB_WARN
(like FB_QUIET, but also warn), FB_HTMLCREF (substitute with
&#NNNN;), FB_XMLCREF (substitute with &#xHHHH;), and
FB_PERLQQ (substitute with \x{HHHH}). LEAVE_SRC,
STOP_AT_PARTIAL, PERLQQ, WARN_ON_ERR, and
ONLY_PRAGMA_WARNINGS are raw bits you OR together to build
custom check values.
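To make the fallback modes concrete, here is a small sketch that encodes a string containing U+20AC EURO SIGN to Latin-1, which has no mapping for it. It assumes the stock Encode constants; the sample text is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK FB_HTMLCREF FB_XMLCREF FB_PERLQQ LEAVE_SRC);

my $text = "price: \x{20AC}5";   # U+20AC has no Latin-1 mapping

# Substitution fallbacks. These three constants already include LEAVE_SRC,
# so $text is not modified.
my $html = encode('ISO-8859-1', $text, FB_HTMLCREF);  # decimal reference: &#8364;
my $xml  = encode('ISO-8859-1', $text, FB_XMLCREF);   # hexadecimal character reference
my $qq   = encode('ISO-8859-1', $text, FB_PERLQQ);    # Perl \x{...} escape

# FB_CROAK dies instead of substituting. LEAVE_SRC keeps $text intact and
# eval traps the exception.
my $ok = eval { encode('ISO-8859-1', $text, FB_CROAK | LEAVE_SRC); 1 };
print "croaked: $@" unless $ok;

print "$html\n$xml\n$qq\n";
```

Wrapping the FB_CROAK call in eval is one way to turn the fatal error into a value you can test; whether to substitute or fail depends on whether lossy output is acceptable downstream.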
Encoding names are resolved through a registry. find_encoding($name)
returns a blessed encoding object you can call methods on;
resolve_alias($name) returns the canonical name as a string;
encodings() lists every name the registry knows about.
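A brief sketch of the registry calls follows, assuming stock Encode behaviour, where alias lookup is case-insensitive and an unknown name yields a false value rather than an exception. The bogus encoding name is made up for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Both spellings resolve to the same canonical name.
my $canonical = resolve_alias('latin1');
my $enc       = find_encoding('Latin1');

print "canonical: $canonical\n";
print "object:    ", $enc->name, "\n" if $enc;

# An unknown name gives a false value, so test before calling methods.
print "no such encoding\n" unless find_encoding('no-such-encoding');

# In stock Encode, the class-method form with no argument lists the
# encodings that are currently loaded.
my @loaded = Encode->encodings();
print scalar(@loaded), " encodings loaded\n";
```

Checking the return value of find_encoding before use avoids a "can't call method on undefined value" error when a name comes from user input.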
from_to($octets, $from, $to) is a one-shot in-place conversion
useful when all you want is to reencode a byte string — for
example rewriting a file body from Latin-1 to UTF-8 without
unpacking it into characters first. It handles BOM-tagged and
MIME-tagged inputs when paired with find_mime_encoding.
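The in-place conversion can be sketched as follows. The snippet assumes stock Encode, where from_to returns the new byte length on success and undef on failure; the sample bytes are illustrative.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# "naive" with i-diaeresis (0xEF), stored as five Latin-1 bytes.
my $octets = "na\xEFve";

# Rewrites $octets in place from Latin-1 to UTF-8.
my $len = from_to($octets, 'ISO-8859-1', 'UTF-8');

die "conversion failed" unless defined $len;
printf "%d bytes after conversion\n", $len;   # prints "6 bytes after conversion"
```

Because the scalar is overwritten, keep a copy first if the original bytes are still needed.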
Functions#
Encode/decode#
native_encode#
Turn a character string into a byte string in the named encoding.
native_decode#
Turn a byte string in the named encoding into a Perl character string.
native_encode_utf8#
Fast path for encoding a string to UTF-8 bytes.
native_decode_utf8#
Fast path for decoding UTF-8 bytes to a Perl character string.
UTF-8 flags#
native_is_utf8#
Return true if the scalar carries the SVf_UTF8 flag.
native_utf8_on#
Force SVf_UTF8 on in place, without touching the underlying bytes.
native_utf8_off#
Force SVf_UTF8 off in place, without touching the underlying bytes.
Encoding registry#
native_find_encoding#
Look up an encoding by name and return an object you can call methods on.
native_resolve_alias#
Return the canonical encoding name for an alias, or undef if unknown.
native_encodings#
Return the list of encoding names the registry knows about.
native_obj_encode#
Method form of encode on an encoding object.
native_obj_decode#
Method form of decode on an encoding object.
native_obj_name#
Return the canonical name of an encoding object as a string.
native_obj_renew#
Return an encoding object ready for independent use; here this is effectively a no-op that returns $self rather than a new copy.
native_obj_perlio_ok#
Return true if the encoding is safe to stack as a PerlIO layer.
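The method forms are handy when the same encoding is used repeatedly, because the name lookup happens once. This is a minimal sketch assuming the stock Encode object interface (encode, decode, name); UTF-16LE is chosen only as an example.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $utf16 = find_encoding('UTF-16LE') or die "UTF-16LE not available";

my $text  = "hi";
my $bytes = $utf16->encode($text);    # 'h' 00 'i' 00
my $chars = $utf16->decode($bytes);   # back to "hi"

printf "%s: %d bytes\n", $utf16->name, length $bytes;   # prints "UTF-16LE: 4 bytes"
```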
MIME/XML helpers#
native_fb_htmlcref#
CHECK value 520 — replace unencodable characters with HTML decimal character references (&#NNNN;).
native_fb_xmlcref#
CHECK value 1032 — replace unencodable characters with XML hexadecimal character references (&#xHHHH;).
Conversion#
native_from_to#
Reencode a byte string in place from one encoding to another.
Utilities#
native_fb_default#
CHECK value 0 — substitute unencodable characters with the encoding’s default replacement (usually ? or U+FFFD).
native_fb_croak#
CHECK value 1 — die on the first unencodable character or invalid byte sequence.
native_fb_quiet#
CHECK value 4 — stop at the first unencodable character and return the encoded prefix; truncate the input to what was not consumed (method form only).
native_fb_warn#
CHECK value 6 — behave like FB_QUIET (stop and return the converted prefix), but also emit a warning on unencodable input.
native_fb_perlqq#
CHECK value 264 — replace unencodable characters with Perl \x{HHHH} escape sequences.
native_leave_src#
CHECK bit 8 — when OR’d into $check, keeps the input scalar untouched; the default is to consume its successfully-encoded prefix.
native_stop_at_partial#
CHECK bit 2048 — stop at a partial trailing multi-byte sequence rather than reporting it as an error. Useful for streaming decoders.
native_perlqq#
CHECK bit 256 — the raw bit behind FB_PERLQQ; OR it into your own check mask for \x{HHHH} substitution.
native_warn_on_err#
CHECK bit 2 — emit a warning on encoding errors. Combined with a substitution bit to build custom fallback behaviour.
native_only_pragma_warnings#
CHECK bit 16 — emit encoding warnings only when the caller has use warnings 'utf8' (or equivalent) active, rather than unconditionally.
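The raw bits can be combined into a custom check value, as the following sketch shows. It assumes the stock Encode constants are importable by name, and it collects warnings in a hypothetical @warnings array instead of printing them.

```perl
use strict;
use warnings;
use Encode qw(encode FB_PERLQQ PERLQQ LEAVE_SRC WARN_ON_ERR);

# The predefined constant is the raw bits OR'd together: 256 | 8 == 264.
print "FB_PERLQQ == PERLQQ | LEAVE_SRC\n" if FB_PERLQQ == (PERLQQ | LEAVE_SRC);

# Custom mask: substitute a \x{...} escape, leave the source alone, and warn.
my $check = PERLQQ | LEAVE_SRC | WARN_ON_ERR;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };   # collect, do not print

my $text  = "snow \x{2603}";                         # U+2603 SNOWMAN
my $ascii = encode('ascii', $text, $check);

print "$ascii\n";
print scalar(@warnings), " warning(s) captured\n";
```

Building the mask from named bits rather than a literal number keeps the intent readable and avoids depending on the numeric values.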