encoding
... ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2 ...
... UTF-16, UCS-4 and UTF-32. In an
encoding form, each character is represented as one or more encoding
units. All standard UCS ...
... UCS encoding forms except UTF-8 have an encoding
unit larger than one octet, making them hard to use in many current
applications and protocols that assume 8 or even 7 bit ...
... UTF-8, the object of this memo, has a one-octet encoding unit. It
uses all bits of an octet, but has the quality of preserving the full
...
... code point or Unicode scalar value). This
encoding form has the following characteristics (all values are in
hexadecimal):
...
... simple algorithm, i.e., the probability that a string of
characters in any other encoding appears as valid UTF-8 is low,
...
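The low false-positive rate mentioned above can be illustrated with a trial decode. This is only a sketch: the helper name looks_like_utf8 is hypothetical, and the check relies on Python's built-in strict UTF-8 decoder rather than on anything defined in the excerpt.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the octets decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word serialized in two different encodings.
as_latin1 = "café".encode("latin-1")   # the final octet is a lone 0xE9
as_utf8 = "café".encode("utf-8")       # the final character takes two octets

latin1_passes = looks_like_utf8(as_latin1)
utf8_passes = looks_like_utf8(as_utf8)
```

Pure ASCII text passes the check in either encoding, so a successful decode is evidence of UTF-8 only when the data contains octets above 0x7F.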
... The table below summarizes the format of these different octet types.
The letter x indicates bits available for encoding bits of the
character number.
...
... 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Encoding a character to UTF-8 proceeds as follows:
...
...
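A hand-written encoder may make the octet-type table easier to follow. The sketch below is not taken from the memo: the function name encode_utf8 is hypothetical, and the one-, two- and three-octet ranges, which the excerpt elides, are filled in from the usual UTF-8 definition.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single character number as UTF-8 octets (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("character number out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values must not be encoded")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

For any encodable character number, the result should agree with Python's own chr(code_point).encode("utf-8").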
The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters. When encoding in UTF-8 from UTF-16 data, it is necessary
...
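One common way to handle UTF-16 input is to combine each surrogate pair into a single character number before encoding. The sketch below assumes that approach; combine_surrogates is a hypothetical helper name, and the final encoding step uses Python's built-in codec.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Turn a UTF-16 surrogate pair into one character number (illustrative)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair D801 DC00 stands for U+10400.
code_point = combine_surrogates(0xD801, 0xDC00)
utf8_octets = chr(code_point).encode("utf-8")

# A lone surrogate is not a character, so the strict codec refuses it.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```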
... CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
use on the Internet. CESU-8 operates similarly to UTF-8 ...
... number (code point). This leads to different results for character
numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
valid UTF-8 ...
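The difference can be seen by encoding one character above 0xFFFF both ways. The cesu8_encode function below is a hypothetical, simplified sketch of the CESU-8 idea (each UTF-16 surrogate is encoded as its own three-octet sequence); it is not a complete implementation of UTR 26.

```python
def cesu8_encode(text: str) -> bytes:
    """Simplified CESU-8 encoder for illustration only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each
            # surrogate as if it were a three-octet UTF-8 sequence.
            offset = cp - 0x10000
            for unit in (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)):
                out += bytes([0xE0 | (unit >> 12),
                              0x80 | ((unit >> 6) & 0x3F),
                              0x80 | (unit & 0x3F)])
        else:
            out += ch.encode("utf-8")
    return bytes(out)


sample = chr(0x10400)
cesu8_octets = cesu8_encode(sample)    # six octets
utf8_octets = sample.encode("utf-8")   # four octets

# A strict UTF-8 decoder rejects the CESU-8 form.
try:
    cesu8_octets.decode("utf-8")
    cesu8_is_valid_utf8 = True
except UnicodeDecodeError:
    cesu8_is_valid_utf8 = False
```

For text confined to character numbers at or below 0xFFFF, the sketch produces the same octets as UTF-8.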
... valid UTF-8 only if it matches the
following syntax, which is derived from the rules for encoding UTF-8
and is expressed in the ABNF ...
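A syntax of this kind can be transcribed into a byte-level regular expression. The sketch below follows the production names (UTF8-1 through UTF8-4, UTF8-tail) of the ABNF in RFC 3629, which this excerpt does not reproduce; the Python names TAIL, ALTERNATIVES, UTF8_OCTETS and is_valid_utf8 are hypothetical.

```python
import re

TAIL = rb"[\x80-\xBF]"  # UTF8-tail

ALTERNATIVES = [
    rb"[\x00-\x7F]",                        # UTF8-1
    rb"[\xC2-\xDF]" + TAIL,                 # UTF8-2
    rb"\xE0[\xA0-\xBF]" + TAIL,             # UTF8-3, excludes overlong forms
    rb"[\xE1-\xEC]" + TAIL + TAIL,          # UTF8-3
    rb"\xED[\x80-\x9F]" + TAIL,             # UTF8-3, excludes U+D800..U+DFFF
    rb"[\xEE-\xEF]" + TAIL + TAIL,          # UTF8-3
    rb"\xF0[\x90-\xBF]" + TAIL + TAIL,      # UTF8-4, excludes overlong forms
    rb"[\xF1-\xF3]" + TAIL + TAIL + TAIL,   # UTF8-4
    rb"\xF4[\x80-\x8F]" + TAIL + TAIL,      # UTF8-4, stops at U+10FFFF
]

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(rb"(?:" + b"|".join(ALTERNATIVES) + rb")*")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole octet sequence matches the syntax."""
    return UTF8_OCTETS.fullmatch(data) is not None
```

Because every alternative is distinguished by its first octet, the match proceeds without backtracking between alternatives.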
... UCS
characters and also to recognize which UCS encoding is involved and,
with encodings having a multi-octet encoding unit, as a way to
recognize the serialization order of the octets. UTF-8 having a
single-octet encoding unit, this last function is useless and the BOM
will always appear as the octet sequence EF ...
... those textual protocol elements for which the protocol provides
character encoding identification mechanisms, when it is expected
that implementations of the protocol will be in a position to
always use the mechanisms properly. This will be the case when
...
... those textual protocol elements for which the protocol does not
provide character encoding identification mechanisms, when a ban
would be unenforceable, or when it is expected that
implementations of the protocol will not be in a position to
...
... obtain such entities from file systems, from protocols that do not
have encoding identification mechanisms for payloads (such as FTP)
or from other protocols that do not guarantee proper
identification of character encoding (such as HTTP).
...
... element and react appropriately: using the
signature to identify the character encoding as necessary and
stripping or ignoring the signature as appropriate.
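A receiver that follows this advice might check for the signature, note its presence, and strip it before decoding. The sketch below assumes Python's codecs.BOM_UTF8 constant; the function name decode_with_optional_signature is hypothetical.

```python
import codecs


def decode_with_optional_signature(data: bytes) -> tuple[str, bool]:
    """Decode UTF-8 octets, reporting whether a leading signature was present."""
    had_signature = data.startswith(codecs.BOM_UTF8)
    if had_signature:
        # Strip the signature so that it does not leak into the text.
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8"), had_signature


with_signature = decode_with_optional_signature(codecs.BOM_UTF8 + b"hello")
without_signature = decode_with_optional_signature(b"hello")
```

When only the stripped text is needed, Python's "utf-8-sig" codec performs the same removal during decoding.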
...
... including all amendments at least up to amendment 5 of the 1993
edition (Korean block), encoded to a sequence of octets using the
encoding scheme outlined above. UTF-8 is suitable for use in MIME
content types under the "text" top-level type ...
... ISO/IEC
10646 description of UTF-8 allows encoding character numbers up to
U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
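A decoder written to the current definition is expected to refuse the longer forms. The sketch below feeds hand-built five- and six-octet sequences (old-style forms for U+200000 and U+4000000) to Python's strict decoder; decodes_as_utf8 is a hypothetical helper name.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


five_octets = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])        # old form, U+200000
six_octets = bytes([0xFC, 0x84, 0x80, 0x80, 0x80, 0x80])   # old form, U+4000000
largest_valid = bytes([0xF4, 0x8F, 0xBF, 0xBF])            # U+10FFFF

results = [decodes_as_utf8(b) for b in (five_octets, six_octets, largest_valid)]
```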
...
... Security may also be impacted by a characteristic of several
character encodings, including UTF-8: the "same thing" (as far as a
user can tell) can be represented by several distinct character
...
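A familiar instance is an accented letter, which can be written as one precomposed character or as a base letter followed by a combining mark. The sketch below uses Python's unicodedata module and assumes NFC normalization as one possible way to compare such strings; the excerpt itself does not prescribe a particular form.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two strings look identical to a user but differ octet for octet.
same_octets = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# Normalizing both to one form before comparing removes the ambiguity.
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))
```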
... o Straightened out terminology. UTF-8 now described in terms of an
encoding form of the character number. UCS-2 and UCS-4 almost
...
... Phipps, T., "Unicode Technical Report #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)", UTR 26, April 2002, <http://www.unicode.org/unicode/reports/tr26/ ...
