RFC 3629: UTF-8, a transformation format of ISO 106...
RFC-Ref

encoding



... ISO/IEC 10646 and Unicode define several encoding forms of their common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an encoding form, each character is represented as one or more encoding units. All standard UCS encoding forms except UTF-8 have an encoding unit larger than one octet, making them hard to use in many current applications and protocols that assume 8 or even 7 bit ...
... UTF-8, the object of this memo, has a one-octet encoding unit. It uses all bits of an octet, but has the quality of preserving the full ...
... code point or Unicode scalar value). This encoding form has the following characteristics (all values are in hexadecimal): ...
... o Round-trip conversion is easy between UTF-8 and other encoding forms. ...
... simple algorithm, i.e., the probability that a string of characters in any other encoding appears as valid UTF-8 is low, ...


... The table below summarizes the format of these different octet types. The letter x indicates bits available for encoding bits of the character number. ...
... 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Encoding a character to UTF-8 proceeds as follows: ...
... The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters. When encoding in UTF-8 from UTF-16 data, it is necessary ...
... CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for use on the Internet. CESU-8 operates similarly to UTF-8 ...
... number (code point). This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8 ...
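
The snippets above describe the octet layout, the ban on U+D800..U+DFFF, and the difference between UTF-8 and CESU-8 for character numbers above 0xFFFF. The following Python sketch illustrates those three points under stated assumptions. It is not text from the RFC, and the function names (utf8_encode, combine_surrogates, cesu8_encode) are hypothetical, chosen only for this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one character number as UTF-8, following the octet-type table."""
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate values do not directly represent characters.
        raise ValueError("U+D800..U+DFFF must not be encoded in UTF-8")
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("character number out of range")
    if cp <= 0x7F:
        return bytes([cp])                                   # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),                      # 110xxxxx
                      0x80 | (cp & 0x3F)])                   # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),                     # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                         # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def combine_surrogates(high: int, low: int) -> int:
    """Turn a UTF-16 surrogate pair into the character number it stands for."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def _three_octets(unit: int) -> bytes:
    """Apply the three-octet layout to a 16-bit value without any checks."""
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def cesu8_encode(cp: int) -> bytes:
    """CESU-8-style encoding, shown for comparison only: each UTF-16 code
    unit is encoded separately, so supplementary characters take 6 octets."""
    if cp <= 0xFFFF:
        return utf8_encode(cp)
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)
    low = 0xDC00 + (v & 0x3FF)
    return _three_octets(high) + _three_octets(low)
```

When the input is UTF-16, the pair is combined first and the resulting character number is encoded once, for example utf8_encode(combine_surrogates(0xD84C, 0xDFB4)). Passing the two surrogate values to utf8_encode one at a time raises an error instead of producing CESU-8-like output.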


... valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF ...
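
As an illustration of checking an octet sequence against that syntax, here is a minimal Python sketch. It assumes the byte ranges of the RFC's UTF8-1 through UTF8-4 productions (lead octets C2-DF, E0-EF, F0-F4, with a restricted first continuation octet after E0, ED, F0 and F4). The function name is_valid_utf8 is hypothetical, and the code is a hand-written walk over those ranges, not a generated ABNF parser.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data matches the UTF-8 octet-sequence syntax."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # single octet
            i += 1
            continue
        # need = number of continuation octets;
        # lo..hi = allowed range of the FIRST continuation octet.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes U+D800..U+DFFF
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlong forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # excludes values above U+10FFFF
        else:
            return False                   # 80-C1 and F5-FF never start a sequence
        if i + need >= n:
            return False                   # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

Restricting the first continuation octet is what rejects overlong forms, surrogates and out-of-range values without decoding the character number at all.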


... UCS characters and also to recognize which UCS encoding is involved and, with encodings having a multi-octet encoding unit, as a way to recognize the serialization order of the octets. UTF-8 having a single-octet encoding unit, this last function is useless and the BOM will always appear as the octet sequence EF ...
... those textual protocol elements for which the protocol provides character encoding identification mechanisms, when it is expected that implementations of the protocol will be in a position to always use the mechanisms properly. This will be the case when ...
... those textual protocol elements for which the protocol does not provide character encoding identification mechanisms, when a ban would be unenforceable, or when it is expected that implementations of the protocol will not be in a position to ...
... obtain such entities from file systems, from protocols that do not have encoding identification mechanisms for payloads (such as FTP) or from other protocols that do not guarantee proper identification of character encoding (such as HTTP). ...
... element and react appropriately: using the signature to identify the character encoding as necessary and stripping or ignoring the signature as appropriate. ...
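
A receiver that wants to "strip or ignore the signature" might do something like the following Python sketch. It relies on the standard library constant codecs.BOM_UTF8 for the signature octets. The function name strip_utf8_signature is hypothetical, and whether stripping is appropriate depends on the protocol element, as the snippets above note.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Remove one leading UTF-8 signature (BOM) if present; otherwise
    return the data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def has_utf8_signature(data: bytes) -> bool:
    """Report whether the data starts with the UTF-8 signature, which a
    receiver may use as a hint about the character encoding."""
    return data.startswith(codecs.BOM_UTF8)
```

Only a signature at the very start is removed. The same octets appearing later in the data are left alone, since there they are ordinary content. Python's "utf-8-sig" codec offers similar behavior when decoding directly to text.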


... including all amendments at least up to amendment 5 of the 1993 edition (Korean block), encoded to a sequence of octets using the encoding scheme outlined above. UTF-8 is suitable for use in MIME content types under the "text" top-level type ...


... Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore ...
... Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character ...
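
Both concerns can be made concrete with a short Python sketch. The first part assumes an encoder that limits character numbers to U+10FFFF, so that four octets per character is a safe upper bound when sizing buffers. The second part shows one "same thing, different sequences" case and uses Unicode normalization (unicodedata.normalize) as one possible way to compare such strings. Normalization is an assumption of this example, not something the snippet above prescribes, and the helper names are hypothetical.

```python
import unicodedata

MAX_OCTETS_PER_CHAR = 4  # holds only if character numbers stop at U+10FFFF


def check_character_number(cp: int) -> None:
    """Reject values that would need the 5- or 6-octet forms."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("character number outside U+0000..U+10FFFF")


def max_encoded_length(num_chars: int) -> int:
    """Worst-case UTF-8 length for num_chars range-checked characters."""
    return num_chars * MAX_OCTETS_PER_CHAR


# The same visible text as two distinct character sequences.
precomposed = "\u00e9"    # e with acute accent, one character
decomposed = "e\u0301"    # letter e followed by a combining acute accent


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here precomposed and decomposed differ both as strings and as UTF-8 octets, yet same_after_nfc treats them as equal. A comparison made on raw octets would treat them as different, which is the kind of mismatch the security note warns about.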


... o Straightened out terminology. UTF-8 now described in terms of an encoding form of the character number. UCS-2 and UCS-4 almost ...


... Phipps, T., "Unicode Technical Report #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)", UTR 26, April 2002, <http://www.unicode.org/unicode/reports/tr26/ ...


