RFC 2279: UTF-8, a transformation format of ISO 106...
RFC-Ref

UTF-8



... encoding ([RFC2152]). UTF-8, the object of this memo, uses all bits of an octet, but has the quality of preserving the full US-ASCII ...
... reserved range. UTF-16 impacts UTF-8 in that UCS-2 values from the reserved range must be treated specially in the UTF-8 transformation. ...
... UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of ...
... consequence is that a plain ASCII string is also a valid UTF-8 string. ...
... US-ASCII values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility ...
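The two ASCII properties quoted above can be checked with a short sketch. The helper name `is_ascii_transparent` is hypothetical, and the sketch relies on Python's built-in "utf-8" codec rather than on anything defined in the memo.

```python
def is_ascii_transparent(text: str) -> bool:
    """Check that octets below 0x80 in the UTF-8 form come only from ASCII characters."""
    encoded = text.encode("utf-8")
    # Octets of the encoded form that fall in the US-ASCII range.
    ascii_octets = bytes(b for b in encoded if b < 0x80)
    # ASCII characters of the original text, in order.
    ascii_chars = "".join(ch for ch in text if ord(ch) < 0x80).encode("ascii")
    return ascii_octets == ascii_chars


# A plain ASCII string has the same octets in both encodings.
plain = "plain ASCII"
same_octets = plain.encode("ascii") == plain.encode("utf-8")

# Mixed text: no ASCII octet appears inside a multi-octet sequence.
mixed_ok = is_ascii_transparent("na\u00efve \u2262 text")
```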
... Round-trip conversion is easy between UTF-8 and either of UCS-4, UCS-2 ...
... The Boyer-Moore fast search algorithm can be used with UTF-8 data. ...
... UTF-8 strings can be fairly reliably recognized as such by a simple algorithm, i.e. the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length. ...
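One possible form of such a recognition test is simply to attempt a strict decode. This is a sketch under assumptions: `looks_like_utf8` is a hypothetical name, and Python's built-in codec follows a later, stricter definition of UTF-8 (at most four octets per character) than the 1 to 6 octet form described in this memo.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the octets form a valid UTF-8 string for Python's strict codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text in two encodings: only the UTF-8 form passes.
as_utf8 = "caf\u00e9".encode("utf-8")      # 63 61 66 C3 A9
as_latin1 = "caf\u00e9".encode("latin-1")  # 63 61 66 E9
```

A passing result is evidence, not proof: pure ASCII data passes trivially, and short non-UTF-8 strings can pass by chance.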
... UTF-8 was originally a project of the X/Open Joint Internationalization Group ...
... The original authors were Gary Miller, Greger Leijonhufvud and John Entenmann. Later, Ken Thompson and Rob Pike did significant work for the formal UTF-8. ...
... UNICODE]. The definitive reference, including provisions for UTF-16 data within UTF-8, is Annex R of ISO/IEC 10646-1 [ISO-10646 ...


... UTF-8 definition ...
... In UTF-8, characters are encoded using sequences of 1 to 6 octets. The only octet of a "sequence" of one has the higher-order bit set to ...
... UCS-4 range (hex.)    UTF-8 octet sequence (binary)
    0000 0000-0000 007F   0xxxxxxx
    0000 0080-0000 07FF   110xxxxx 10xxxxxx ...
... Encoding from UCS-4 to UTF-8 proceeds as follows: ...
... encoding UCS-2 (or Unicode) to UTF-8 can be obtained from the above, in principle, by simply extending each UCS-2 ...
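The encoding direction can be sketched as follows. This is an illustration, not the memo's own wording: the function name `encode_ucs4` and the table `_RANGES` are hypothetical, and the rows beyond the two quoted in the excerpt are taken from the full six-row table of RFC 2279. The sketch takes UCS-4 values only and does not treat the UTF-16 reserved range specially.

```python
# (upper bound of the UCS-4 range, marker bits of the first octet,
#  number of continuation octets)
_RANGES = [
    (0x0000007F, 0x00, 0),
    (0x000007FF, 0xC0, 1),
    (0x0000FFFF, 0xE0, 2),
    (0x001FFFFF, 0xF0, 3),
    (0x03FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
]


def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 value as a UTF-8 sequence of 1 to 6 octets."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the UCS-4 range")
    # Pick the first range that contains the value.
    for upper, lead, ncont in _RANGES:
        if value <= upper:
            break
    octets = []
    # Fill continuation octets from the lowest-order 6 bits upward.
    for _ in range(ncont):
        octets.append(0x80 | (value & 0x3F))
        value >>= 6
    # The remaining high-order bits go into the first octet.
    octets.append(lead | value)
    return bytes(reversed(octets))
```

For values up to U+FFFF outside the reserved range, the result agrees with Python's built-in codec, for example `encode_ucs4(0x2262) == "\u2262".encode("utf-8")`.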
... Decoding from UTF-8 to UCS-4 proceeds as follows: ...
... bits are left. If the UTF-8 sequence is no more than three octets long, decoding can proceed directly to UCS-2. ...
... should protect against decoding invalid sequences. For instance, a naive implementation may (wrongly) decode the invalid UTF-8 sequence C0 80 into the character U+0000, which may have security consequences and/or cause other problems. See ...
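A decoder that guards against the C0 80 case might look like the sketch below. The names `decode_sequence` and `_LEADS` are hypothetical, and the minimum-value check is one way (an assumption, not the memo's prescribed method) to reject sequences that are longer than necessary for the value they carry.

```python
# (lowest first octet, highest first octet, sequence length,
#  mask for the value bits of the first octet, smallest legal value)
_LEADS = [
    (0x00, 0x7F, 1, 0x7F, 0x00000000),
    (0xC0, 0xDF, 2, 0x1F, 0x00000080),
    (0xE0, 0xEF, 3, 0x0F, 0x00000800),
    (0xF0, 0xF7, 4, 0x07, 0x00010000),
    (0xF8, 0xFB, 5, 0x03, 0x00200000),
    (0xFC, 0xFD, 6, 0x01, 0x04000000),
]


def decode_sequence(octets: bytes) -> int:
    """Decode one complete UTF-8 sequence (1 to 6 octets) to a UCS-4 value."""
    if not octets:
        raise ValueError("empty sequence")
    first = octets[0]
    for low, high, length, mask, minimum in _LEADS:
        if low <= first <= high:
            break
    else:
        # 80-BF (continuation octet) or FE/FF cannot start a sequence.
        raise ValueError("invalid first octet")
    if len(octets) != length:
        raise ValueError("wrong number of octets for this first octet")
    value = first & mask
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("continuation octet must have the form 10xxxxxx")
        value = (value << 6) | (octet & 0x3F)
    # Reject sequences that encode a value a shorter sequence could hold,
    # such as C0 80 for U+0000.
    if value < minimum:
        raise ValueError("overlong sequence")
    return value
```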


... UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391, 002E) may be encoded in UTF-8 as follows: ...
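The octets for this example can be reproduced with Python's built-in codec. This is a quick check rather than part of the memo, and the variable names are illustrative only.

```python
# "A<NOT IDENTICAL TO><ALPHA>." as UCS-2 values 0041, 2262, 0391, 002E.
ucs2_values = [0x0041, 0x2262, 0x0391, 0x002E]
text = "".join(chr(v) for v in ucs2_values)

encoded = text.encode("utf-8")
# Space-separated upper-case hex, one group per octet.
hex_octets = " ".join(f"{b:02X}" for b in encoded)
print(hex_octets)  # prints: 41 E2 89 A2 CE 91 2E
```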


... CHARSET-REG]. The proposed charset parameter value is "UTF-8". This string labels media types containing text consisting of characters from the repertoire of ...
... (Korean block), encoded to a sequence of octets using the encoding scheme outlined above. UTF-8 is suitable for use in MIME content types under the "text" top-level type ...
... It is noteworthy that the label "UTF-8" does not contain a version identification, referring generically to ISO/IEC ...
... charset parameter value "UNICODE-1-1-UTF-8", for the exclusive purpose of labelling text data containing Hangul syllables encoded to UTF-8 without taking into account Amendment 5 of ISO/IEC 10646 (i.e. using the pre-amendment 5 code point assignments). Any other UTF-8 data SHOULD NOT use this label, in particular data not containing any Hangul syllables, and it ...


... Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax. ...
... critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence ...
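The NUL scenario can be illustrated with a small sketch. Both function names are hypothetical. The second function assumes that validating with a strict decoder before inspecting characters is an acceptable defence; Python's built-in codec rejects C0 80 as an invalid sequence.

```python
def contains_nul_naive(data: bytes) -> bool:
    """Look only for the single-octet form 00 in the encoded input."""
    return b"\x00" in data


def contains_nul_checked(data: bytes) -> bool:
    """Validate the whole input first, then look for NUL in the decoded text.

    Raises UnicodeDecodeError if the input is not valid UTF-8.
    """
    text = data.decode("utf-8", errors="strict")
    return "\x00" in text


# C0 80 is an illegal two-octet form that a careless decoder maps to U+0000.
suspicious = b"abc\xc0\x80def"

naive_result = contains_nul_naive(suspicious)  # the encoded check sees no 00 octet
try:
    contains_nul_checked(suspicious)
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The naive check passes the input through, so a later lenient decoder could still produce a NUL character. The checked version refuses the input outright.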


... Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. Five amendments and a technical corrigendum have been published up to now. UTF-8 is described in Annex R, published as Amendment 2. UTF-16 is described in Annex Q, published as Amendment 1. 17 other amendments are currently at various stages of standardization. ...


