UTF-8
... encoding
([RFC2152]). UTF-8, the object of this memo, uses all bits of an
octet, but has the quality of preserving the full US-ASCII ...
... reserved range. UTF-16 impacts
UTF-8 in that UCS-2 values from the reserved range must be treated
...
... US-ASCII values do not appear otherwise in a UTF-8 encoded
character stream. This provides compatibility ...
... UTF-8 strings can be fairly reliably recognized as such by a
simple algorithm, i.e. the probability that a string of characters
...
... in any other encoding appears as valid UTF-8 is low, diminishing
with increasing string length.
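
The recognition property described above can be illustrated with a minimal sketch. It assumes Python's built-in strict "utf-8" codec as the "simple algorithm"; that codec follows the current, stricter definition of UTF-8 (at most four octets per character), not the 1 to 6 octet form of this memo. The name looks_like_utf8 and the sample strings are hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the octets form a valid UTF-8 stream (strict decoding)."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# The same text in two encodings. In ISO 8859-1, "é" is the single octet
# E9, which in UTF-8 would have to be followed by two continuation octets;
# here it is followed by a space, so the stream is rejected.
utf8_sample = "café déjà vu".encode("utf-8")
latin1_sample = "café déjà vu".encode("latin-1")

print(looks_like_utf8(utf8_sample))
print(looks_like_utf8(latin1_sample))
```

Pure US-ASCII input always passes this check, so the test says nothing about which ASCII-compatible encoding such data was written in.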
...
... The original authors were Gary Miller, Greger Leijonhufvud and John
Entenmann. Later, Ken Thompson and Rob Pike did significant work for
the formal UTF-8.
...
... UNICODE]. The definitive
reference, including provisions for UTF-16 data within UTF-8, is
Annex R of ISO/IEC 10646-1 [ISO-10646 ...
... UTF-8 definition ...
...
In UTF-8, characters are encoded using sequences of 1 to 6 octets.
The only octet of a "sequence" of one has the higher-order bit set to
...
... UCS-4 range (hex.) UTF-8 octet sequence (binary)
0000 0000-0000 007F 0xxxxxxx
0000 0080-0000 07FF 110xxxxx 10xxxxxx
...
... encoding UCS-2 (or Unicode) to UTF-8 can be
obtained from the above, in principle, by simply extending each
UCS-2 ...
... bits are left.
If the UTF-8 sequence is no more than three octets long, decoding
can proceed directly to UCS-2.
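
As an illustration only, the encoding of a single UCS-2 value can be sketched as follows. The one- and two-octet patterns are those of the table above; the three-octet pattern (1110xxxx 10xxxxxx 10xxxxxx for 0800-FFFF) is taken from the full table of the memo, which this excerpt elides. The function name is hypothetical, and the sketch simply refuses values in the range reserved for UTF-16 (D800-DFFF) rather than combining them into pairs.

```python
def encode_ucs2_to_utf8(code: int) -> bytes:
    """Encode one UCS-2 value (0000-FFFF) as a UTF-8 octet sequence."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a UCS-2 value")
    if 0xD800 <= code <= 0xDFFF:
        # Assumption of this sketch: UTF-16 pair handling is out of scope.
        raise ValueError("value in the range reserved for UTF-16")
    if code <= 0x7F:
        # 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])


print(encode_ucs2_to_utf8(0x0391).hex())
```

The bits of the value are filled into the x positions starting from the lowest-order bits, which is why each step shifts by 6 and masks with 0x3F.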
...
... should protect against decoding invalid sequences. For
instance, a naive implementation may (wrongly) decode the
invalid UTF-8 sequence C0 80 into the character U+0000, which
may have security consequences and/or cause other problems. See
...
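
The C0 80 case can be made concrete with a small sketch restricted to two-octet sequences. Both function names are hypothetical; the strict variant shows one possible check (rejecting any two-octet sequence whose value would fit in a single octet) and is not presented as a complete validator.

```python
def naive_decode_two_octets(seq: bytes) -> int:
    """Strip the marker bits and combine the rest, with no validity checks."""
    return ((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F)


def strict_decode_two_octets(seq: bytes) -> int:
    """Decode a two-octet sequence, rejecting malformed and overlong forms."""
    if len(seq) != 2 or (seq[0] & 0xE0) != 0xC0 or (seq[1] & 0xC0) != 0x80:
        raise ValueError("not a well-formed two-octet sequence")
    value = ((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F)
    if value < 0x80:
        # Values below 0080 must use the single-octet form.
        raise ValueError("overlong encoding")
    return value


# The naive decoder silently yields U+0000 for the invalid sequence C0 80.
print(naive_decode_two_octets(b"\xC0\x80"))

try:
    strict_decode_two_octets(b"\xC0\x80")
except ValueError as err:
    print("rejected:", err)
```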
... UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
002E) may be encoded in UTF-8 as follows:
...
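
The example sequence can be reproduced with a short sketch, here assuming Python's built-in "utf-8" codec rather than any implementation referenced by the memo. The variable names are illustrative.

```python
# The UCS-2 values 0041, 2262, 0391, 002E from the example above.
text = "\u0041\u2262\u0391\u002E"

encoded = text.encode("utf-8")
hex_octets = " ".join(f"{octet:02X}" for octet in encoded)

print(hex_octets)  # prints: 41 E2 89 A2 CE 91 2E
```

The two US-ASCII characters keep their single-octet values, while 2262 takes three octets and 0391 takes two.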
... CHARSET-REG]. The proposed
charset parameter value is "UTF-8". This string labels media types
containing text consisting of characters from the repertoire of
...
... (Korean block), encoded to a sequence of octets using the encoding
scheme outlined above. UTF-8 is suitable for use in MIME content
types under the "text" top-level type ...
...
It is noteworthy that the label "UTF-8" does not contain a version
identification, referring generically to ISO/IEC ...
... charset parameter value
"UNICODE-1-1-UTF-8", for the exclusive purpose of labelling text data
containing Hangul syllables encoded to UTF-8 without taking into
account Amendment 5 of ISO/IEC 10646 (i.e. using the pre-amendment 5
code point assignments). Any other UTF-8 data SHOULD NOT use this
label, in particular data not containing any Hangul syllables, and it
...
Implementors of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences. It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.
...
... critical validity checks
against the UTF-8 encoded form of its input, but interprets certain
illegal octet sequences as characters. For example, a parser might
prohibit the NUL character when encoded as the single-octet sequence
...
... Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. Five amendments and a technical corrigendum have been published up to now. UTF-8 is described in Annex R, published as Amendment 2. UTF-16 is described in Annex Q, published as Amendment 1. 17 other amendments are currently at various stages of standardization. ...
