An encoding of Unicode and ISO 10646 characters as multibyte characters, originally specified in RFC 2044 (since superseded by RFC 3629). It is becoming popular in newer Internet standards because it is a superset of US-ASCII, never uses the US-ASCII byte values except to represent the US-ASCII characters themselves, can represent the full international character range of Unicode, requires no shift states, and allows re-synchronization in the middle of a string.

The Unicode (or UTF-32) character number, up to 31 bits in the original scheme shown here, is encoded as follows (RFC 3629 later restricted UTF-8 to at most four bytes, covering characters up to 0x10ffff):

Unicode range (hex)  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
00000000-0000007f    0xxxxxxx
00000080-000007ff    110xxxxx  10xxxxxx
00000800-0000ffff    1110xxxx  10xxxxxx  10xxxxxx
00010000-001fffff    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
00200000-03ffffff    111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
04000000-7fffffff    1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

Each 'x' indicates a single bit from the Unicode encoding of the character.
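
The table translates directly into code. Here is a minimal sketch in C (utf8_encode is an illustrative name, not any particular library's API); it implements the full six-byte scheme above, though per RFC 3629 a strict encoder would refuse anything above 0x10ffff:

#include <stddef.h>
#include <stdint.h>

/* Sketch: encode code point 'cp' (up to 31 bits, as in the table
 * above) into 'out'; returns the number of bytes written, or 0 if
 * the value does not fit in 31 bits. */
size_t utf8_encode(uint32_t cp, unsigned char out[6])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else if (cp < 0x200000) {             /* 11110xxx + 3 continuation bytes */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    } else if (cp < 0x4000000) {            /* 111110xx + 4 continuation bytes */
        out[0] = 0xF8 | (cp >> 24);
        out[1] = 0x80 | ((cp >> 18) & 0x3F);
        out[2] = 0x80 | ((cp >> 12) & 0x3F);
        out[3] = 0x80 | ((cp >> 6) & 0x3F);
        out[4] = 0x80 | (cp & 0x3F);
        return 5;
    } else if (cp < 0x80000000) {           /* 1111110x + 5 continuation bytes */
        out[0] = 0xFC | (cp >> 30);
        out[1] = 0x80 | ((cp >> 24) & 0x3F);
        out[2] = 0x80 | ((cp >> 18) & 0x3F);
        out[3] = 0x80 | ((cp >> 12) & 0x3F);
        out[4] = 0x80 | ((cp >> 6) & 0x3F);
        out[5] = 0x80 | (cp & 0x3F);
        return 6;
    }
    return 0;
}

For example, utf8_encode(0x20ac, buf) produces the three bytes 0xe2 0x82 0xac, the UTF-8 form of the euro sign.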

You must use the shortest possible encoding for each character, so that a given character can appear in only one encoding; this avoids security problems with unwanted characters getting past filters. UTF-16 "surrogate characters" are not allowed either (though they appear everywhere due to stupid translation software). I recommend that illegal byte sequences be treated as though each byte were a character in the range 0x80-0xff. This would allow most ISO-8859-1 text to be passed as UTF-8 without any changes. Not everybody agrees with this idea, as it can reintroduce security problems, but I also recommend that no software assign special meaning (i.e. as separators) to any character greater than 0x80.
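
As a concrete sketch of that recommendation in C (decode_lenient is an illustrative name, and I assume RFC 3629's limits, so overlong forms, surrogates, truncated sequences, and values above 0x10ffff all trigger the byte-as-character fallback):

#include <stddef.h>
#include <stdint.h>

/* Decode one character from 's' (containing 'len' bytes), store its
 * code point in '*cp', and return the number of bytes consumed.
 * Illegal sequences are not rejected: their first byte is returned
 * as a character in the range 0x80-0xff, so most ISO-8859-1 text
 * passes through unchanged. */
size_t decode_lenient(const unsigned char *s, size_t len, uint32_t *cp)
{
    unsigned char b;
    uint32_t v, min;
    size_t n, i;

    if (len == 0)
        return 0;
    b = s[0];
    if (b < 0x80) { *cp = b; return 1; }             /* plain ASCII */
    if (b < 0xC0) goto fallback;                     /* stray continuation byte */
    if (b < 0xE0)      { n = 2; v = b & 0x1F; min = 0x80;    }
    else if (b < 0xF0) { n = 3; v = b & 0x0F; min = 0x800;   }
    else if (b < 0xF8) { n = 4; v = b & 0x07; min = 0x10000; }
    else goto fallback;                              /* 0xf8-0xff: invalid lead byte */

    if (len < n) goto fallback;                      /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) goto fallback;    /* not a 10xxxxxx byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min) goto fallback;                      /* overlong, not shortest form */
    if (v >= 0xD800 && v <= 0xDFFF) goto fallback;   /* UTF-16 surrogate */
    if (v > 0x10FFFF) goto fallback;                 /* beyond the Unicode range */
    *cp = v;
    return n;

fallback:
    *cp = b;    /* treat the lone byte as a character, as recommended above */
    return 1;
}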

Among the interesting properties of UTF-8: sorting UTF-8 strings byte by byte produces the same ordering that sorting the corresponding UTF-32 strings would produce; all continuation bytes start with the bits 10, making it easy to find the character divisions; and ASCII characters and control characters never appear inside a multi-byte character, while the bytes 0xFE and 0xFF never appear at all (allowing UTF-8 strings to be passed through normal strcpy() and other byte-oriented string manipulation software).
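
A two-line demonstration of the sorting property in C, using hand-encoded byte strings for é (U+00E9, bytes c3 a9) and 中 (U+4E2D, bytes e4 b8 ad):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *a = "\xc3\xa9";       /* é,  U+00E9 */
    const char *b = "\xe4\xb8\xad";   /* 中, U+4E2D */
    /* strcmp() compares bytes as unsigned char; 0xc3 < 0xe4, which
     * matches the code-point order 0x00E9 < 0x4E2D. Prints 1. */
    printf("%d\n", strcmp(a, b) < 0);
    return 0;
}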

I believe the only reason UTF-8 is not used everywhere is political correctness: some people think that it is unfair that English gets the shorter characters and that we should all have equal-sized characters to demonstrate world equality. In reality, due to the use of spaces, numbers, and embedded English words, almost every language in the world is shorter in UTF-8 than in UTF-16!

UTF-8 will probably begin to enjoy wider use because it's the default character set for XML.

Lots of information at http://www.cl.cam.ac.uk/~mgk25/unicode.html

The 8-bit Unicode Transformation Format is a computer format for text data. It is a way to store or transmit text that combines the representational range of Unicode with the compactness of plain old ASCII.

UTF-8 is a way to encode Unicode text so that the most usual characters take up one byte each, and other characters take 2 to 4 bytes each.

The number of bytes used by UTF-8 to represent a character depends on the Unicode character number – characters with higher Unicode numbers take up more bytes. If the Unicode character number is in the range 0-127, i.e. characters identical to the US-ASCII character set, then this number, padded out to 8 bits with a leading 0, is the UTF-8 encoding.

All other characters are represented by sequences longer than one byte, with a leading 1 in each byte. The first byte starts with a number of 1's equal to the number of bytes in the sequence, followed by a 0. Subsequent bytes start with 10. E.g. a two-byte character is represented by bits of the form 110xxxxx 10xxxxxx, and a four-byte character by bits of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.

The xxx bits contain the Unicode number of the character. Note that this leaves only 3-6 bits per byte to store the character code: a two-byte sequence holds just 11 payload bits, so characters whose numbers need 12 to 16 bits must be stored in three bytes instead of two.
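
A worked example: é is character number 0xe9, which is 11101001 in binary. Eight bits is too many for the one-byte form, so it takes the two-byte form 110xxxxx 10xxxxxx. Padding the number to 11 bits gives 00011 101001, and slotting those bits in yields 11000011 10101001, i.e. the bytes 0xc3 0xa9.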

UTF-8 is defined to have only one right way to encode each character: where more than one encoding would otherwise fit, the shortest is the only valid one. This is done to prevent the security problem of strings that look identical to the user but differ to the machine, a trick used in URL and domain name spoofing.
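
For example, the slash character '/' is number 0x2f and must be encoded as that single byte, but a careless decoder would also accept the overlong two-byte form 0xc0 0xaf; that lets a string like "../" sneak past a filter that only scans for the bytes 2e 2e 2f.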

UTF-8 is the default encoding for XML documents.

Advantages

An advantage of UTF-8 over other Unicode encodings is that, assuming you are writing text in the Latin alphabet, most byte values in UTF-8 data will be the same as in the equivalent ASCII file. In fact, if you stick to standard Roman letters, digits, control characters and ASCII punctuation, then the UTF-8 encoding of the entire data will be bit-for-bit identical to the ASCII encoding.

UTF-8 saves space over the UTF-16 or UTF-32 encodings of Unicode text in the common case where 7-bit ASCII characters predominate.

UTF-8 has the advantage over ASCII that the full range of Unicode characters can be stored therein.

A byte sequence that represents an entire UTF-8 character can never occur as a substring of a longer character. This makes parsing UTF-8 simpler.

Null bytes (all zero) never occur in UTF-8 text except to encode the null character. This contrasts with UTF-16, where a zero byte accompanies every character in the range 0-255 (which includes all the ordinary ASCII characters). This is important, as much old program code, especially in the C programming language, was written with plain ASCII text in mind and interprets a null byte as the end of the text.
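
For instance, the UTF-16 encoding of "A" is the byte pair 0x00 0x41 (or 0x41 0x00, depending on byte order); byte-oriented code like strlen() stops at the zero byte and sees an empty or truncated string.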

The length of a character can be determined from its first byte. If the top bit is zero, the character is one byte long. Otherwise, the number of leading one bits gives the length in bytes.

A reader can synchronise with a UTF-8 stream that it picks up in mid-stream: the next character's start byte will always begin with the bit 0 or with the bits 11 (or, to put it differently, the next start byte is the first byte that doesn't begin with 10).
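
Both properties fit in a few lines of C (a sketch with made-up function names; the length check assumes RFC 3629's four-byte limit):

/* Expected total length of the character whose first byte is 'b':
 * count the leading one bits (none means a one-byte character). */
int utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return -1;                          /* continuation or invalid lead byte */
}

/* Resynchronise mid-stream: skip 10xxxxxx continuation bytes until
 * the next character start byte (or the end of the buffer). */
const unsigned char *utf8_resync(const unsigned char *p,
                                 const unsigned char *end)
{
    while (p < end && (*p & 0xC0) == 0x80)
        p++;
    return p;
}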

Disadvantages

UTF-8 has the disadvantage that many East Asian characters are represented by 3 bytes each, whereas in UTF-16 every character with number under 2^16 (65536), Latin or Oriental, takes up 2 bytes. The character 日 (U+65E5), for example, takes three bytes in UTF-8 but two in UTF-16. For a document containing mostly Japanese, Chinese or Korean text, UTF-16 may be a more efficient encoding.

Variable-width characters are more complex to process than fixed width characters.

Data compression is sometimes performed on UTF-8 data to remove the redundancy imposed by the UTF-encoding scheme. This is seen as a separate issue to encoding.

Many UTF-8 parsers do not check for overlong sequences, where an alternative shorter encoding exists, and thus could possibly be exploited in this way.

History

UTF-8 was invented by Ken Thompson in 1992 and implemented by Rob Pike and Ken Thompson in the Plan 9 operating system immediately thereafter. It was initially supported by IBM.
UTF-8 is described in RFC 3629, and its support in Internet protocols is mandated by RFC 2277.

For more details see Wikipedia: http://en.wikipedia.org/wiki/Utf-8
