UTF-16 (thing) by tongpoo - Everything2.com

This Unicode Transformation Format serializes each Unicode code point as one 16-bit code unit (two bytes) or, for values above U+FFFF, as two code units (four bytes, a surrogate pair). UTF-16 data can be in either little-endian or big-endian byte order. An initial byte sequence called the byte order mark (BOM) may be placed at the start of the data to signal which order is in use; it is optional, not required. The BOM is the character U+FEFF ZERO WIDTH NO-BREAK SPACE (invisible, so it does no harm if it ends up in the text), and its byte sequence depends on the encoding form:

In UTF-8: EF BB BF
In UTF-16LE: FF FE (Windows 2000 default)
In UTF-16BE: FE FF

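To prevent ambiguity, U+FFFE, the byte-swapped form of the BOM, is guaranteed never to be a character. A reader that finds FFFE at the start of the data therefore knows it is reading with the wrong byte order.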
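The BOM check described above can be sketched in a few lines of Python. This is only an illustration: the function name sniff_bom is made up for this example, and it deliberately ignores UTF-32, whose little-endian BOM (FF FE 00 00) also begins with FF FE.

```python
import codecs

# BOM byte sequences, keyed by the encoding they signal.
BOMS = {
    "utf-8": codecs.BOM_UTF8,          # EF BB BF
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF
}


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found.

    Hypothetical helper for illustration; does not handle UTF-32.
    """
    for name, bom in BOMS.items():
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# U+FEFF serialized in each form gives the sequences listed above.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-16-le") == b"\xff\xfe"
assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"

# Reading a big-endian BOM as little-endian yields U+FFFE,
# which is never a character, so the mistake is detectable.
assert int.from_bytes(b"\xfe\xff", "little") == 0xFFFE

raw = b"\xff\xfeh\x00i\x00"
encoding, skip = sniff_bom(raw)
text = raw[skip:].decode(encoding)
print(encoding, text)  # utf-16-le hi
```

Data without a BOM returns (None, 0), in which case the byte order has to come from somewhere else, such as a protocol header or a platform default.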

The Unicode codespace is divided into several areas, one being the Surrogates Area, which consists of 1,024 high surrogates (U+D800 - U+DBFF) and 1,024 low surrogates (U+DC00 - U+DFFF). A high surrogate followed by a low surrogate forms a surrogate pair that represents a single Unicode scalar value. There are 1,024 × 1,024 = 1,048,576 possible surrogate pairs, and their values can be derived from this formula:

65536 + ((highSurrogate & 1023) << 10) + (lowSurrogate & 1023)

In plain English, it takes the last ten binary digits from each surrogate, concatenates them, and adds 2¹⁶ to that number. As of Version 3.0, none of the surrogate pairs have been assigned.
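As a rough sketch, the formula and its inverse can be written in Python like this. The function names combine_surrogates and split_to_surrogates are invented for this example; the range checks follow the surrogate ranges given above.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Apply the formula above to a high/low surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    return 65536 + ((high & 1023) << 10) + (low & 1023)


def split_to_surrogates(code_point: int):
    """Inverse of the formula: one value above U+FFFF to (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("value does not need a surrogate pair")
    offset = code_point - 65536
    # Top ten bits go in the high surrogate, bottom ten in the low.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 1023)


# The first and last pairs cover U+10000 through U+10FFFF.
assert combine_surrogates(0xD800, 0xDC00) == 0x10000
assert combine_surrogates(0xDBFF, 0xDFFF) == 0x10FFFF

# Round trip, checked against Python's own UTF-16 encoder.
high, low = split_to_surrogates(0x1D11E)
assert (high, low) == (0xD834, 0xDD1E)
assert chr(0x1D11E).encode("utf-16-be") == bytes([0xD8, 0x34, 0xDD, 0x1E])
```

The two boundary assertions also confirm the count: the pairs map onto exactly 0x10FFFF - 0x10000 + 1 = 1,048,576 values.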

When encoding East Asian text, UTF-16 saves about a byte per character over UTF-8 on average, since most such characters take two bytes in UTF-16 but three in UTF-8.
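The size difference is easy to check for a given string. The sample strings below are arbitrary choices for illustration, and "utf-16-le" is used so that no BOM is counted in the lengths.

```python
# Eight Japanese characters, all below U+FFFF.
sample = "日本語のテキスト"

utf8_len = len(sample.encode("utf-8"))       # three bytes per character here
utf16_len = len(sample.encode("utf-16-le"))  # two bytes per character here

assert len(sample) == 8
assert utf8_len == 24
assert utf16_len == 16

# For plain ASCII the comparison goes the other way.
ascii_sample = "plain ASCII"
assert len(ascii_sample.encode("utf-8")) == 11
assert len(ascii_sample.encode("utf-16-le")) == 22
```

Real documents usually mix scripts, markup and whitespace, so the actual saving depends on the text.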

Sources (PDF and PowerPoint files):

"The Unicode Standard, Version 3.0" Section 2.3, Encoding Forms.
http://www.Unicode.org/book/ch02.pdf

"The Unicode Standard, Version 3.0" Section 3.7, Surrogates.
http://www.Unicode.org/book/ch03.pdf

"The Unicode Standard, Version 3.0" Section 5.4, Handling Surrogate Pairs.
http://www.Unicode.org/book/ch05.pdf

"Surrogate Support in Microsoft Products."
http://www.Unicode.org/iuc/iuc18/papers/a8.ppt
