This Unicode Transformation Format serializes each Unicode value as two bytes or, for values above U+FFFF, as four bytes (a surrogate pair). UTF-16 can be in either little-endian or big-endian format. An initial byte sequence called the byte order mark (BOM) indicates which byte order is in use. The BOM is U+FEFF ZERO WIDTH NO-BREAK SPACE (so it has no visible effect on the text), and its byte sequence depends on the byte order: FE FF in big-endian UTF-16 and FF FE in little-endian UTF-16.
To prevent ambiguity, U+FFFE, which is what a byte-swapped BOM would decode to, is not assigned to any character.
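As a rough illustration of how the byte order mark distinguishes the two byte orders, the following Python sketch serializes U+FEFF both ways and checks a buffer for a leading BOM. The function name detect_utf16_byte_order is a hypothetical helper invented for this example; the codecs constants come from the Python standard library.

```python
import codecs

# U+FEFF serialized in each byte order gives the two UTF-16 BOMs.
bom_be = "\ufeff".encode("utf-16-be")
bom_le = "\ufeff".encode("utf-16-le")
assert bom_be == codecs.BOM_UTF16_BE == b"\xfe\xff"
assert bom_le == codecs.BOM_UTF16_LE == b"\xff\xfe"


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on a leading BOM.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return "unknown"


# "A" is U+0041; its two bytes simply swap between byte orders.
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-16-le").hex())  # 4100
print(detect_utf16_byte_order(codecs.BOM_UTF16_LE + "A".encode("utf-16-le")))  # little
```

Because U+FFFE is never a character, a reader that sees FF FE at the start of a UTF-16 stream can safely conclude that the data is little-endian rather than a big-endian stream beginning with U+FFFE.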
The Unicode codespace is allocated into several areas, one being the Surrogate Area, which consists of 1,024 high surrogates (U+D800 - U+DBFF) and 1,024 low surrogates (U+DC00 - U+DFFF).
A high surrogate followed by a low surrogate forms a surrogate pair that represents a single Unicode scalar value. Exactly 1,048,576 (1,024 x 1,024) surrogate pairs are possible, and their values can be derived from this formula:
65536 + ((highSurrogate & 1023) << 10) + (lowSurrogate & 1023)
In plain English, it takes the last ten binary digits from each surrogate, concatenates them, and adds 2^16 (65,536) to the result.
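The formula above can be sketched in Python as follows. The function names combine_surrogates and split_into_surrogates are hypothetical, chosen only for this example, and the range checks are an added assumption rather than part of the formula itself. The inverse function is included to show that the mapping is reversible.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Apply the formula above to a high/low surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Keep the last ten bits of each surrogate, concatenate, add 2^16.
    return 65536 + ((high & 1023) << 10) + (low & 1023)


def split_into_surrogates(value: int) -> tuple:
    """Inverse of combine_surrogates for values above U+FFFF."""
    if not 0x10000 <= value <= 0x10FFFF:
        raise ValueError("value is not encoded as a surrogate pair")
    offset = value - 65536
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 1023)


# The first and last surrogate pairs map to the ends of the range.
print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000
print(hex(combine_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff

# Cross-check against Python's own UTF-16 decoder.
assert b"\xd8\x00\xdc\x00".decode("utf-16-be") == chr(0x10000)
```

The two boundary cases show that surrogate pairs cover U+10000 through U+10FFFF.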
As of Version 3.0 of the Unicode Standard, no characters have been assigned to the values that surrogate pairs represent.
UTF-16 on average can save about a byte per character over UTF-8 when encoding East Asian text, because most East Asian characters take three bytes in UTF-8 but only two in UTF-16.
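The size difference can be checked with a small Python sketch. The sample string is an arbitrary Japanese phrase picked for this example, not taken from any source; text that mixes in ASCII characters would show a smaller saving or none at all.

```python
# Arbitrary sample: eight Japanese characters, all below U+FFFF.
sample = "日本語のテキスト"

# Use the BOM-less big-endian codec so only the characters are counted.
utf8_size = len(sample.encode("utf-8"))
utf16_size = len(sample.encode("utf-16-be"))

print(len(sample), utf8_size, utf16_size)  # 8 24 16
print((utf8_size - utf16_size) / len(sample))  # 1.0
```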
Sources (PDF and PowerPoint files):
- "The Unicode Standard, Version 3.0" Section 2.3, Encoding Forms.
- "The Unicode Standard, Version 3.0" Section 3.7, Surrogates.
- "The Unicode Standard, Version 3.0" Section 5.4, Handling Surrogate Pairs.
- "Surrogate Support in Microsoft Products."