Unicode encoded as two bytes per character. The obvious way to do this is to put the bottom 16 bits into the two bytes (high byte first so sorting order is preserved), and this is called UCS-2. When people realized that (due to Chinese, mostly) more than 65,536 characters were needed, they came up with this bastard encoding, rather than using UTF-8, which is a sensible encoding. Microsoft uses this encoding in their stuff, sigh.

UTF-16 can encode Unicode up to 0x10ffff. All codes up to 0xffff that are not in the range 0xd800-0xdfff are encoded as two bytes: high byte first, low byte second (in the big-endian form).
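As a rough illustration of the two-byte case, here is a minimal Python sketch. The function name encode_bmp_be is made up for this example, and it assumes the big-endian byte order described above.

```python
def encode_bmp_be(code_point):
    """Encode one code point from 0x0000-0xffff (surrogates excluded)
    as two bytes, high byte first. Hypothetical helper for illustration."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("needs a surrogate pair, not a single 16-bit unit")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values cannot be encoded on their own")
    # High byte first, low byte second.
    return bytes([code_point >> 8, code_point & 0xFF])
```

For the euro sign, encode_bmp_be(0x20AC) gives the bytes 20 AC, which matches what Python's own "utf-16-be" codec produces.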

The "characters" 0xd800-0xdfff are called "surrogate characters" and must appear in pairs. These are combined in a complex way to produce the characters in the range 0x10000 through 0x10ffff. They also defeat the only plausible advantage of UTF-16, which is that the characters are the same size!
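The "complex way" is mostly bit shuffling. A minimal sketch of the encoding direction follows; to_surrogate_pair is a hypothetical name, not something from any standard library.

```python
def to_surrogate_pair(code_point):
    """Split a code point in 0x10000-0x10ffff into (high, low) surrogates.
    Hypothetical helper for illustration."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only codes above 0xffff use surrogate pairs")
    offset = code_point - 0x10000          # leaves a 20-bit number
    high = 0xD800 | (offset >> 10)         # top ten bits
    low = 0xDC00 | (offset & 0x3FF)        # bottom ten bits
    return high, low
```

For example, 0x1f600 splits into 0xd83d followed by 0xde00, so that one character takes four bytes while everything below 0x10000 takes two.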

Don't use this, it is just proof that the standards people have their heads up their asses. Use UTF-8 instead.

This Unicode Transformation Format serializes each Unicode value as two bytes or, for values above U+FFFF, four bytes (a surrogate pair). UTF-16 can be in either little-endian or big-endian format. An initial byte sequence called the byte order mark (BOM) may be placed at the start of the data to indicate which one is in use. The BOM is U+FEFF ZERO WIDTH NO-BREAK SPACE (therefore it doesn't do anything visible), and its byte sequence depends on the byte order: FE FF in big-endian UTF-16, FF FE in little-endian UTF-16. To prevent ambiguity, U+FFFE is guaranteed never to be assigned to a character.
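A small sketch of how a reader might sniff the byte order from the BOM. The function name detect_utf16_byte_order is an assumption made for this example; the two BOM constants come from Python's standard codecs module.

```python
import codecs

def detect_utf16_byte_order(data):
    """Return 'big-endian', 'little-endian', or None when no UTF-16 BOM
    leads the data. Hypothetical helper for illustration."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return None
```

When the result is None, the byte order has to come from somewhere else, such as a protocol header or a convention agreed on in advance.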

The Unicode codespace is allocated into several areas, one being the Surrogate Area, which consists of 1,024 high surrogates (U+D800 - U+DBFF) and 1,024 low surrogates (U+DC00 - U+DFFF). A high surrogate, followed by a low surrogate, forms a surrogate pair that represents a single Unicode scalar value. Approximately one million surrogate pairs are possible, and their values can be derived from this formula:
65536 + ((highSurrogate & 1023) << 10) + (lowSurrogate & 1023)
In plain English, it takes the last ten binary digits from both surrogates, concatenates those, and adds 2^16 (65,536) to that number. As of Version 3.0, none of the surrogate pairs have been assigned.
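The formula translates almost directly into Python. This is only a sketch: from_surrogate_pair is a hypothetical name, and the range checks are an addition that the formula itself does not include.

```python
def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate into one Unicode scalar value.
    Hypothetical helper for illustration."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    # Same arithmetic as the formula: 1023 is 0x3FF, 65536 is 2^16.
    return 65536 + ((high & 1023) << 10) + (low & 1023)
```

Feeding it the pair 0xD83D, 0xDE00 gives 0x1F600, and the lowest possible pair, 0xD800 with 0xDC00, gives 0x10000.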

UTF-16 on average can save about a byte per character over UTF-8 when encoding East Asian text.
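The size difference is easy to check with Python's built-in codecs. The helper name encoded_sizes and the sample string are just choices made for this sketch.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given string.
    UTF-16 is measured without a BOM. Hypothetical helper for illustration."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

# Eight Japanese characters, all below U+FFFF.
sample = "日本語のテキスト"
utf8_size, utf16_size = encoded_sizes(sample)
print(utf8_size, utf16_size)  # prints "24 16"
```

Each of these characters takes three bytes in UTF-8 and two in UTF-16. The saving reverses for ASCII-heavy text: encoded_sizes("abc") comes out as (3, 6).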

Sources (PDF and PowerPoint files):
  • "The Unicode Standard, Version 3.0" Section 2.3, Encoding Forms.

  • "The Unicode Standard, Version 3.0" Section 3.7, Surrogates.

  • "The Unicode Standard, Version 3.0" Section 5.4, Handling Surrogate Pairs.

  • "Surrogate Support in Microsoft Products."
