Encoding Unicode in the obvious way, as two bytes per character. The standard puts the high byte first (so that byte-wise string sorting matches character order), but due to the prevalence of little-endian Intel processors and lazy programmers in Seattle, this data is often stored low byte first.
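To make the two byte orders concrete, here is a minimal Python sketch. The function name encode_ucs2 and its big_endian parameter are my own invention for illustration, not part of any library; it simply packs each code point as one 16-bit unit and refuses anything that does not fit.

```python
import struct

def encode_ucs2(text, big_endian=True):
    """Pack each character as one 16-bit unit (hypothetical helper)."""
    fmt = ">H" if big_endian else "<H"  # > = high byte first, < = low byte first
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Two bytes per character cannot represent this code point.
            raise ValueError("U+%04X does not fit in two bytes" % cp)
        out += struct.pack(fmt, cp)
    return bytes(out)

print(encode_ucs2("A").hex())                    # 0041
print(encode_ucs2("A", big_endian=False).hex())  # 4100
```

With the high byte first, comparing the raw bytes gives the same order as comparing the characters themselves, which is the sorting property mentioned above; with the low byte first it does not.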

In order to allow automatic detection of the byte order, it has become customary on some platforms (notably Win32) to start every Unicode file with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), also known as the Byte-Order Mark (BOM). Its byte-swapped equivalent U+FFFE is not a valid Unicode character, so the BOM helps to unambiguously distinguish the big-endian and little-endian variants of UTF-16 and UTF-32.
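A rough sketch of that detection in Python, for the two-byte case only. The name detect_byte_order is hypothetical, and the sketch assumes the data really is two-byte Unicode: a little-endian UTF-32 file also begins with the bytes FF FE, so a real detector would have to check for that first.

```python
def detect_byte_order(data):
    """Guess the byte order of two-byte Unicode data from its BOM.

    Returns "big", "little", or None when no BOM is present.
    Hypothetical helper; assumes the data is not UTF-32.
    """
    prefix = data[:2]
    if prefix == b"\xfe\xff":   # U+FEFF stored high byte first
        return "big"
    if prefix == b"\xff\xfe":   # U+FEFF stored low byte first
        return "little"
    return None

print(detect_byte_order(b"\xfe\xff\x00A"))  # big
print(detect_byte_order(b"\xff\xfeA\x00"))  # little
print(detect_byte_order(b"\x00A"))          # None
```

Without a BOM the function gives up rather than guessing, since nothing in the bytes themselves says which order was used.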

This is not exactly the same as UTF-16, but pretty close. UTF-16 contains bogus enhancements (surrogate pairs) that let it encode more than 65536 possible characters.
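The enhancement in question is the surrogate pair: UTF-16 spends two 16-bit units from reserved ranges on each character above U+FFFE's neighbourhood, that is, above U+FFFF. A small Python sketch of the arithmetic follows; to_surrogate_pair is a name I made up for illustration, and the last line checks the result against Python's built-in utf-16-be codec.

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into UTF-16 surrogates (hypothetical helper)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("U+%04X is not encoded as a surrogate pair" % cp)
    offset = cp - 0x10000             # 20 bits remain after subtracting 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go in the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F600)
print("%04X %04X" % (high, low))                # D83D DE00
print(chr(0x1F600).encode("utf-16-be").hex())   # d83dde00
```

Anything at or below U+FFFF (outside the surrogate ranges themselves) is stored as a single unit, exactly as in the two-byte encoding, which is why the two are "pretty close".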

I strongly recommend the use of UTF-8 for all text processing.
