Because of the number and complexity of the characters used in Japanese and many other Asian languages, the standard 8-bit ANSI character set can hold only a small subset of even the basic kana of Japanese. To counter this, many encoding formats have been created to handle the storage and display of these characters.

There are at least 5 different encoding formats used primarily for Asian languages; the remaining 3 cover many more languages.

JIS

The first JIS standard, now referred to as Old JIS, has a lot in common with the New JIS standard.
Both formats open a run of Japanese text with a short escape sequence (the ESC control character followed by a two-byte designator). Each kanji or kana is then written as a pair of 7-bit bytes that together give the number of the character to use. Another escape sequence closes the run and switches back to single-byte text. The two standards differ in their opening designator and in the codeset.

Shift-JIS is an encoding developed by Microsoft. It does not use the opening or closing designators, and it uses a different codeset.

NEC JIS is another variation on the JIS format. Like Old and New JIS, it uses opening and closing designators, but its designators and its codeset differ from those of the previous formats.
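As a rough illustration of the framing described above, the sketch below runs the sample word 日本語 (nihongo) through Python's built-in iso2022_jp and shift_jis codecs. The codec names are Python's own, and Python happens to close the run with the ASCII designator "(B" where some files use "(J"; the variable names are only for illustration.

```python
# Encode the sample word with two of Python's built-in Japanese codecs.
text = "日本語"  # nihongo

jis = text.encode("iso2022_jp")   # New JIS style: escape sequences around 7-bit pairs
sjis = text.encode("shift_jis")   # Shift-JIS: no escape sequences at all

# Split the JIS bytes into opening designator, body, and closing designator.
opening, body, closing = jis[:3], jis[3:-3], jis[-3:]

print(opening)  # ESC followed by "$B"
print(body)     # two 7-bit bytes per character
print(closing)  # ESC followed by "(B"
print(sjis)     # 8-bit bytes, no designators
```

The body is exactly two bytes per character, which is why the designators are needed: without them the body is indistinguishable from ordinary ASCII text.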

EUC

EUC - Extended Unix Code
This, much like the other formats, is a multibyte format. It was designed by AT&T and was supported on System V to represent the Asian character sets. It is based on an ISO standard (ISO 2022).

From http://cns-web.bu.edu/pub/djohnson/web_files/i18n/euc.html

EUC defines a variable length multibyte encoding intended primarily for interchange, and a fixed length encoding primarily intended for processing.



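EUC does not use opening or closing designators. Its byte values are not shared with JIS or Unicode, although for Japanese (EUC-JP) each byte is simply the corresponding JIS byte with its high bit set.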
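One way to see how EUC bytes relate to JIS bytes is to compare them directly. The sketch below assumes Python's euc_jp and iso2022_jp codecs as stand-ins for EUC and New JIS, and checks the sample word 日本語 only; it is a demonstration on one word, not a proof for the whole character set.

```python
text = "日本語"  # nihongo

# JIS body: drop the 3-byte escape sequences at each end.
jis_body = text.encode("iso2022_jp")[3:-3]

# EUC-JP: no escape sequences, every byte is 8-bit.
euc = text.encode("euc_jp")

# Set the high bit (0x80) on every JIS byte and compare with EUC.
shifted = bytes(b | 0x80 for b in jis_body)

print(jis_body.hex(" "))
print(euc.hex(" "))
print(shifted == euc)
```

Because every EUC byte of a kanji has its high bit set, a program can tell Japanese text from ASCII without tracking designators.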

Unicode

Unicode is probably the best known of the character encoding formats, as it supports most character sets on the planet. Though it is powerful, it is not the most popular. Much of the conflict arises from the fact that, during the design of Unicode, the characters of the 3 main kanji-based languages (Chinese, Japanese, and parts of Korean) were merged. This has created much political controversy, and has frustrated input method developers, as the arrangement renders many common search methods useless.

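Unicode and UTF-8 are closely related, but they are not the same thing: Unicode is the character set, which assigns a number to each character, and UTF-8 is one of several ways of storing those numbers as bytes. Unicode was originally designed around a 16-bit space in which over 65000 characters can be contained, and it has since been extended beyond that. Unlike the others, its specification is universal, and aims to include all written languages on earth in use today, plus space reserved for future use, user-specified characters, and a compatibility region.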

In its 16-bit form (UTF-16), two bytes are combined to form the 16-bit number of the character; in its UTF-8 form, a character takes from one to four bytes, and kanji and kana take three. In either form, the bytes are interpreted as one character when used in user interfaces or webpages. Systems that do not expect this can misbehave and break characters apart. To circumvent this, a larger (but more robust and universal) method is to write any Unicode character as an HTML entity, such as &#x????; where ???? is a hexadecimal number between 0000 and FFFF (within the valid range of characters), or &#XXXXX; where XXXXX is a decimal number between 00000 and 65535 (within the valid range of characters).
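The entity forms described above can be generated mechanically. The following is a minimal sketch in Python; the function name to_entities and its hexadecimal parameter are made up for this example, and it leaves ASCII characters (including &, < and >) untouched.

```python
import html

def to_entities(text, hexadecimal=True):
    """Replace every non-ASCII character with a numeric HTML entity."""
    out = []
    for ch in text:
        code = ord(ch)  # the character's Unicode number
        if code < 128:
            out.append(ch)                 # plain ASCII passes through
        elif hexadecimal:
            out.append(f"&#x{code:04X};")  # hexadecimal form
        else:
            out.append(f"&#{code};")       # decimal form
    return "".join(out)

hex_form = to_entities("日本語")
dec_form = to_entities("日本語", hexadecimal=False)

print(hex_form)
print(dec_form)
print(html.unescape(hex_form))  # back to the original text
```

Python can also produce the decimal form directly with "日本語".encode("ascii", "xmlcharrefreplace").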

UTF-7's encoding behavior is unusual: the code for a character changes depending on the characters around it, as you can see here:

+TgA- ichi
+Tow- ni
+Tgk- san
+TgA- ichi
+TgBOjA- ichi ni
+TgBOjE4J- ichi ni san

The same issue does not arise when using entities in HTML, as one code represents one character in all cases.
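The +...- strings in the list above match what Python's utf_7 codec produces for the kanji 一 (ichi), 二 (ni) and 三 (san). The sketch below reproduces them and shows where the behavior comes from: the text between + and - is base64 of the 16-bit character numbers run together, so character boundaries do not line up with letter boundaries. The helper name utf7_core is made up for this example, and it ignores the cases where UTF-7 passes ASCII through unchanged.

```python
import base64

def utf7_core(text):
    """Base64 of the UTF-16 (big-endian) bytes, with '=' padding removed."""
    raw = text.encode("utf-16-be")
    return base64.b64encode(raw).rstrip(b"=")

for word in ["一", "二", "三", "一二", "一二三"]:
    encoded = word.encode("utf_7")     # Python's built-in UTF-7 codec
    print(encoded, utf7_core(word))    # the core sits between '+' and '-'
```

Note that the code for 二 on its own (Tow) does not appear anywhere inside the code for 一二 (TgBOjA), which is why searching or splitting UTF-7 text byte by byte is unreliable.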

Some encoding samples (test your browser, IE and Mozilla should both work):

ÆüËܸì - nihongo, EUC
“ú–{Œê - nihongo, Shift-JIS
$BF|K\8l(J - nihongo, New JIS
KF|K\8lH - nihongo, NEC JIS
åe,gžŠ - nihongo, Unicode (UTF-16)
+ZeVnLIqe- - UTF-7
日本語 - Unicode HTML Entities
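The garbled-looking samples above are what the raw bytes of each encoding look like when a viewer assumes a Western single-byte character set. The sketch below reproduces several of them in Python; it assumes the viewer falls back to Windows code page 1252 (cp1252), and the helper name as_seen is made up for this example.

```python
text = "日本語"  # nihongo

def as_seen(raw, assumed="cp1252"):
    """Decode raw bytes with the wrong single-byte codec, as a misconfigured viewer would."""
    return raw.decode(assumed)

print(as_seen(text.encode("euc_jp")), "- EUC")
print(as_seen(text.encode("shift_jis")), "- Shift-JIS")
print(as_seen(text.encode("utf-16-le")), "- Unicode (UTF-16, little-endian)")
print(text.encode("utf_7").decode("ascii"), "- UTF-7")
```

The JIS samples are left out here because their escape characters are invisible control codes, which is why only the designator letters show up in the list above.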

Neither IE nor Mozilla supports all of these formats, so you'll need programs that can.
Programs for using these formats:
JWPce - Supports all of the aforementioned file formats.
Microsoft Global IME - Supports Unicode and Shift-JIS.

A few words on encoding usage and E2:

All kanji/Japanese text (indeed, text in all languages) can and SHOULD be entered using HTML entities. On Windows, Mozilla paired with the IME2000 in Windows 2000/XP can input directly into HTML entities with no additional conversion necessary.

Any questions on formats/software should be directed to the E2 Bakufu, at least one of whom can probably answer any question you may have on this subject.
