Rationale

Since Everything2 pages do not contain explicit encoding tags (and the user cannot specify them), the default character set on Everything2 is ISO 8859-1 (aka Latin-1). This is great for English and also sufficient for most Western European languages, since their accented characters (é, ô, ñ, ä...) will show up just fine, but anything outside the 256 code points of Latin-1 will run into problems. There is exactly one acceptable solution: entering Unicode characters as HTML character entities.

As you may know, Unicode is a character set that aims to cover every single script on the planet (and beyond). Characters on the main plane of Unicode, the Basic Multilingual Plane (U+0000 to U+FFFF), which almost certainly includes everything you will ever need, can be accessed in HTML with the escape sequence &#xcode;, where code is the character's hexadecimal code point. There are several advantages to this approach:

  • No character set switching. Characters encoded this way are instantly visible, without the user tweaking his encodings, fonts, etc. This is by far the most important single reason to use Unicode on E2.
  • Multiple languages in one page. Unicode characters are distinct and unique, so they can be mixed and matched freely. There is no other way to use both, say, Hebrew and Arabic in the same writeup.
  • Guaranteed E2 support. Character entities are interpreted and stored as ordinary text, so they will never be mangled by EDB.
  • Graceful failover. If the user's browser does not support Unicode (or the subset in question), the user will see question marks or little squares, instead of random 8-bit garbage, which may include control codes that wreak havoc on formatting. (Unfortunately, some very old and/or broken browsers may refuse to recognize the existence of two-byte entities and print the entity string in full, which will look horrible.)

There are, of course, a few downsides:

  • Inefficiency. Each hex-coded character entity takes up seven or eight bytes (&#x, three or four hex digits, and the closing semicolon), whereas a national character set encoding may squeeze down to one or two. For small quantities of text, this is not really an issue.
  • Difficulty of entering. Only a few programs can generate HTML encoded characters automatically -- but see the tips on working around this in the sections on manual entry and automated conversion.
  • Lack of support. Older browsers typically do not support extended character entities at all, or require painful manual configuration (esp. fonts) for them. Both Mozilla and later versions of Internet Explorer support them quite well right out of the box, though, and this problem will gradually solve itself. (Also bear in mind that most older systems that do not support Unicode without tinkering will also not support any other encoding without tinkering.)
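
As a rough illustration of the trade-off above, here is a minimal Python sketch. The helper name to_entities is hypothetical and is not part of E2 or of any tool mentioned here. It leaves Latin-1 characters alone, replaces everything else with a hex entity, and then compares the size of the result with two other encodings.

```python
def to_entities(text: str) -> str:
    """Replace every character outside Latin-1 with a hex character entity.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x100:
            out.append(ch)              # Latin-1 characters pass through untouched
        else:
            out.append(f"&#x{cp:X};")   # everything else becomes &#xcode;
    return "".join(out)


print(to_entities("Heavenly Stems (天干)"))  # Heavenly Stems (&#x5929;&#x5E72;)

# Size of the two Chinese characters alone in three representations
print(len(to_entities("天干")))       # 16 (entities are pure ASCII, so 16 bytes)
print(len("天干".encode("utf-8")))    # 6
print(len("天干".encode("big5")))     # 4
```

The entity form is the bulkiest of the three, but it is the only one that survives on a page served as Latin-1.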

When to Use Unicode

Unicode character entities are at their best when you have to refer to small bits of other languages in writeups written mostly in English. For example, a writeup on Chinese astrology may want to mention the original characters (天干) for what are in English dubbed the Heavenly Stems. Speakers of Hebrew may want to trace how בית לחם became Bethlehem, while those of Arabic may wonder how غزة became Gaza. A writeup on Budapest's metro system can't spell Kőbánya-Kispest properly without using a character entity for ő. Students of Japanese can find out what Tokyo (東京) really means. And the list goes on! I recommend putting the Unicode in parentheses after the transliteration or translation, so people who do not speak the language or whose browsers do not support Unicode will still have some idea of what you are talking about.

When Not to Use Unicode

Material written entirely in non-Latin1 languages, on the other hand, is probably best written with some other encoding; Unicode's own UTF-8 might not be a bad choice. As an experiment, I did node the Three Gates of Tosotsu (a Zen text dating back to 600 AD or so) in the original using character entities, but I got a few complaints about screwy formatting -- Chinese doesn't use spaces between words, so even a short line written as an unbroken string of entities will stretch into hundreds of characters on systems that do not fully support Unicode.

Using Unicode characters in node titles is also a bit of an iffy business, since they're usually pretty tough to enter and also because EDB doesn't realize that &#xhex;, &#x0hex;, &#dec; and &#0dec; are all the same character. Then again, for "non-transcribable" languages like Hebrew and Arabic, entering the words in Unicode is pretty much the only way to get a unique and identifiable name. But until the search code gets tweaked for better support for non-Latin1 characters, I would have to recommend keeping Unicode out of titles.
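
To make the title problem concrete, here is a minimal Python sketch of the kind of normalization that would let the four spellings compare as equal. This is not how EDB or the E2 search code works. The function name normalize_entities and the choice of a zero-padded, uppercase hex form as the canonical spelling are assumptions made purely for illustration.

```python
import re

# Matches numeric character references in either hex (&#x5D1;) or decimal (&#1489;) form
_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def normalize_entities(title: str) -> str:
    """Rewrite every numeric entity as &#xHHHH; (hypothetical canonical form)."""
    def canonical(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            cp = int(hex_digits, 16)
        else:
            cp = int(dec_digits, 10)
        return f"&#x{cp:04X};"
    return _ENTITY.sub(canonical, title)


# Four spellings of the Hebrew letter bet (U+05D1)
variants = ["&#x5D1;", "&#x05D1;", "&#1489;", "&#01489;"]
print({normalize_entities(v) for v in variants})  # {'&#x05D1;'}
```

Named entities such as &amp; are left untouched, since the pattern only matches the numeric forms.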

Notes on Composed, Right-To-Left and Other Odd Scripts

Some scripts, like Devanagari and Hangul, compose words from individual letters. Some scripts, like Hebrew, write from right to left. A few scripts, like Arabic, do both. Fortunately, Unicode hides all the hellishly complex details of implementation, so غزة (Gaza) is written in Unicode as ghain-zain-teh marbuta, &#x063A;&#x0632;&#x0629;, and your browser's rendering engine will automatically reverse the order and join them as script so that ghain is initial, zain final and teh isolated.
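
The logical ordering described above can be checked with a few lines of Python using the standard unicodedata module. This is only an inspection sketch: it lists the characters in the order they are stored, along with their Unicode names and bidirectional classes, and performs no reordering or joining itself.

```python
import unicodedata

# Gaza, stored in logical (reading) order: ghain, zain, teh marbuta
gaza = "\u063A\u0632\u0629"

for ch in gaza:
    # bidirectional() returns the character's bidi class; the renderer
    # uses it to decide the visual direction of each run
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  bidi={unicodedata.bidirectional(ch)}")

# The same string as it would be typed into a writeup
entities = "".join(f"&#x{ord(ch):04X};" for ch in gaza)
print(entities)  # &#x063A;&#x0632;&#x0629;
```

The entities are written in the same order as the word is read aloud; reversing them by hand would produce the wrong word.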

As these computations are left to the user's display engine, it is possible that the browser does not know the proper rendering method or that there are bugs in the rendering code -- for example, Mozilla (at time of writing) still has some difficulties with bidirectional scripts. There is nothing you can do about this, but again, browsers that dig Unicode will usually get these right and the issue is irrelevant for systems that don't support Unicode at all.

Manual Entry

Unicode character entities can be written by hand by looking up each code in a character table and entering it as &#xcode;. Tables of codes can be found at www.unicode.org, the authoritative source, and www.hclrss.demon.co.uk/unicode, which gives the characters packaged more conveniently as HTML tables.

This method is, however, intensely painful for anything more complex than a single name. Also, while OK for alphabetic or syllabic scripts, converting Japanese kanji or Chinese hanzi (漢字) by browsing through 5000 characters is not fun.

Automated Conversion

Some tools can generate character entities on the fly, most notably perhaps Microsoft Word, which converts any script into entities if you Save As... HTML. Alas, this is accompanied by lots of other HTML mangling, so for E2 you'll have to pick out the entities by hand from the generated junk and paste them back into the original. This is OK for one-off operations, but soon becomes painful.

A better option is Java, which includes a remarkable set of tools that can convert almost any encoding into Unicode and back. Once the text is Unicode, it's a simple matter to extract the hex code and pad it, and that's what my little utility J2U does. You'll need a working Java environment to run J2U; writing an applet interface to the tool is on my TODO list.
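
The decode-then-pad idea can be sketched in a few lines. The following is not J2U's code, and it is written in Python rather than Java; the function name legacy_to_entities is hypothetical. It assumes the input is a byte string in a known legacy encoding, decodes it to Unicode, and emits each character outside Latin-1 as a four-digit hex entity.

```python
def legacy_to_entities(raw: bytes, encoding: str) -> str:
    """Decode bytes from a legacy encoding and emit padded hex entities.

    Hypothetical sketch of the approach described above, not J2U itself.
    """
    text = raw.decode(encoding)  # step 1: legacy encoding -> Unicode
    return "".join(
        # step 2: keep Latin-1 as-is, pad every other code point to four hex digits
        ch if ord(ch) < 0x100 else f"&#x{ord(ch):04X};"
        for ch in text
    )


# Tokyo, as it might arrive from a Shift_JIS text file
raw = "東京".encode("shift_jis")
print(legacy_to_entities(raw, "shift_jis"))  # &#x6771;&#x4EAC;
```

The same call works for any codec Python knows, for example "euc_jp" or "big5", as long as the encoding name matches the bytes actually supplied.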

For Japanese, you can cut and paste strings in any encoding into XJDIC or WWWJDIC (at http://www.csse.monash.edu.au/~jwb/wwwjdic.html), after which performing an "Examine Kanji" on the word gives the Unicode as Uxxxx. unicode.org's Unihan database search provides similar facilities for all languages that use 漢字.

A few more tools and tips sent in by kind noders:

  • GNU Recode, for converting anything to anything else
  • Mozilla's Composer, for realtime conversion of native IME input into HTML entities

Cheers to Gorgonzola, lj, Oolong, tres equis and WWWWolf for corrections and additions.