Since Everything2 pages do not contain explicit encoding
tags (and the user cannot specify them), the default character set
is ISO-8859-1, also known as Latin-1. This is great for English and
also sufficient for most
Western European languages, since their accented characters
(é, ô, ñ, ä...) will show up just fine, but anything outside
the basic 255 characters will run into problems. There is exactly one
acceptable solution: using Unicode
as HTML character entities.
As you may know, Unicode is a character set that aims to cover
every single script on the planet (and beyond).
Characters on the main plane of Unicode (U+0000 to U+FFFF), which
almost certainly include everything you will ever need,
can be accessed in HTML with the escape sequence
&#xcode;. There are several distinct
advantages to this approach:
- No character set switching. Characters encoded this way
are instantly visible, without the user tweaking his encodings,
fonts, etc. This is by far the most important single reason
to use Unicode on E2.
- Multiple languages in one page. Unicode characters are
distinct and unique, so they can be mixed and matched freely.
There is no
other way to use both, say, Hebrew and Arabic in the same writeup.
- Guaranteed E2 support. Character entities are interpreted
and stored as ordinary text, so they will never be mangled by EDB.
- Graceful failover. If the user's browser does not support
Unicode (or the subset in question), the user will see question
marks or little squares, instead of random 8-bit garbage, which may
include control codes that wreak havoc on formatting.
(Unfortunately, some very old and/or broken browsers may refuse to
recognize the existence of two-byte entities and print the entity
string in full, which will look horrible.)
There are, of course, a few downsides:
- Inefficiency. Each coded character entity takes up seven
bytes, whereas a national character set encoding may squeeze down
to one or two. For small quantities of text, this is not really an issue.
- Difficulty of entering. Only a few programs can
generate HTML-encoded characters automatically -- but there are some
tips on fixing this in the next section.
- Lack of support. Older browsers typically do not
support extended character entities at all, or require painful
manual configuration (esp. fonts) for them. Both Mozilla and later versions of
Internet Explorer support them quite well right out of the box, though, and this problem
will gradually solve itself. (Also bear in mind that most older
systems that do not support Unicode without tinkering will not
support any other encoding without tinkering either.)
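The mechanism behind these trade-offs can be sketched in a few lines of Python, using the standard library's `html` module; `to_entities` is my own hypothetical helper name, not an E2 or browser facility:

```python
import html

def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a hex character reference."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):04X};" for c in text)

encoded = to_entities("Heavenly Stems (天干)")
print(encoded)                 # Heavenly Stems (&#x5929;&#x5E72;)
print(html.unescape(encoded))  # round-trips back to the original string
```

Note how each two-byte character balloons into an eight-byte entity string -- the inefficiency mentioned above, harmless for a phrase but noticeable for a full page.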
When to Use Unicode
Unicode character entities are at their best when you have to refer to
small bits of other languages in writeups written mostly in English.
For example, a writeup on Chinese astrology might
want to mention the original characters
(天干) for what are in English dubbed the Heavenly Stems.
Speakers of Hebrew
may want to trace how a name was written
in the original script, while those of Arabic
may wonder how
غزة became Gaza.
A writeup on Budapest's metro system can't spell Kőbánya-Kispest
properly without using a character entity for ő.
Students of Japanese may want to
find out what Tokyo (東京) actually means.
And the list goes on! I recommend putting the Unicode in
parentheses after the transliteration or translation, so people who do not speak the language or whose
browsers do not support Unicode will still have some idea of what you are talking about.
When Not to Use Unicode
Material written entirely
in non-Latin1 languages, on the other
hand, is probably best written with some other encoding;
Unicode's own UTF-8
might not be a bad choice. As an experiment,
I did node the Three Gates of Tosotsu (a text dating back
to 600 AD or so) in the original using character entities, but I
got a few complaints about screwy formatting -- Chinese doesn't
use spaces between words, so even a short line written as an unbroken
string of entities will stretch into hundreds of characters on systems
that do not fully support Unicode.
Using Unicode characters in node titles is also a bit of an
iffy business, since they're usually pretty tough
to enter, and also because EDB doesn't realize
that the hex form &#xhex; and the decimal form &#dec;
(and their variant spellings) are all
the same character. Then again, for "non-transcribable"
languages like Hebrew and Arabic, entering the words
in Unicode is pretty much the only way to get a unique
and identifiable name. But until the search code gets
tweaked for better support of non-Latin1 characters,
I would have to recommend keeping Unicode out of titles.
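Python's standard `html` module can illustrate why a literal string comparison, like the one EDB apparently performs, fails here even though the character is the same:

```python
import html

hex_form = "&#x5D2;"   # Hebrew letter gimel as a hex character reference
dec_form = "&#1490;"   # the very same character, written in decimal

print(hex_form == dec_form)                                # False: the strings differ
print(html.unescape(hex_form) == html.unescape(dec_form))  # True: one character
```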
Notes on Composed, Right-To-Left and Other Odd Scripts
Some scripts, like Devanagari, compose words from
individual letters. Some scripts, like Hebrew, write from right
to left. A few scripts, like Arabic, are both. Fortunately,
Unicode hides all the hellishly complex details of implementation, so
Gaza (غزة) is written in Unicode with its letters in logical order --
ghain, zain, teh marbuta -- and your browser's
rendering engine will automatically reverse the order for display and
join the letters into connected script, so that ghain
and the rest take their joined rather than isolated forms.
As these computations are left to the user's display engine, it is
possible that the browser does not know the proper
rendering method, or that there are bugs in the rendering code --
for example, Mozilla (at time of writing)
still has some difficulties with bidirectional text. There is
nothing you can do about this, but again, browsers that dig Unicode
will usually get these right, and the issue is irrelevant for
systems that don't support Unicode at all.
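The logical-order storage described above can be inspected with Python's standard `unicodedata` module; the display-time reversal and joining are the renderer's job and invisible at this level:

```python
import unicodedata

word = "غزة"  # Gaza: typed and stored in logical order, displayed right-to-left
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+063A ARABIC LETTER GHAIN
# U+0632 ARABIC LETTER ZAIN
# U+0629 ARABIC LETTER TEH MARBUTA
```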
Unicode character entities can be written by hand by looking up
the code in a character table and entering them as
&#xcode;. Tables of codes
can be found at www.unicode.org, the authoritative source,
as well as at various sites that repackage the
characters more conveniently as HTML tables.
This method is, however, intensely painful for anything more complex
than a single name. Also, while OK for alphabetic or
syllabic scripts, converting Japanese kanji or Chinese hanzi
(漢字) by browsing through
5000 characters is not fun.
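The table lookup can also be done programmatically rather than by eye -- a sketch using Python's standard `unicodedata` module:

```python
import unicodedata

ch = "ő"
print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0151 LATIN SMALL LETTER O WITH DOUBLE ACUTE

print(f"&#x{ord(ch):04X};")  # the entity string to paste into a writeup
# &#x0151;

# ...and in reverse, looking a character up by its official name:
print(unicodedata.lookup("LATIN SMALL LETTER O WITH DOUBLE ACUTE"))
```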
Some tools can generate character entities on the fly, most
notably perhaps Microsoft Word, which converts any script
into entities if you Save As... HTML. Alas, this is
accompanied by lots of other HTML mangling, so for E2 you'll have
to pick out the entities by hand from the generated junk and paste
them back into the original. This is OK for one-off operations, but
soon becomes painful.
A better option is Java, which includes a remarkable set of
tools that can convert almost any encoding into Unicode and back.
Once the text is Unicode, it's a simple matter to extract the hex
code and pad it, and that's what my little utility J2U does.
You'll need a working Java environment to run J2U; writing an
applet interface to the tool is on my TODO list.
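I can't reproduce J2U itself here, but the core conversion it performs -- national encoding to Unicode to padded hex entities -- can be sketched in a few lines of Python (the function name is mine, and Shift-JIS stands in for any national charset):

```python
def bytes_to_entities(raw: bytes, encoding: str) -> str:
    """Decode bytes from a national charset, then emit padded hex entities."""
    text = raw.decode(encoding)                       # e.g. Shift-JIS -> Unicode
    return "".join(f"&#x{ord(c):04X};" for c in text)

shift_jis_bytes = "東京".encode("shift_jis")  # Tokyo in Shift-JIS
print(bytes_to_entities(shift_jis_bytes, "shift_jis"))
# &#x6771;&#x4EAC;
```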
For Japanese, you can cut and paste strings in any encoding into
XJDIC or WWWJDIC
(at http://www.csse.monash.edu.au/~jwb/wwwjdic.html), after which performing an "Examine Kanji" on the word
gives the Unicode as Uxxxx. unicode.org's Unihan
database search provides similar facilities for all languages
that use 漢字.
A few more tools and tips sent in by kind noders:
- GNU Recode, for converting anything to anything else
- Mozilla's Composer, for realtime conversion of native IME input into HTML entities
Cheers to Gorgonzola, lj, Oolong, tres equis and WWWWolf for corrections and additions.