Choosing a writing system for a language

The choice of a specific writing system for a language is motivated by a great many factors, and choosing one isn't as simple as finding something that looks pretty. Political and cultural factors are among the most important. Hindi is written in the Devanagari syllabic alphabet (also known as an 'abugida'), while Urdu is written in a heavily-modified form of the Arabic consonantal alphabet (an 'abjad'). Some might consider Hindi and Urdu to be the same language (due to their common origin in the Indo-Aryan branch of the Indo-European family and their high degree of mutual intelligibility), but the speakers themselves mostly consider them to be two different languages due to the complex web of political and cultural realities that separate them: Urdu is spoken mainly in Pakistan, and Hindi in India. A similar situation applies for Serbian (officially written in the Cyrillic alphabet, though the Latin alphabet is also in common use) and Croatian (written in the Latin alphabet). Political and cultural separation is literally written down in black and white, and has a powerful visual effect in delineating group identities.

Practical linguistic considerations also apply. Some languages are suited to one type of writing system, and some are suited to others. This, much to the chagrin of linguists, often takes a back seat to the above-mentioned concerns. Turkish is one example of a language whose former writing system (a kludge of the Arabic abjad) was decided on political grounds, even though it was very poorly suited to writing down the Turkish language: recall that Turkey was the heart of the Ottoman Empire, and it has been part of the Islamic world for roughly a millennium. Using the Arabic script was a decision on the part of the scribes (and those who paid them) to cement the language's identity in the Ottoman world. That lasted until 1928, when Mustafa Kemal Atatürk decreed that Turkey should henceforth use a variant of the Latin alphabet, the same one used today.

Similar considerations applied when the Western world began colonizing (and 'civilizing') the rest of the world. When the powers-that-be saw fit to allow the natives to continue to use their languages, efforts were expended to develop writing systems for them, whether on the initiative of the natives themselves or on the decrees of their colonial masters. These writing systems tended to be the same as those used by the colonizers: Central Asian peoples had Cyrillic orthographies mapped onto their languages, the European colonies used the Latin alphabet, and the Japanese even created a variant of their katakana syllabary to write down the Ainu language spoken on Hokkaido (a similar system was set in place for a very short time in Taiwan, when it was a Japanese colony).

Most of the work on these orthographies was done by missionaries and anthropologists, who were compelled by their professions to have a good working knowledge of the local languages anyway. These orthographies tended to be phonemically transparent; that is, not plagued with the inconsistencies and quirks that a writing system accumulates while being used over centuries for an ever-shifting spoken language. You write as you speak, at least until a local native literary tradition springs up and cements the written language into a more-or-less fixed form.

Speaking of orthographic conventions, designing an orthography is in many ways simpler than adopting a specific writing system; once you decide to use one alphabet over another, deciding how to write down the language with that alphabet is a step best left to linguists.

Some languages are better suited to certain orthographic methods over others:

Agglutinating languages --- in other words, languages that shove lots of prefixes and suffixes onto a root, making for some really long words. It is probably best to stick to the 'one phoneme, one grapheme' rule: for every sound (vowel, consonant or otherwise), you use one and only one letter. For this, you probably need to use diacritics, although they might be difficult to typeset (the rise of Unicode and computer-based publishing has thankfully made this problem much simpler to solve). Turkish is one example of this in action: nearly every phoneme is neatly matched to one grapheme, and overall, it works very well for the Turkish language. The Finnish orthography is similarly blessed with this feature.

Languages with large phonemic inventories --- these are difficult to design orthographies for. 'One phoneme, one grapheme' is nearly impossible to stick to without cluttering up the visual space with diacritics. You'll probably have to use letter combinations to represent single phonemes. Be careful with this, and make your letter combinations intuitive. You probably don't want to use the digraph 'qu' to represent /kw/ if your language contains both /q/ (as in Arabic; pronounced like /k/, but much further back) and /w/ as phonemes (Arabic has both, and the letter combination 'qu' would be read as something like 'ko').

Languages with both agglutination and large phonemic inventories --- you're pretty much screwed. You'll have to use either a lot of diacritics, or a lot of letter combinations, and probably both. Many African and Native American languages face this problem, due to the structure of their languages. Still, KiSwahili (spoken in Eastern Africa) has a growing literary tradition, despite having agglutinative tendencies and a respectably sized phonemic inventory.

Languages with small phonemic inventories --- it's your lucky day. 'One phoneme, one grapheme' is extremely easy to stick to, and you probably don't need to use either letter combinations or diacritic marks. Hawai'ian is one example of this; the linguists must've thanked their lucky stars for the fact that Hawai'ian has a tiny phonemic inventory (only five vowels and eight consonants).

Languages with tones --- you're pretty much forced to use diacritics (see: Vietnamese), unless you want to use special letter combinations to indicate tone (see: Hmong).
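The two rules of thumb in the list above ('one phoneme, one grapheme', and 'keep letter combinations unambiguous') can be checked mechanically. What follows is only a rough sketch in Python: the function names are made up for illustration, the toy orthography is hypothetical, and the Hawaiian table is the simplified five-vowel, eight-consonant inventory described above (vowel length is ignored).

```python
from collections import defaultdict


def reused_graphemes(orthography):
    """Given a phoneme -> grapheme map, return graphemes that spell more than one phoneme."""
    by_grapheme = defaultdict(list)
    for phoneme, grapheme in orthography.items():
        by_grapheme[grapheme].append(phoneme)
    return {g: ps for g, ps in by_grapheme.items() if len(ps) > 1}


def ambiguous_combinations(orthography):
    """Return letter combinations whose letters are ALL graphemes in their own right.

    Such a combination can be read either as one phoneme or as a sequence.
    """
    graphemes = set(orthography.values())
    return sorted(
        g for g in graphemes
        if len(g) > 1 and all(letter in graphemes for letter in g)
    )


# Hypothetical orthography for a language with /k/, /q/, /w/, /u/ and /kw/,
# where someone has unwisely chosen 'qu' for /kw/.
toy = {"k": "k", "q": "q", "w": "w", "u": "u", "kw": "qu"}

# Simplified Hawaiian inventory: five vowels, eight consonants.
# "\u0294" is the glottal stop, written with the okina "\u02bb".
hawaiian = {p: p for p in "aeiouhklmnpw"}
hawaiian["\u0294"] = "\u02bb"

print(reused_graphemes(toy))          # {}
print(ambiguous_combinations(toy))    # ['qu']
print(ambiguous_combinations(hawaiian))  # []
```

The second check is deliberately crude: it only flags a combination when every letter in it is also used on its own, which is exactly the 'qu' situation described above.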

Examples mentioned in the text (warning: these are all external links):

Hindi: http://www.omniglot.com/writing/hindi.htm

Urdu: http://www.omniglot.com/writing/urdu.htm

Serbian/Croatian/Bosnian: http://www.omniglot.com/writing/serbo-croat.htm

Turkish: http://www.omniglot.com/writing/turkish.htm

Ainu: http://www.omniglot.com/writing/ainu.htm (Japanese katakana-based transcription)

Taiwanese: http://en.wikipedia.org/wiki/Taiwanese_kana (Japanese katakana-based transcription)

Ubykh: http://www.omniglot.com/writing/ubykh.htm (an example of a language with a huge consonant inventory)

Hawai'ian: http://www.omniglot.com/writing/hawaiian.htm

Vietnamese: http://www.omniglot.com/writing/vietnamese.htm

Hmong: http://www.omniglot.com/writing/hmong.htm

Addendum (Feb. 8, 2009, 9:39 PM): Albert Herring brought up a good point: Hungarian breaks these patterns, by being an agglutinating language that uses both letter combinations and copious amounts of accent marks to map the Latin alphabet to its large phonemic inventory. To wit, it uses the acute accent to mark the long vowels á, é, í, ó and ú, the diaeresis to mark the short front rounded vowels ö and ü, and the double acute to mark the long front rounded vowels ő and ű. But it uses letter combinations for its consonant phonemes: sz, cs, zs and gy (among others).
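The three Hungarian accent marks are easy to see from the Unicode side, which is also why typesetting them is no longer much of a problem. Here is a small sketch using Python's standard unicodedata module; the helper name describe is made up for this example.

```python
import unicodedata


def describe(letter):
    """Split an accented letter into its base letter and the names of its combining marks."""
    decomposed = unicodedata.normalize("NFD", letter)
    base = decomposed[0]
    marks = [unicodedata.name(ch) for ch in decomposed[1:]]
    return base, marks


# a-acute, o-diaeresis and o-double-acute, written as escapes so the
# example does not depend on how this page itself is encoded.
for letter in ("\u00e1", "\u00f6", "\u0151"):
    print(letter, describe(letter))
```

Each vowel comes back as a plain base letter plus one combining mark: the acute, the diaeresis, or the double acute.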

Also, he mentioned that Finnish does not adhere perfectly to the 'one phoneme, one grapheme' rule. It uses doubled consonant letters to mark geminates, and doubled vowel letters to mark long vowels, as in ää and öö. Double vowels are, strictly speaking, not the same as long vowels, but I would still say this is a decent compromise for the Finns, whose language has a smallish phonemic inventory. Thank you for pointing this out!
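To make the Finnish convention concrete, here is a naive Python sketch that reads a doubled letter as one long sound. The function name is invented for the example, and it assumes that any run of identical letters is a single long segment, which ignores complications such as doubled letters that straddle a boundary inside a compound word.

```python
from itertools import groupby


def segments(word):
    """Collapse runs of identical letters into (letter, is_long) pairs."""
    return [(letter, len(list(run)) > 1) for letter, run in groupby(word)]


# tuli 'fire', tuuli 'wind', tulli 'customs': three words that differ
# only in which sound, if any, is long.
for word in ("tuli", "tuuli", "tulli"):
    print(word, segments(word))
```

The spelling stays almost as transparent as a strict one-letter-per-sound system, since length is the only thing the doubling encodes.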

And as always, all my great and brilliant thoughts about sociolinguistics have already occurred to someone else, someone who is probably a lot smarter than I am.
