character set - Everything2.com

A character set is a mapping between visual symbols representing letters of the alphabet, digits, punctuation, mathematical notation, etc. and numeric codes suitable for processing on a computer and storing on digital media.

In the pre-computer days of printing, the notions of character and typeface were combined in printers' fonts¹. A font comprised a series of blocks of molded metal, each of which had a face cut in the shape of a particular letter or symbol (hence typeface). These were arranged to form words, sentences, paragraphs, etc. The unused fonts were kept in cases, one case for small letters and a separate case for capital letters, hence the terms lower case and upper case.

With the advent of computers, it became apparent that representing words and symbols as numbers was extremely useful. The concept was already in use with cyphers in military intelligence. Using a machine to encrypt and decrypt secret documents was clearly more secure, faster and less error-prone than having humans do this by hand. The most famous machines were the Enigma machines used by the Germans in the Second World War.

Although cryptography is important, mapping symbols to numbers proves far more generally useful, because it is what makes representation and storage of text possible. We take it for granted that a sequence of numeric codes can represent a string of characters, i.e. a piece of text. One attribute of a character set is the number of bits used to store a character.

What is the minimum number of bits we need?

Clearly, representing the digits 0-9 requires 4 bits. We could store numbers as a sequence of decimal digits, using one of the remaining bit patterns to represent a '-' sign. This is called binary coded decimal or BCD.
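As a rough illustration of the idea, the sketch below stores two decimal digits in one 8-bit byte, four bits each. This "packed" layout and the function names `bcd_pack` and `bcd_unpack` are assumptions made for the example, not something any particular machine is claimed to have used.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two decimal digits (0-9 each) into one byte, with the tens digit
   in the upper 4 bits and the units digit in the lower 4 bits.
   The function names are illustrative only. */
uint8_t bcd_pack(unsigned tens, unsigned units)
{
    assert(tens <= 9 && units <= 9);
    return (uint8_t)((tens << 4) | units);
}

/* Recover the decimal value 0-99 from a packed BCD byte. */
unsigned bcd_unpack(uint8_t packed)
{
    return (packed >> 4) * 10u + (packed & 0x0Fu);
}
```

A handy side effect is that the hexadecimal form of a packed byte reads like the decimal number: 42 is stored as 0x42. The six 4-bit patterns from 1010 to 1111 are never produced by a digit, which is what leaves room for a sign.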

The 26 letters of the alphabet can be represented with 5 bits (32 combinations), but this does not give us enough distinct combinations to represent both the letters and the digits.
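The bit counts quoted so far all come from the same rule: n bits give 2^n combinations, so we need the smallest n for which 2^n is at least the number of symbols. A minimal sketch of that calculation follows; `bits_needed` is a name invented for this example.

```c
#include <assert.h>

/* Smallest number of bits whose 2^bits combinations cover `symbols`
   distinct symbols. `bits_needed` is a name chosen for this sketch. */
unsigned bits_needed(unsigned long symbols)
{
    assert(symbols > 0);
    unsigned bits = 0;
    /* Keep adding bits until 2^bits >= symbols. The bound of 31 keeps
       the shift within the guaranteed width of unsigned long. */
    while (bits < 31 && (1UL << bits) < symbols) {
        bits++;
    }
    return bits;
}
```

With this, `bits_needed(10)` gives 4 for the digits, `bits_needed(26)` gives 5 for the letters, and `bits_needed(36)` gives 6 for letters and digits together.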

One of the first commercial character sets was called SIXBIT, which naturally used 6 bits.


       L E A S T   S I G N I F I C A N T   B I T S
       000  001  010  011  100  101  110  111
M 000  Sp    !    "    #    $    %    &    '
O      (0)  (1)  (2)  (3)  (4)  (5)  (6)  (7)
S 001   (    )    *    +    ,    -    .    /
T      (8)  (9)  (10) (11) (12) (13) (14) (15)
  010   0    1    2    3    4    5    6    7
S      (16) (17) (18) (19) (20) (21) (22) (23)
I 011   8    9    :    ;    <    =    >    ?
G      (24) (25) (26) (27) (28) (29) (30) (31)
. 100   @    A    B    C    D    E    F    G
       (32) (33) (34) (35) (36) (37) (38) (39)
B 101   H    I    J    K    L    M    N    O
I      (40) (41) (42) (43) (44) (45) (46) (47)
T 110   P    Q    R    S    T    U    V    W
S      (48) (49) (50) (51) (52) (53) (54) (55)
  111   X    Y    Z    [    \    ]    ^    _
       (56) (57) (58) (59) (60) (61) (62) (63)

SIXBIT was used by Digital Equipment Corporation on the PDP-10 range of machines, which had a 36-bit word size. SIXBIT enabled 6 characters to be stored in a word. Later, they moved to 7-bit ASCII, storing 5 characters to a word and leaving the 36th bit spare.
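Comparing the table above with ASCII shows that each SIXBIT code is simply the ASCII code minus 32. The sketch below uses that to pack six characters into a 36-bit word, held here in the low bits of a 64-bit integer. The function names are hypothetical, and placing the first character in the most significant 6 bits is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Convert one ASCII character in the range 32..95 (space to '_') to its
   SIXBIT code, per the table above: the code is the ASCII value minus 32.
   Lower-case letters have no SIXBIT code and trip the assertion. */
unsigned sixbit_from_ascii(char c)
{
    assert(c >= 32 && c <= 95);
    return (unsigned)(c - 32);
}

/* Pack up to six characters into the low 36 bits of a 64-bit integer,
   first character in the most significant 6 bits. Short strings are
   padded on the right with spaces (code 0); extra characters are ignored. */
uint64_t sixbit_pack(const char *text)
{
    uint64_t word = 0;
    for (int i = 0; i < 6; i++) {
        unsigned code = 0;
        if (*text != '\0') {
            code = sixbit_from_ascii(*text++);
        }
        word = (word << 6) | code;
    }
    return word;
}
```

For example, `sixbit_from_ascii('A')` yields 33, matching the table, and packing six underscores (code 63 each) sets all 36 bits.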

ICL also used SIXBIT on their 1900 series, which had a 24 bit word size. They used code 60 to represent a Sterling pound sign (£), instead of a backslash.

All the early programming languages (Cobol, Fortran, Algol and others) could be programmed using SIXBIT. C would be impossible, since SIXBIT has neither lower-case letters nor braces, and Pascal would lose its comment braces.

ASCII

A consortium of computer manufacturers, the Teletype Corporation, and the American National Standards Institute defined a standard character set, which came to be known as ASCII - the American Standard Code for Information Interchange.

The character set contains a block of the 32 lowest numbered codes, which were originally reserved for implementing protocols; these are the control characters. There is one other non-printable character in 7 bit ASCII, code 127 (DEL), sent by the Teletype Rub-out key.
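In code, the split between control and printable characters described above comes down to one comparison. This is only a sketch, and `is_ascii_control` is a name made up for the example.

```c
#include <assert.h>

/* True for the non-printable codes of 7-bit ASCII: the 32 control
   characters (codes 0-31) and DEL (code 127). */
int is_ascii_control(int code)
{
    assert(code >= 0 && code <= 127);
    return code < 32 || code == 127;
}
```

That leaves 128 - 33 = 95 printable codes, from space (32) to tilde (126).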

IBM were not part of this consortium, and insisted on using their own character set called EBCDIC - Extended Binary Coded Decimal Interchange Code. EBCDIC was an 8-bit code, based on the 4-bit BCD representation for digits (see above), with the other characters placed in the remaining 8-bit combinations. Unlike ASCII, the alphabet does not form a single contiguous, in-order block of numeric codes.
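To make the gaps concrete, the sketch below maps a position in the alphabet to the EBCDIC code of the corresponding upper-case letter. The three ranges are the ones used by the common EBCDIC code pages as far as I know; the function name `ebcdic_upper` is invented for the example.

```c
#include <assert.h>

/* EBCDIC code for the upper-case letter with alphabet index 0 ('A')
   to 25 ('Z'). The letters sit in three separate runs:
   A-I at 0xC1-0xC9, J-R at 0xD1-0xD9 and S-Z at 0xE2-0xE9. */
unsigned ebcdic_upper(unsigned index)
{
    assert(index < 26);
    if (index < 9) {
        return 0xC1 + index;          /* A-I */
    }
    if (index < 18) {
        return 0xD1 + (index - 9);    /* J-R */
    }
    return 0xE2 + (index - 18);       /* S-Z */
}
```

The code for J (0xD1) is 8 more than the code for I (0xC9), not 1 more, so a loop that steps from 'A' to 'Z' by adding one to the character code visits several codes that are not letters at all.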

Internationalisation (I18n)

ASCII became ubiquitous throughout the Western world with the advent of the PC, but this brought its own issues of multinational variation. As distinct from USASCII, the first variant was UKASCII, which replaced code 35 (#) with a Sterling pound sign (£), but was in all other respects identical to USASCII. This is, I believe, the source of people referring to the hash, sharp or gate symbol as a pound.

With European languages, such as French and German, came the need for accented letters. It was undesirable to lose any of the symbols from the ASCII set, as extant programming languages were making use of the whole set, so the character set expanded to 8 bits.

With the highest bit off, the codes are straight USASCII. Turning the eighth bit on makes 128 additional character codes available. To allow for Greek, Cyrillic, Arabic, Turkish, etc. characters, together with mathematical symbols and wingdings, the notion was to have a number of distinct character sets, as it was thought unlikely that one would need to swap between them inside a document.

This has the serious drawback of requiring an external context (character-set) to decode a piece of text; if the wrong character set is used, the results will be a meaningless mess of hieroglyphics. The HTTP and MIME standards allow for this meta information to be passed, allowing the viewing of international documents with a browser. However, many email clients do not implement this properly.
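The following sketch shows why the context matters: the same byte decodes to a different letter under each character set. The three ISO 8859 sets were picked for the example, only the bytes 0xE1 to 0xEF are handled (a range where all three map by a fixed offset), and the names `charset` and `decode_high_byte` are hypothetical.

```c
#include <assert.h>

/* The three 8-bit character sets handled by this sketch. */
enum charset { ISO_8859_1, ISO_8859_5, ISO_8859_7 };

/* Unicode code point for a byte in the range 0xE1..0xEF under the
   given character set. In this range each set maps by a fixed offset. */
unsigned decode_high_byte(unsigned char byte, enum charset cs)
{
    assert(byte >= 0xE1 && byte <= 0xEF);
    switch (cs) {
    case ISO_8859_1: return byte;            /* accented Latin small letters */
    case ISO_8859_5: return byte + 0x360u;   /* Cyrillic small letters */
    case ISO_8859_7: return byte + 0x2D0u;   /* Greek small letters */
    }
    return 0xFFFDu; /* replacement character; not reached for valid input */
}
```

The single byte 0xE9 comes out as é (U+00E9) in ISO 8859-1, щ (U+0449) in ISO 8859-5 and ι (U+03B9) in ISO 8859-7. Nothing in the byte itself says which of the three was intended.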

Also, the concept of locales (a group of settings where you define your country, language, character set, time zone, date and time format, currency, etc.) has highlighted a common coding practice that breaks I18n compliance. Much code has been written which relies on the alphabet being a contiguous block of 26 characters. This will fail in international locales. For example:


C:

int isLetter(char sym) {
    return (((sym >= 'A') && (sym <= 'Z')) ||
            ((sym >= 'a') && (sym <= 'z')));
}

Perl:

/[A-Za-z][a-z]*/;   # Match a word
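A more portable version of the C test above can lean on the standard library's `isalpha`, which consults the current locale's classification tables instead of assuming the letters are contiguous. This is a sketch for single-byte character sets only; `is_letter` and `count_letters` are names chosen for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Locale-aware letter test for a single-byte character. */
int is_letter(char sym)
{
    /* The cast avoids undefined behaviour when char is signed and sym
       holds a code above 127. */
    return isalpha((unsigned char)sym) != 0;
}

/* Count the letters in a NUL-terminated string. */
size_t count_letters(const char *text)
{
    assert(text != NULL);
    size_t count = 0;
    for (; *text != '\0'; text++) {
        if (is_letter(*text)) {
            count++;
        }
    }
    return count;
}
```

Note that `isalpha` only reflects the user's settings if the program has called `setlocale(LC_CTYPE, "")` beforehand. Without that call the default "C" locale applies, in which only A-Z and a-z count as letters.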

Enter Unicode

It is not possible to represent the thousands of characters of pictographic scripts such as Chinese and Japanese in 8 bits (256 combinations), although phonetic subsets are possible, such as the Katakana character set. However, 16 bits (65536 combinations) should be sufficient to accommodate all the international symbols in the same character set.

The Unicode standard attempts to define this, again using the first 128 codes (0-127) as USASCII. This has highlighted more issues with the many existing programs that rely on 1 byte = 1 character.
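The sketch below illustrates the mismatch for text stored as 16-bit codes: the number of characters and the number of bytes are no longer the same. It assumes one 16-bit unit per character, terminated by a zero code, and the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit characters before the terminating zero code. */
size_t u16_length(const uint16_t *text)
{
    assert(text != NULL);
    size_t length = 0;
    while (text[length] != 0) {
        length++;
    }
    return length;
}

/* Storage in bytes for the same text, excluding the terminator. */
size_t u16_byte_size(const uint16_t *text)
{
    return u16_length(text) * sizeof(uint16_t);
}
```

For the three characters A (0x0041), é (0x00E9) and 中 (0x4E2D), `u16_length` reports 3 while `u16_byte_size` reports 6 on a machine with 8-bit bytes. Code that allocates buffers or indexes into text using a byte count will therefore be off by a factor of two.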

Will 16 bits be enough for a universal character set?

This appears to meet current requirements. However, just like the allocation of telephone numbers, future uses are not always designed in at the time standards are created.

Will 16 bits be enough? Time will tell.

Notes

¹DoctorX's writeup font goes into more detail on the etymology and changes in usage of this word.
