Base64 is a content-transfer-encoding method that represents sequences of octets using a small subset of printable characters. It was initially developed to allow binaries to slip through mailers unharmed.
Unlike the characters used by uuencode or base85 encoding (used by Level 2 PostScript), the characters used by base64 are represented identically in ISO 646 (the 7-bit ASCII standard) and all versions of EBCDIC. These characters (and their indices, referenced below) are:
Value Encoding  Value Encoding  Value Encoding  Value Encoding
    0 A            17 R            34 i            51 z
    1 B            18 S            35 j            52 0
    2 C            19 T            36 k            53 1
    3 D            20 U            37 l            54 2
    4 E            21 V            38 m            55 3
    5 F            22 W            39 n            56 4
    6 G            23 X            40 o            57 5
    7 H            24 Y            41 p            58 6
    8 I            25 Z            42 q            59 7
    9 J            26 a            43 r            60 8
   10 K            27 b            44 s            61 9
   11 L            28 c            45 t            62 +
   12 M            29 d            46 u            63 /
   13 N            30 e            47 v
   14 O            31 f            48 w         (pad) =
   15 P            32 g            49 x
   16 Q            33 h            50 y
Table lifted from RFC 1521.
Note the 65th character, =, is used for padding at the end of the data.
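For anyone who would rather not type the table in by hand, here is a minimal sketch of how it could be built in Python. The names ALPHABET, PAD and INDEX are made up for this illustration; they are not taken from the RFC or from any library.

```python
import string

# The 64 characters in index order: A-Z (0-25), a-z (26-51), 0-9 (52-61), '+' (62), '/' (63).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# The 65th character, used only for padding at the end of the data.
PAD = "="

# Reverse lookup (character -> index), which is what a decoder would need.
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

print(ALPHABET[11], ALPHABET[29], INDEX["I"], INDEX["i"])
```

The forward direction is just string indexing; the dictionary is only there to show that the mapping is one-to-one.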
Using 64 characters allows us to represent 6 bits per printable character (2^6 = 64). To encode our stream of data, 24 bits (3 octets) are pulled off at a time and split into four 6-bit chunks (I'll use wharfinger's example from his uuencode writeup in order to illustrate the parallel):
00101101 11010010 00100010
Becomes
001011 011101 001000 100010
In decimal:
11 29 8 34
Which (using the above table) would be represented as the characters:
L d I i
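The same three octets can be pushed through a few lines of Python to check the arithmetic. This is only a sketch of the bit-splitting step for one complete 24-bit quantum; encode_quantum is a hypothetical helper name, and padding is deliberately not handled here.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_quantum(b0, b1, b2):
    """Encode exactly three octets as four base64 characters."""
    # Pack the three octets into a single 24-bit integer.
    n = (b0 << 16) | (b1 << 8) | b2
    # Peel off four 6-bit groups, most significant group first.
    indices = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)

print(encode_quantum(0b00101101, 0b11010010, 0b00100010))  # LdIi
```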
An astute reader will wonder what happens when fewer than 24 bits are available at the end of the data. Well, Astute Reader, remember that the input must be an integral number of octets, so there are only three possibilities:
- The final quantum is made up of 3 octets (24 bits) -- no padding is necessary.
- The final quantum is made up of 2 octets (16 bits) -- the encoded output will consist of three characters followed by one '=' to indicate the amount of padding added. This is done as follows:
00101101 11010010
Is split into:
001011 011101 001000
                  ^^
Note that the last group must be padded with two zero bits to reach six bits. In base64 output, the above 16 bits would be represented by 'LdI='.
- The final quantum is made up of 1 octet (8 bits) -- the encoded output will consist of two characters followed by two '=' characters:
00101101
Becomes:
001011 010000
         ^^^^
Again, the last group must be padded to six bits, this time with four zero bits. In base64 output, the above 8 bits would be represented by 'LQ=='.
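The three cases above can be folded into one loop. What follows is a rough sketch rather than a reference implementation: b64encode_sketch is a made-up name, and the sketch ignores the line-length limits that mail transport imposes on encoded output. Python's standard base64 module is used only to cross-check the results.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64encode_sketch(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 octets short of a full quantum
        # Fill the quantum out to 24 bits with zero bits.
        n = int.from_bytes(chunk + b"\x00" * missing, "big")
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Trailing characters that would carry only fill bits become '='.
        if missing:
            chars[4 - missing:] = ["="] * missing
        out.append("".join(chars))
    return "".join(out)

print(b64encode_sketch(bytes([0b00101101, 0b11010010])))  # LdI=
print(b64encode_sketch(bytes([0b00101101])))              # LQ==

# Cross-check against the standard library on an arbitrary input.
sample = b"any carnal pleasure"
assert b64encode_sketch(sample) == base64.b64encode(sample).decode("ascii")
```

Filling with zero octets first and then overwriting the tail keeps the bit-splitting identical for all three cases, at the cost of computing one or two characters that are thrown away.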
Note that 24 bits are represented by 32 bits in plain text (4 characters * 8 bits/character) -- this translates to approximately a 33% increase in file size.
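To put numbers on that, here is a small sketch that computes the encoded length for a given input size, assuming one 8-bit byte per output character and no line breaks in the output. encoded_length is a hypothetical helper, not a library function.

```python
def encoded_length(n_octets: int) -> int:
    """Number of base64 characters, padding included, for n_octets of input."""
    # Every group of up to 3 input octets becomes exactly 4 output characters.
    return 4 * ((n_octets + 2) // 3)

for n in (1, 2, 3, 300, 1_000_000):
    size = encoded_length(n)
    print(f"{n} octets -> {size} characters (ratio {size / n:.3f})")
```

For very short inputs the padding dominates (a single octet becomes four characters); the ratio settles toward 4/3 as the input grows.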