Let's see why.
Suppose we (Bob) want to send to Alice over the Atlantic the super
secret PHP Manual which is 4.675.494 bytes long. We have the
freeware gpg which we will use for encryption. It implements most of
the most known encryption algorithms including the new AES (Rijndael). It is obvious that files that are human
readable have a great amount of redundancy, since human language is
highly redundant (that's why you can sometimes hear half a sentence
and fully understand it). Redundancy is easy to find out, using a
compression tool, like bzip2. The higher the compression factor,
the higher the redundancy. Let's see the manual's byte-frequency
analysis:
dogganos@postmortem:~/temp/testbed$analyse_file
PHP_Manual.html 1
data size is 4675494 bytes.
9 -> 0.025 %
10 -> 6.309 % ######
32 -> 10.766 % ##########
34 -> 2.801 % ##
35 -> 0.622 %
36 -> 0.168 %
37 -> 0.050 %
38 -> 0.755 %
39 -> 0.074 %
40 -> 0.414 %
41 -> 0.414 %
42 -> 0.020 %
43 -> 0.012 %
44 -> 0.399 %
45 -> 0.467 %
46 -> 0.829 %
47 -> 2.158 % ##
48 -> 0.477 %
49 -> 0.592 %
50 -> 0.431 %
51 -> 0.344 %
52 -> 0.176 %
53 -> 0.076 %
54 -> 0.231 %
55 -> 0.053 %
56 -> 0.108 %
57 -> 0.055 %
58 -> 0.079 %
59 -> 0.876 %
60 -> 4.077 % ####
61 -> 1.422 % #
62 -> 4.077 % ####
63 -> 0.025 %
65 -> 2.003 % ##
66 -> 0.760 %
67 -> 1.081 % #
68 -> 1.412 % #
69 -> 1.323 % #
70 -> 0.434 %
71 -> 0.255 %
72 -> 0.605 %
73 -> 1.079 % #
75 -> 0.024 %
76 -> 1.347 % #
77 -> 0.315 %
78 -> 0.580 %
79 -> 0.404 %
80 -> 1.280 % #
81 -> 0.034 %
82 -> 0.659 %
83 -> 1.494 % #
84 -> 1.543 % #
85 -> 0.130 %
86 -> 0.473 %
87 -> 0.091 %
88 -> 0.046 %
89 -> 0.026 %
91 -> 0.038 %
92 -> 0.030 %
93 -> 0.039 %
95 -> 0.562 %
97 -> 2.814 % ##
98 -> 0.966 %
99 -> 1.924 % #
100 -> 1.268 % #
101 -> 4.817 % ####
102 -> 1.528 % #
103 -> 0.728 %
104 -> 1.158 % #
105 -> 3.058 % ###
106 -> 0.052 %
107 -> 0.168 %
108 -> 1.647 % #
109 -> 1.106 % #
110 -> 3.887 % ###
111 -> 2.875 % ##
112 -> 1.530 % #
113 -> 0.119 %
114 -> 2.790 % ##
115 -> 3.034 % ###
116 -> 3.860 % ###
117 -> 1.481 % #
118 -> 0.351 %
119 -> 0.446 %
120 -> 0.247 %
121 -> 0.571 %
122 -> 0.054 %
123 -> 0.020 %
125 -> 0.020 %
Standard deviation of byte frequencies is 0.010914
dogganos@postmortem:~/temp/testbed$
From the 256 bytes, just 89 are used in the
manual. That's because in a human readable document, only the human
readable characters are used, and that's the alphabet, the numbers,
space, carriage return, etc. All the rest of the bytes have a
zero frequency. That's why the frequencies have a relatively (as we
will see below), high standard deviation, because they differ much
from each other.
Now, let's encrypt the manual as is,
uncompressed:
dogganos@postmortem:~/temp/testbed$gpg -z 0 -c PHP_Manual.html
dogganos@postmortem:~/temp/testbed$l
total 9516
-rwx------ 1 dogganos dogganos 4675494 Oct 4 2001
PHP_Manual.html
-rw-r--r-- 1 dogganos dogganos 4676722 Dec 21 21:48
PHP_Manual.html.gpg
dogganos@postmortem:~/temp/testbed$
The "-z 0" switch instructs gpg not to use compression. The
resulting file PHP_Manual.html.gpg is thus slightly bigger. But
let's analyse it:
dogganos@postmortem:~/temp/testbed$analyse_file
PHP_Manual.html.gpg
data size is 4676722 bytes.
0 -> 0.389 %
1 -> 0.392 %
2 -> 0.390 %
3 -> 0.390 %
4 -> 0.391 %
5 -> 0.389 %
6 -> 0.394 %
7 -> 0.395 %
8 -> 0.389 %
9 -> 0.390 %
10 -> 0.393 %
11 -> 0.388 %
12 -> 0.391 %
13 -> 0.387 %
14 -> 0.389 %
15 -> 0.392 %
16 -> 0.390 %
17 -> 0.389 %
18 -> 0.391 %
19 -> 0.389 %
20 -> 0.391 %
21 -> 0.386 %
22 -> 0.384 %
23 -> 0.395 %
24 -> 0.389 %
25 -> 0.393 %
26 -> 0.389 %
27 -> 0.393 %
28 -> 0.393 %
29 -> 0.392 %
30 -> 0.395 %
31 -> 0.395 %
32 -> 0.391 %
33 -> 0.389 %
34 -> 0.390 %
35 -> 0.395 %
36 -> 0.394 %
37 -> 0.390 %
38 -> 0.389 %
39 -> 0.385 %
40 -> 0.393 %
41 -> 0.390 %
42 -> 0.393 %
43 -> 0.391 %
44 -> 0.389 %
45 -> 0.392 %
46 -> 0.395 %
47 -> 0.389 %
48 -> 0.388 %
49 -> 0.389 %
50 -> 0.392 %
51 -> 0.389 %
52 -> 0.391 %
53 -> 0.399 %
54 -> 0.401 %
55 -> 0.391 %
56 -> 0.387 %
57 -> 0.394 %
58 -> 0.388 %
59 -> 0.390 %
60 -> 0.391 %
61 -> 0.392 %
62 -> 0.398 %
63 -> 0.387 %
64 -> 0.393 %
65 -> 0.388 %
66 -> 0.391 %
67 -> 0.391 %
68 -> 0.384 %
69 -> 0.384 %
70 -> 0.385 %
71 -> 0.390 %
72 -> 0.387 %
73 -> 0.395 %
74 -> 0.390 %
75 -> 0.391 %
76 -> 0.393 %
77 -> 0.391 %
78 -> 0.395 %
79 -> 0.392 %
80 -> 0.389 %
81 -> 0.393 %
82 -> 0.389 %
83 -> 0.388 %
84 -> 0.390 %
85 -> 0.385 %
86 -> 0.395 %
87 -> 0.389 %
88 -> 0.387 %
89 -> 0.387 %
90 -> 0.386 %
91 -> 0.390 %
92 -> 0.391 %
93 -> 0.382 %
94 -> 0.388 %
95 -> 0.390 %
96 -> 0.389 %
97 -> 0.393 %
98 -> 0.389 %
99 -> 0.386 %
100 -> 0.386 %
101 -> 0.394 %
102 -> 0.388 %
103 -> 0.386 %
104 -> 0.393 %
105 -> 0.389 %
106 -> 0.395 %
107 -> 0.390 %
108 -> 0.392 %
109 -> 0.391 %
110 -> 0.393 %
111 -> 0.392 %
112 -> 0.392 %
113 -> 0.391 %
114 -> 0.385 %
115 -> 0.398 %
116 -> 0.388 %
117 -> 0.387 %
118 -> 0.389 %
119 -> 0.392 %
120 -> 0.395 %
121 -> 0.392 %
122 -> 0.395 %
123 -> 0.392 %
124 -> 0.390 %
125 -> 0.389 %
126 -> 0.388 %
127 -> 0.387 %
128 -> 0.390 %
129 -> 0.394 %
130 -> 0.391 %
131 -> 0.395 %
132 -> 0.386 %
133 -> 0.397 %
134 -> 0.390 %
135 -> 0.389 %
136 -> 0.391 %
137 -> 0.395 %
138 -> 0.393 %
139 -> 0.396 %
140 -> 0.389 %
141 -> 0.393 %
142 -> 0.388 %
143 -> 0.391 %
144 -> 0.389 %
145 -> 0.386 %
146 -> 0.390 %
147 -> 0.390 %
148 -> 0.390 %
149 -> 0.386 %
150 -> 0.386 %
151 -> 0.391 %
152 -> 0.391 %
153 -> 0.391 %
154 -> 0.391 %
155 -> 0.396 %
156 -> 0.395 %
157 -> 0.396 %
158 -> 0.390 %
159 -> 0.390 %
160 -> 0.390 %
161 -> 0.397 %
162 -> 0.389 %
163 -> 0.388 %
164 -> 0.390 %
165 -> 0.394 %
166 -> 0.393 %
167 -> 0.391 %
168 -> 0.390 %
169 -> 0.389 %
170 -> 0.390 %
171 -> 0.391 %
172 -> 0.386 %
173 -> 0.389 %
174 -> 0.386 %
175 -> 0.396 %
176 -> 0.387 %
177 -> 0.391 %
178 -> 0.393 %
179 -> 0.396 %
180 -> 0.392 %
181 -> 0.393 %
182 -> 0.385 %
183 -> 0.389 %
184 -> 0.391 %
185 -> 0.387 %
186 -> 0.391 %
187 -> 0.390 %
188 -> 0.391 %
189 -> 0.387 %
190 -> 0.390 %
191 -> 0.393 %
192 -> 0.389 %
193 -> 0.394 %
194 -> 0.387 %
195 -> 0.389 %
196 -> 0.389 %
197 -> 0.393 %
198 -> 0.388 %
199 -> 0.394 %
200 -> 0.390 %
201 -> 0.388 %
202 -> 0.392 %
203 -> 0.396 %
204 -> 0.389 %
205 -> 0.388 %
206 -> 0.396 %
207 -> 0.385 %
208 -> 0.392 %
209 -> 0.392 %
210 -> 0.392 %
211 -> 0.390 %
212 -> 0.386 %
213 -> 0.396 %
214 -> 0.393 %
215 -> 0.388 %
216 -> 0.390 %
217 -> 0.392 %
218 -> 0.393 %
219 -> 0.389 %
220 -> 0.392 %
221 -> 0.391 %
222 -> 0.392 %
223 -> 0.388 %
224 -> 0.389 %
225 -> 0.391 %
226 -> 0.389 %
227 -> 0.383 %
228 -> 0.385 %
229 -> 0.391 %
230 -> 0.387 %
231 -> 0.395 %
232 -> 0.388 %
233 -> 0.389 %
234 -> 0.393 %
235 -> 0.394 %
236 -> 0.389 %
237 -> 0.405 %
238 -> 0.391 %
239 -> 0.395 %
240 -> 0.397 %
241 -> 0.392 %
242 -> 0.391 %
243 -> 0.393 %
244 -> 0.393 %
245 -> 0.394 %
246 -> 0.384 %
247 -> 0.391 %
248 -> 0.389 %
249 -> 0.389 %
250 -> 0.390 %
251 -> 0.392 %
252 -> 0.391 %
253 -> 0.388 %
254 -> 0.388 %
255 -> 0.389 %
Standard deviation of byte frequencies is 0.000032
dogganos@postmortem:~/temp/testbed$
We now see that all bytes are encountered in the
encrypted file, at about the same frequency: 0.39 %, and in addition,
the standard deviation is now minimal. That is because a good
encryption algorithm produces a byte stream where each byte has a
probability of 1/256 (~= 0.0039). This makes sense, because
cryptanalysis exactly exploits redundancies in the plaintext (or
data, or whatever). That the file looks like random data can also
be verified by trying to compress it:
dogganos@postmortem:~/temp/testbed$bzip2 -k
PHP_Manual.html.gpg
dogganos@postmortem:~/temp/testbed$l
total 14120
-rwx------ 1 dogganos dogganos 4675494 Oct 4 2001
PHP_Manual.html
-rw-r--r-- 1 dogganos dogganos 4676722 Dec 21 21:48
PHP_Manual.html.gpg
-rw-r--r-- 1 dogganos dogganos 4698154 Dec 21 21:48
PHP_Manual.html.gpg.bz2
dogganos@postmortem:~/temp/testbed$
The resulting "compressed" file now became even
bigger! That is good for the encryption algorithm, because it means
that the data it encrypts have no redundancy. As Bruce Schneier says
in his Applied Cryptography, "This makes a reasonable test of an
encryption algorithm; if the ciphertext can be compressed, then the
algorithm probably isn't very good".
Let's try now the opposite. First compress and then encrypt.
dogganos@postmortem:~/temp/testbed$bzip2 -k
PHP_Manual.html
dogganos@postmortem:~/temp/testbed$l
total 14692
-rwx------ 1 dogganos dogganos 4675494 Oct 4 2001
PHP_Manual.html
-rwx------ 1 dogganos dogganos 581281 Oct 4 2001
PHP_Manual.html.bz2
-rw-r--r-- 1 dogganos dogganos 4676722 Dec 21 21:48
PHP_Manual.html.gpg
-rw-r--r-- 1 dogganos dogganos 4698154 Dec 21 21:48
PHP_Manual.html.gpg.bz2
dogganos@postmortem:~/temp/testbed$gpg -c -z 0 PHP_Manual.html.bz2
dogganos@postmortem:~/temp/testbed$l
total 15264
-rwx------ 1 dogganos dogganos 4675494 Oct 4 2001
PHP_Manual.html
-rwx------ 1 dogganos dogganos 581281 Oct 4 2001
PHP_Manual.html.bz2
-rw-r--r-- 1 dogganos dogganos 581515 Dec 23 21:28
PHP_Manual.html.bz2.gpg
-rw-r--r-- 1 dogganos dogganos 4676722 Dec 21 21:48
PHP_Manual.html.gpg
-rw-r--r-- 1 dogganos dogganos 4698154 Dec 21 21:48
PHP_Manual.html.gpg.bz2
dogganos@postmortem:~/temp/testbed$
Aha! The compression really squeezed it! Now, the
resulting encrypted file PHP_Manual.html.bz2.gpg gives a wanna-be
cryptologist far less stuff to work with, and also saves us time/space,
as it is far less time consuming to transfer/store 500 KB, than 4 MB.
The non-compressibility of random data (like ciphertext) can also be used in the following way: to detect encryption. Suppose that we snoop all packets that Eve sends to Bob, but since they have frequent correspondence, we want only to catch what is really interesting. So we suppose that they encrypt everything that they send to each other and is really interesting (say, 'We will meet at 00:45 behind the old train-station'). In that case, what we can do, is try to compress messages that we snoop, and gather the messages that do not compress significantly. They are probably encrypted*, thus interesting. We break the encryption(...) et voila!
Although, says Schneier again, it is always possible to specifically produce compressible ciphertext**... Ahhh, these mathematicians...
* Actually, these messages that do not compress significantly may also be already compressed. So, before trying to cryptanalyse them, we should try to decompress them with a variety of known compression algorithms (bzip2, zip, rar etc).
** randombit noted that an easy way to create compressible ciphertext is to take some redundant data, decrypt it with a key, and then use it as the plaintext (with the same key).