The exact order of frequency given for English varies slightly from source to source. The one I memorized as a child was ETAON RISHDLU CMFGY PWBV KXJQZ, an easily pronounceable mnemonic (honestly), but it would be easier to remember the start, and therefore the most important ones, as ETAOIN SHRDLU, because this nonsense text is what you occasionally used to see in newspapers. Compositors had their trays (the upper case and the lower case) laid out in this order for ease of reach, and when they needed to space out a line they took one of each. If they forgot to remove it before printing, you saw it like this.
etaoinshrdlu
thbz's node Letter frequency in several languages has it starting ETAONI HSRDLU, and gives actual percentages. Clearly E is out in front with 12.4%, T a clear second at 8.9%, and the rest behind. I once did the count by hand, but these days we have computers and e-texts to feed into them.
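Feeding an e-text to a computer takes only a few lines. What follows is a minimal Python sketch, not the method behind the percentages quoted above: the function name letter_frequencies and the sample sentence are my own inventions, and it counts only the unaccented letters A to Z.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Return (letter, percent) pairs for A-Z, commonest first."""
    # Fold to upper case and ignore everything that is not A-Z.
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        return []
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(letter, 100.0 * n / total) for letter, n in ranked]

if __name__ == "__main__":
    # Hypothetical sample; a real count wants a much longer text.
    sample = "The quick brown fox jumps over the lazy dog and then the cat."
    for letter, pct in letter_frequencies(sample)[:5]:
        print(f"{letter} {pct:.1f}%")
```

A single sentence will not reproduce the published order; the ranking only settles down over a reasonably long text.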
There is not much that could alter frequencies; these apply more or less over any kind of text. Scientific texts might use Y a bit more in terms taken from Greek, but on the other hand won't contain the common Y in happy, muddy. Z is extremely rare whether you write realise or realize, so that doesn't make much difference.
I haven't got figures for frequency of pairs, but thbz's node gives some URLs where they might be found. In English TH and WH are very common. (H and W are rare in most languages.) IS and AT occur in this and that as well as separately. The commonest initial letters are TAOSW. That's not according to how thick their section of the dictionary is, but from the constant repetition of TO, THE, AS, AND, AT, ON, OR, WHO, WHEN etc.
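Lacking published figures, you can count pairs and initial letters yourself. This is a rough Python sketch under my own assumptions: the name pair_and_initial_counts is hypothetical, pairs are counted only inside words (not across a space), and anything that is not A to Z splits a word, so an apostrophe cuts DON'T into DON and T.

```python
from collections import Counter
import re

def pair_and_initial_counts(text):
    """Count letter pairs inside words, and the first letter of each word."""
    # A "word" here is any unbroken run of the letters A-Z.
    words = re.findall(r"[A-Z]+", text.upper())
    pairs = Counter()
    initials = Counter()
    for word in words:
        initials[word[0]] += 1
        # zip pairs each letter with its successor: THE -> TH, HE.
        for a, b in zip(word, word[1:]):
            pairs[a + b] += 1
    return pairs, initials

if __name__ == "__main__":
    pairs, initials = pair_and_initial_counts("The cat that sat on the mat.")
    print(pairs.most_common(3))
    print(initials.most_common(3))
```

Because every occurrence of a word is counted, constant repetition of short words drives the initial-letter totals, which is the point made above about TO, THE, AS and the rest.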
You begin breaking a substitution cipher by frequency: assume the commonest letter is E and the next-commonest is T, then apply the English-language patterns above to see if you can find words. With luck, and if it's monoalphabetic, the job is done: it'll all fall into place from here. I once impressed a cocky schoolfellow by cracking the pattern ABC DEFF GHIHJ KHL EL in no time flat. Even a text that short followed the frequencies.
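That opening move can be mechanised. The Python sketch below is only a first guess, not a solver: it ranks the ciphertext letters by count and lines them up against the ETAON RISHDLU order quoted at the top. The names ENGLISH_ORDER, first_guess_key and apply_key are my own, and on a short text most of the guesses beyond the first two or three will be wrong and need correcting by eye.

```python
from collections import Counter

# The order memorized in the text above; other sources differ slightly.
ENGLISH_ORDER = "ETAONRISHDLUCMFGYPWBVKXJQZ"

def first_guess_key(ciphertext):
    """Map each cipher letter to a plaintext guess by frequency rank."""
    counts = Counter(c for c in ciphertext.upper() if "A" <= c <= "Z")
    # Commonest first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return {cipher: plain for cipher, plain in zip(ranked, ENGLISH_ORDER)}

def apply_key(ciphertext, key):
    """Substitute known letters; leave spaces and unknown letters alone."""
    return "".join(key.get(c, c) for c in ciphertext.upper())

if __name__ == "__main__":
    # Made-up ciphertext, purely to show the mechanics.
    message = "XXX YY Z"
    key = first_guess_key(message)
    print(key)
    print(apply_key(message, key))
```

The useful part is the loop it starts: apply the guess, look for word shapes, swap the letters that are obviously wrong, and apply again.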
This easy example preserved the spacing, which makes it a dead giveaway: first, there are only two one-letter words in English, A and I; then if you've guessed at E or T by frequency and that gives a two-letter word ?E or ?T, you can try BE, ME, IT, and AT; a three-letter word T?? is almost certainly THE, which will give you an H to find THIS, THAT, WHO; and so on.
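When spacing is preserved, the shape of each word does much of the work. As a sketch in Python: relabel the letters of a word in order of first appearance, so that the cipher word DEFF and the plain word WILL both become ABCC, then keep only dictionary words of the same shape. The names word_pattern and candidates are hypothetical, and the five-word list is a stand-in for a real dictionary.

```python
def word_pattern(word):
    """Relabel letters in order of first appearance: GHIHJ -> ABCBD."""
    labels = {}
    out = []
    for ch in word.upper():
        if ch not in labels:
            # The next unused label: A, then B, then C, and so on.
            labels[ch] = chr(ord("A") + len(labels))
        out.append(labels[ch])
    return "".join(out)

def candidates(cipher_word, word_list):
    """Return the words whose repeated-letter shape matches the cipher word."""
    target = word_pattern(cipher_word)
    return [w for w in word_list if word_pattern(w) == target]

if __name__ == "__main__":
    # Tiny illustrative list; substitute a proper word list in practice.
    words = ["BALL", "THAT", "WILL", "TREE", "THE"]
    print(word_pattern("GHIHJ"))
    print(candidates("DEFF", words))
```

Shape alone leaves several candidates; it is the letters shared between words, plus the frequencies, that narrow them to one.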
Of course only the rankest amateur would leave the spacing as it was, or use apostrophes, so that ABC'DD gives you 'LL straight away. But the same principles make words leap out of unspaced text: TH and WH might come from HEALTHY or GET HER or SAW HIM, but are far more likely to be initial. The commonest three-letter sequence will translate as THE, which is itself the single commonest word but also occurs in such common words as THEN, THERE, OTHER.
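Hunting for the commonest three-letter sequence in unspaced text is easy to automate. A minimal Python sketch, assuming only the letters A to Z matter; the function name commonest_ngrams and the sample string are mine. It is run on plaintext here just to show the counting; on ciphertext the top trigram is your candidate for THE.

```python
from collections import Counter

def commonest_ngrams(text, n=3, top=5):
    """Return the most frequent n-letter sequences, ignoring non-letters."""
    # Strip spaces and punctuation so sequences run across word boundaries.
    letters = "".join(c for c in text.upper() if "A" <= c <= "Z")
    # Slide a window of width n along the text, one letter at a time.
    grams = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    return grams.most_common(top)

if __name__ == "__main__":
    sample = "THECATANDTHEDOGANDTHEBIRD"
    print(commonest_ngrams(sample, n=3, top=1))  # -> [('THE', 3)]
```

Setting n=2 gives the pair counts for unspaced text in the same way.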
The same principles apply to polyalphabetic ciphers, with increased difficulty. A simple polyalphabetic substitution cipher occurs at the end of Red Shift by Alan Garner: preservation of punctuation and spacing, plus an easy guess at the opening three-word sentence, makes it easy to crack, but simple frequency analysis no longer works. For a more general solution see Breaking the Vigenère Square.
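To see why a plain letter count stops working, here is a short Python sketch of a Vigenère-style cipher, used as a generic stand-in for a polyalphabetic cipher; I am not claiming it is the scheme used in Red Shift. The function name vigenere_encrypt is my own, and the key is assumed to be a non-empty string of the letters A to Z.

```python
from collections import Counter

def vigenere_encrypt(plaintext, key):
    """Shift each letter by the matching key letter (A=0, B=1, ...)."""
    out = []
    i = 0  # counts letters only, so spaces do not use up key letters
    for ch in plaintext.upper():
        if "A" <= ch <= "Z":
            shift = ord(key[i % len(key)].upper()) - ord("A")
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
            i += 1
        else:
            # Punctuation and spacing pass through unchanged.
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    cipher = vigenere_encrypt("EEEE", "AB")
    print(cipher)           # -> EFEF
    print(Counter(cipher))  # the single peak for E is split in two
```

With a two-letter key the one tall peak for E becomes two shorter ones, and a longer key flattens the count further, so the commonest cipher letter no longer stands for E.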
Thanks to Gethsemane for spotting the deliberate error ahem.