frequency analysis - Everything2.com

by Xamot

Mon Jan 24 2000 at 15:33:26

Used in cryptanalysis to identify possible values for the letters, icons, or pictures in an encrypted text when a substitution cipher is used. The idea is that the most common ciphertext character probably stands for one of the most common letters in the suspected plaintext language.
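As a rough sketch of that first step, the Python below ranks the symbols of a ciphertext by how often they occur. The function name symbol_ranking and the sample ciphertext (a made-up message with every letter shifted by three) are illustrative assumptions, not part of any particular tool.

```python
from collections import Counter

def symbol_ranking(ciphertext):
    """Return the ciphertext's letters ordered from most to least frequent."""
    counts = Counter(ch for ch in ciphertext.upper() if ch.isalpha())
    # most_common() sorts by count, highest first
    return [symbol for symbol, _ in counts.most_common()]

# Hypothetical ciphertext: "MEET ME AT THE TREE" with each letter shifted by 3
ciphertext = "PHHW PH DW WKH WUHH"
print(symbol_ranking(ciphertext)[:2])  # ['H', 'W'], standing for E and T
```

On a message this short the top two ranks happen to line up with E and T; longer texts make the lower ranks more trustworthy too.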

Take this a step further and you can analyze how letters relate to each other (which letters can be found next to a given letter in the language, and how often they are found together). Using this information you can make better guesses at which characters represent which letters by how sociable they are with other characters. For example, if a character is found next to many different ones it is probably a vowel, because vowels are very sociable letters. If the character is found next to a limited number of vowels or in limited combinations, then it is probably a consonant.
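One way to put a number on "sociability" is to count how many distinct letters each character touches. The sketch below does that within words only; the name neighbour_variety and the sample sentence are assumptions made for illustration.

```python
from collections import defaultdict

def neighbour_variety(text):
    """Map each letter to the number of distinct letters found directly beside it."""
    neighbours = defaultdict(set)
    for word in text.upper().split():
        letters = [ch for ch in word if ch.isalpha()]
        # Walk over adjacent pairs and record each letter as the other's neighbour
        for left, right in zip(letters, letters[1:]):
            neighbours[left].add(right)
            neighbours[right].add(left)
    return {letter: len(others) for letter, others in neighbours.items()}

sample = "the cat sat on the mat and ate a banana"
variety = neighbour_variety(sample)
print(max(variety, key=variety.get))  # 'A', the most sociable letter here
```

Run on ciphertext, the same count points at the symbols most likely to be vowels, though a sample this small is only suggestive.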

Take it another step and you can analyze sequences of letters and how often they are found together.
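Counting sequences is a small extension of counting single letters. The following is a minimal sketch, assuming spaces and punctuation are stripped first so that sequences may span word boundaries; ngram_counts is a name chosen here for illustration.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive letters, ignoring spaces and punctuation."""
    letters = "".join(ch for ch in text.upper() if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

sample = "the other theory is there then"
trigrams = ngram_counts(sample, 3)
print(trigrams.most_common(1))  # [('THE', 5)]
```

With n=2 the same function gives pair counts.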

Basically, the plaintext language must be analyzed first and statistics gathered, so that educated guesses can be made from the frequency of letters and letter groupings.

One last thing: frequency analysis was discovered by the Arab scientist Abu Yusuf Ya`qub ibn Is-haq ibn as-Sabban ibn `omran ibn Ismail al-Kindi (or Al-Kindi for short) in the 9th century.

(idea)

by Sputnik

Thu Aug 10 2000 at 21:46:59

The letters of the English language, arranged by decreasing frequency, are:

E, T, N, R, I, O, A, S, D, H, L, C, F, P, U, M, Y, G, W, V, B, X, K, Q, J, Z

Playing the Alphabet Game on long car trips as a kid is a good way to acquire an intuitive knowledge of this.
Other language characteristics to look for when using frequency analysis include digraphs, or pairs of letters commonly written together.

(idea)

by Gritchka

Sat Feb 17 2001 at 13:34:44

The exact order of frequency given for English varies slightly from source to source. The one I memorized as a child was ETAON RISHDLU CMFGY PWBV KXJQZ, an easily pronounceable mnemonic (honestly), but it would be easier to remember the start, and therefore the most important ones, as ETAOIN SHRDLU, because this nonsense text is what you occasionally used to see in newspapers. Compositors had their trays (the upper case and the lower case) laid out in this order for ease of reach, and when they needed to space out a line they took one of each. If they forgot to remove it before printing, you saw it like this: etaoinshrdlu

thbz's node Letter frequency in several languages has it starting ETAONI HSRDLU, and gives actual percentages. Clearly E is out in front with 12.4%, T a clear second at 8.9%, and the rest behind. I once did this by hand but these days we have computers and e-texts to feed into them.

There is not much that could alter frequencies; these apply more or less over any kind of text. Scientific texts might use Y a bit more in terms taken from Greek, but on the other hand won't contain the common Y in happy, muddy. Z is extremely rare whether you write realise or realize, so that doesn't make much difference.

I haven't got figures for frequency of pairs, but thbz's node gives some URLs where they might be found. In English TH and WH are very common. (H and W are rare in most languages.) IS and AT occur in this and that as well as separately. The commonest initial letters are TAOSW. That's not according to how thick their section of the dictionary is, but from the constant repetition of TO, THE, AS, AND, AT, ON, OR, WHO, WHEN etc.

You begin breaking a substitution cipher by assuming the commonest letter is E, then the next-commonest is T, then applying the above English language patterns to see if you can find words. With luck, and if it's monoalphabetic, the job is done: it'll all fall into place from here. I once impressed a cocky schoolfellow by cracking the pattern ABC DEFF GHIHJ KHL EL in no time flat. Even a text that short followed the frequencies.
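That opening move can be sketched as pairing the ciphertext's symbols, ranked by frequency, with an English frequency order. The order used is the ETAON RISHDLU mnemonic quoted above; the function names and the sample ciphertext are assumptions for illustration, and the result is only a first guess to be corrected by hand.

```python
from collections import Counter

# Frequency order taken from the mnemonic ETAON RISHDLU CMFGY PWBV KXJQZ
ENGLISH_ORDER = "ETAONRISHDLUCMFGYPWBVKXJQZ"

def first_guess_key(ciphertext):
    """Pair the k-th most common ciphertext symbol with the k-th most common English letter."""
    counts = Counter(ch for ch in ciphertext.upper() if ch.isalpha())
    ranked = [symbol for symbol, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))

def apply_key(ciphertext, key):
    """Substitute every symbol found in the key; leave spaces and unknowns alone."""
    return "".join(key.get(ch, ch) for ch in ciphertext.upper())

# Hypothetical ciphertext: "MEET ME AT THE TREE" with each letter shifted by 3
ciphertext = "PHHW PH DW WKH WUHH"
key = first_guess_key(ciphertext)
print(apply_key(ciphertext, key))  # AEET AE OT TNE TREE
```

Only the commonest symbols land correctly on a text this short (TREE comes out right, THE does not), which is why the next step is to fix the wrong guesses using word patterns.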

This easy example preserved the spacing, which makes it a dead giveaway: first there are only two one-letter words in English; then if you've guessed at E or T by frequency and that gives a two-letter word ?E or ?T, you can try BE, ME, IT, and AT; a three-letter word T?? is almost certainly THE, which will give you an H to find THIS, THAT, WHO; and so on.
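When spacing is preserved, each cipher word can be checked against a word list by its repetition pattern and by the letters already guessed. This is a minimal sketch: word_pattern, candidates, and the tiny WORDS list are hypothetical names and data chosen for illustration, and a real attempt would use a much larger dictionary.

```python
def word_pattern(word):
    """Encode a word by order of first appearance, so 'THAT' becomes 'ABCA'."""
    seen = {}
    out = []
    for ch in word.upper():
        if ch not in seen:
            seen[ch] = chr(ord("A") + len(seen))
        out.append(seen[ch])
    return "".join(out)

def candidates(cipher_word, known, wordlist):
    """Words with the same repetition pattern that agree with letters already guessed."""
    cipher_word = cipher_word.upper()
    matches = []
    for word in wordlist:
        word = word.upper()
        if word_pattern(word) != word_pattern(cipher_word):
            continue
        # Every already-guessed cipher letter must decode to the letter in this word
        if all(known.get(c, p) == p for c, p in zip(cipher_word, word)):
            matches.append(word)
    return matches

WORDS = ["the", "tea", "too", "and", "that", "this", "who"]
# Suppose frequency counts suggested W stands for T and H stands for E
print(candidates("WKH", {"W": "T", "H": "E"}, WORDS))  # ['THE']
```

The sketch does not check that two different cipher letters decode to different plaintext letters; adding that test would prune the candidates further.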

Of course only the rankest amateur would leave the spacing as it was; or use apostrophes, so that ABC'DD gives you 'LL straight away. But the same principles make words leap out of unspaced text: TH and WH might come from HEALTHY or GET HER or SAW HIM, but are far more likely to be initial. The commonest three-letter sequence will translate as THE, which is itself the single commonest word but also occurs in such common words as THEN, THERE, OTHER.

The same principles apply to polyalphabetic ciphers, with increase in difficulty. A simple polyalphabetic substitution cipher occurs at the end of Red Shift by Alan Garner: preservation of punctuation and spacing, plus an easy guess at the opening three-word sentence, makes this easy to crack, but simple frequency analysis no longer works. For a more general solution see Breaking the Vigenère Square.

Thanks to Gethsemane for spotting the deliberate error ahem.
