What now seems like long ago, I was writing my first CGI. This was before I even heard of perl (nor had most of the world) and CGIs were written in C. This CGI was a jargon file searcher and I knew that my seplling sucked and I needed a spelling approximator in it to find near matches.

While pondering this in a class titled "Life In The Past", the lecture was about the genetic code and how mutations happen.

Within a strand of DNA, there are several types of mutations that can happen. It is possible for a base pair to be deleted, or inserted, or one substituted for the other. A substitution in the gene may not actually cause a mutation because the same amino acid is represented by several sequences. However, a insertion or deletion of a base pair will cause the entire sequence to be shifted, possibly coding a radically different proteins.
At this point, I realized that a string of characters is not that different from a string of base pairs (well, a bit larger alphabet) and these are the mistakes that happen in spelling.

Well, there's actually one more type of spelling error that people make, and that is the transposition of two characters 'hte' instead of 'the' - its not that uncommon.

  1. Given two lower case strings (the 'key' and the 'matcher')
  2. Allocate an array of integers that is the length of the longest one, initialized to '-10'
  3. Iterate through the matcher char by char:
    1. 'mpos' is the position of the matching character in the matching string
    2. 'kpos' is the first position of the matching character in the key string
    3. Array[kpos] = mpos;
    4. The character at 'kpos' is set to a space or null (thus preventing it from being matched again.
  4. At this point, there is an array with a bunch of numbers in it.
  5. Walk the array looking for the patterns:
    • 1 2 - a perfect match (15 points)
    • 1 3 - a deleted character (1 point)
    • 2 1 - transposed characters (3 points)
    • 1 ? 2 - an inserted character (5 points)
    • 1 ? 3 - a substituted character (3 points)
  6. Count the number of matches (an array value > 0) in the array (1 point each)
  7. Count the number of '-10' in the array (-2 points each)
  8. Return the sum of the points

What you say!! And all this means?

Lets take two words - 'foobar' and 'fubar' and get a number with 'fubar' as the key.

matcher: foobar
key:     fubar
array:   1_456_
pat       score
     45 = 15
     45 = 15
    4?6 =  3
match   =  4
nomatch = -4
The value of this is in comparing the key to itself. This number can be calculated from the length (l) of the word (assume l > 2) :
(15 * (l-1)) + (3 * (l-2)) + l
Thus, for a five character word ('foobar'), the value would be:
(15 * 4) + (3 * 3) + 5 = 74

So how fast is this? And how good are the matches?

Realize the time I am measuring is that of user time on a Unix system. This more closely represents the actual amount of time spent crunching rather than opening files (part of the system time). Because a Unix machine is preemptive multitasking, the wall time (how the 'apparent' time to run) is not the best indicator because other things happen (such as the java chatterbox).

Run against the standard linux dictionary (45407 words) on a moderate speed system (so its 3 years old) looking for the word 'thisisatest', it takes 0.8 user seconds to come up with the list:

  1. distastes
  2. substrates
  3. thistle
  4. instigates
  5. dissipates

Run against a larger dictionary, the time scales linearly - the 111284 word dictionary takes about three times longer (2.1 user seconds) and produces the list:

  1. sophisticates
  2. thiosulfate
  3. thionate
  4. thiophosphate
  5. distastes

So maybe 'thisisatest' is not the best choice - still, you can 'see' it trying to get a match with a word that does not exist. Furthermore, the entire thing should have a look at the values assigned and the 'key' vs 'matcher' role.