The goal of the Levenshtein distance, named after the Russian
scientist Vladimir Levenshtein (1) (also called edit distance or
edition distance) is to measure the similarities between two
strings (sequences of elements).
One string can be transformed into another thanks to three operations :
insertion of a character, deletion of a character and substitution of a
character by another.
A sequence of operations which transform one string
into another is called an edit sequence.
The Levenshtein distance is the smallest edit sequence.
The algorithm described below gives both the distance and all the
shortest edit sequences.
The most frequent usage of this algorithm is probably spell checking.
A basic spell checker will lookup every word of a text in a dictionary.
If the word is present then it is correctly spelled.
Otherwise it gives the user a list of words of the dictionary,
those with the smallest distance to the misspelled word.
Search engines such as Google also use this method (or an optimized
version of it) on keywords submitted to give more accurate results.
Since this method measures similarity, it is also used in the areas of
shape recognition, speech recognition (probably using soundex techniques as
well), plagiarism detection or even in the field of Computational Biology with DNA analysis (DNA is a sequence of A, C,
G and T proteins).
The basic exponential algorithm
Suppose you wish to compute the distance d(s1,s2) between two
strings s1 and s2.
The idea is to express d(s1,s2) as a function of d(p,q)
where p and q are smaller substrings of s1 and s2 in order
to get a recursive algorithm (which will end since a string
has a positive length).
- If s1=="" (resp. s2==""), the distance is obviously the number of
characters of s2 (resp. s1).
- If s1!="" and s2!="", then those strings can be written
s1=s1'+c1 and s2=s2'+c2 where c1 (resp. c2) is the last character
of s1 (resp. s2).
- If c1==c2, then obviously d(s1,s2) = d(s1',s2')
- If c1!=c2, then d(s1,s2) = min( d(s1',s2'+c2), d(s1'+c1,s2') )
The last point needs to be elaborated on. If the last character of
s1 and s2 isn't the same, then the shortest edit sequence
from s1 to s2 involves either deleting the last character of s1
(ie. d(s1',s2'+c2)) or deleting the last character of s2
(ie. d(s1'+c1,s2')). Since we are interested in the minimum, this
explains the use of the min function.
Because it is needed to compute two distances on shorter strings
at almost each step, the complexity of this algorithm is exponential
which renders it only of theoretical interest.
A nicer algorithm : Needleman Wunsch algorithm
There exists a dynamic programming version of the algorithm known as the Needleman Wunsch algorithm
which has a complexity of O(n2)
if n is the size of the strings s1 and s2 which makes it far more practicable
than the above algorithm obtained by induction.
The idea is to notice that in the above algorithm some distance computations
can be performed several times. Say you wish to compute d(bbcd, abdc), then
since the last character differs, you will need d(bbcd,abd) and d(bbc,abdc).
The first leads to compute d(bb,ab) and d(bbc,a). The second
d(b,abd) and d(bb, ab). As you can see, d(bb, ab) is computed more than
once which is a waste of time.
Notations : s(i..j) is the substring formed of characters i through j
of string s. s(i) is the character number i of s.
The enhanced algorithm proceeds by filling a two dimensional matrix M,
where M(i,j) is the distance between substrings s1(1..i) and s2(1..j).
- Create a matrix M of size (|s1|+1, |s2|+1).
- M(i,0) = i for i=0..|s1| since we know those distances
- M(0,j) = j for j=0..|s2| since we also know those distances
- For the rest of the matrix, M(i,j) = min of :
- M(i-1, j-1) + 1 if s1(i)!=s2(j), M(i-1,j-1) otherwise. This
corresponds to the substitution of characters s1(i) and s2(i).
- M(i-1, j) + 1. This corrsponds to the insertion of character s1(i)
or equivalently the deletion of s2(j).
- M(i,j-1) + 1. This corresponds to the insertion of character s2(i)
or equivalently the deletion of s1(i).
Levenshtein distance in action
One important thing to notice is that the costs of a deletion, insertion
or substitution can be changed to have a different value than 1. For example
say you wish to correct typos, it might be a good idea to suppose that
missing characters are very frequent (you don't press the key hard enough
because you type too fast), substitutions are frequent (you press the wrong key)
and insertions are rare. The cost of a deletion would then be 3, that of a
substitution 2 and that of an insertion 1.
You can even further extend this method by assigning variable costs for
substitution based on the metric distance between the keys.
The Unix command diff uses a Levenshtein-distance-like algorithm
to produce edit scripts. The typical use of diff is to apply
patches to source code files. This is done by applications such as CVS.
Instead of packaging the whole new file, the CVS server only sends a
small diff file which instructs the CVS client which lines to remove,
add or modify in the source code.
The Smith Waterman algorithm is a nice variant which compares all substrings and returns the two with the highest similarity score and runs in O(n2). Thanks to ariels for that addition
Closing words...
Levenshtein distance algorithm isn't the best to compute the distance
or find the shortest sequence of operations needed to transform one
string into another. However it is a good starting point since it is
really easy to implement.
The algorithm described above also gives the edit sequences. Follow
a path from the lower right corner to the upper left one, only moving
north, west and northwest (insertion, deletion and
substitution) with one constraint : the sequence of numbers you land on
must be decreasing.
Sources :
(1) V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163(4) p845-848, 1965
http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/ - a very nice web page
http://www.merriampark.com/ld.htm - has a Java applet that implements the algorithm, C++ and VB source code as well