The distance separating two character strings can be seen as the minimum number of elementary alterations to apply to a string in order to obtain a second one (cf. Wagner et Fisher 1974, and Algorithm 1). Each alteration is assigned a cost, and the sequence of alterations that gives the lower cost is sought. The elementary alterations are:

  • suppression (S)
  • insertion (I)
  • change (C)

Algorithm 1: Edition distance from the X string to the Y string (Wagner-Fisher)

Required: D (0,0)=0
Required: n = length of the X string
Required: m = length of the Y string
Required: C(a,b) = cost of the alteration from a to b
Required: lambda = void
Assure : D (m,n) = edition distance from X to Y
  For i = 0...n Do
    For j = 0...m Do
      If i != j != 0 Then
        D (i,j) = Min (D (i-1, j-1) + C (Xi, Yj), D (Xi, lambda), D (i, j-1) + C (lambda, Yi))
      End If
    End For
  End For

Milgram Milgram 1993 gives the example of the transformation of aabcb into ababd. Figure 1 gives some of the pathes allowing to go from aabcb to ababd. Given a cost of 1 for each of the alterations, the edition distance from aabcb to ababd is three. It would be different if a suppression cost was 0.5, the cost of an insertion 0.75, and that of change 0.25. Its value would then be 1.5.

Figure 1: Some of the shortest pathes from aabcb to ababd.

                                   aabcb
                                  /  |   \
                                 S   I    C
                               /     |      \
                            abcb   aabcbd   aabab
                              |       \       |
                              |         \     |
                              C           C   I
                              |             \ |
                            abab           aababd
                                \          /
                                  I      S
                                   \    /
                                   ababd

For the BAsCET's application on bibliographic references recognition, each transformation cost was given a 0.6 value, except in some change case (C). When a capital changes to its lowercase, or the opposite, its alteration cost is given a value of 0.1. It is then considered that these are almost the same characters. Indeed, it happens that a symbol beginning with a capital in the references database was translated into a symbol beginning with a lowercase. When a letter has to be changed to another letter (not to a punctuation character, or other...), 0.9 is assigned to this cost. When a digit a has to be changed to another digit b, 6 |a-b| / 100 is assigned to the cost. Thus, the cost varies between 0 (when a = b) and 0.54 depending on the mathematical distance of the two digits. It's useful when a user made a key-mistake (what's the real word? /msg me, please, if you know), but mostly when an agent looks for a number with the same length, but with a different value. A typical example is the search for a year: it is possible that the reference's year does not exist in Concept Network (in the database), but it is nevertheless useful that this year was recognized as a year, that is to say as a near neighbor of a year in the Concept Network.

The distance d between two strings a and b allows to easily compute the similarity rate s(a,b), depending also on their lengths (l(a) and l(b)): s(a,b) = ( 1 - d(a,b) ) / MAX( l(a), l(b) ) (see search for similar sub-strings).


Bibliography

Wagner et Fisher1974
R. A. Wagner et M. J. Fisher.
The string to string correction problem.
Communications of the ACM, 20(10), May 1974.
Milgram 1993
M. Milgram.
Reconnaissance des formes -- Méthodes numériques et connexionnistes.
Armand Colin, 1993.

A concept named for Vladimir Levenshtein, who published the algorithm in the 1966 paper "Binary codes capable of correcting deletions, insertions and reversals".

Levenshtein distance is measure of the distance between two data sets in terms of data insertion, deletion, or substitution. Each edit activity that required to transform data set a into data set b is assigned a point value. Relative point values between different data sets can then be used to find data sets that are most similar.

For example, if you assign one point to each edit operation, then the Levenshtein distance between hermetic and memetics is 2: substitute "m" for "h", and delete "r".

If you have used a spell checker, then you've seen the Levenshtein algorithm at work. One of the most common applications is determining what you may have been trying to say when you typed "sharp eye for detal". Most decent spell checkers also take into consideration probabilities of certain common misspellings. Really sophisticated spell checkers use soundex databases in conjunction with Levenshtein computations to decipher phonetic spellings or simply really, really bad spelling errors.

Applications based on computing the Levenshtein distance between data sets are not limited to simple text operations. Other practical applications include speech recognition, handwriting recognition, plagiarism detection software, and even DNA analysis.

Levenshtein distance is not the same as Hamming distance; Hamming distance is only used to compare data sets of the same size.

The goal of the Levenshtein distance, named after the Russian scientist Vladimir Levenshtein (1) (also called edit distance or edition distance) is to measure the similarities between two strings (sequences of elements).

One string can be transformed into another thanks to three operations : insertion of a character, deletion of a character and substitution of a character by another. A sequence of operations which transform one string into another is called an edit sequence. The Levenshtein distance is the smallest edit sequence. The algorithm described below gives both the distance and all the shortest edit sequences.

The most frequent usage of this algorithm is probably spell checking. A basic spell checker will lookup every word of a text in a dictionary. If the word is present then it is correctly spelled. Otherwise it gives the user a list of words of the dictionary, those with the smallest distance to the misspelled word. Search engines such as Google also use this method (or an optimized version of it) on keywords submitted to give more accurate results.

Since this method measures similarity, it is also used in the areas of shape recognition, speech recognition (probably using soundex techniques as well), plagiarism detection or even in the field of Computational Biology with DNA analysis (DNA is a sequence of A, C, G and T proteins).

The basic exponential algorithm

Suppose you wish to compute the distance d(s1,s2) between two strings s1 and s2. The idea is to express d(s1,s2) as a function of d(p,q) where p and q are smaller substrings of s1 and s2 in order to get a recursive algorithm (which will end since a string has a positive length).

  • If s1=="" (resp. s2==""), the distance is obviously the number of characters of s2 (resp. s1).
  • If s1!="" and s2!="", then those strings can be written s1=s1'+c1 and s2=s2'+c2 where c1 (resp. c2) is the last character of s1 (resp. s2).
    • If c1==c2, then obviously d(s1,s2) = d(s1',s2')
    • If c1!=c2, then d(s1,s2) = min( d(s1',s2'+c2), d(s1'+c1,s2') )

The last point needs to be elaborated on. If the last character of s1 and s2 isn't the same, then the shortest edit sequence from s1 to s2 involves either deleting the last character of s1 (ie. d(s1',s2'+c2)) or deleting the last character of s2 (ie. d(s1'+c1,s2')). Since we are interested in the minimum, this explains the use of the min function.

Because it is needed to compute two distances on shorter strings at almost each step, the complexity of this algorithm is exponential which renders it only of theoretical interest.

A nicer algorithm : Needleman Wunsch algorithm

There exists a dynamic programming version of the algorithm known as the Needleman Wunsch algorithm which has a complexity of O(n2) if n is the size of the strings s1 and s2 which makes it far more practicable than the above algorithm obtained by induction.

The idea is to notice that in the above algorithm some distance computations can be performed several times. Say you wish to compute d(bbcd, abdc), then since the last character differs, you will need d(bbcd,abd) and d(bbc,abdc). The first leads to compute d(bb,ab) and d(bbc,a). The second d(b,abd) and d(bb, ab). As you can see, d(bb, ab) is computed more than once which is a waste of time.

Notations : s(i..j) is the substring formed of characters i through j of string s. s(i) is the character number i of s.

The enhanced algorithm proceeds by filling a two dimensional matrix M, where M(i,j) is the distance between substrings s1(1..i) and s2(1..j).

  • Create a matrix M of size (|s1|+1, |s2|+1).
  • M(i,0) = i for i=0..|s1| since we know those distances
  • M(0,j) = j for j=0..|s2| since we also know those distances
  • For the rest of the matrix, M(i,j) = min of :
    • M(i-1, j-1) + 1 if s1(i)!=s2(j), M(i-1,j-1) otherwise. This corresponds to the substitution of characters s1(i) and s2(i).
    • M(i-1, j) + 1. This corrsponds to the insertion of character s1(i) or equivalently the deletion of s2(j).
    • M(i,j-1) + 1. This corresponds to the insertion of character s2(i) or equivalently the deletion of s1(i).

Levenshtein distance in action

One important thing to notice is that the costs of a deletion, insertion or substitution can be changed to have a different value than 1. For example say you wish to correct typos, it might be a good idea to suppose that missing characters are very frequent (you don't press the key hard enough because you type too fast), substitutions are frequent (you press the wrong key) and insertions are rare. The cost of a deletion would then be 3, that of a substitution 2 and that of an insertion 1.

You can even further extend this method by assigning variable costs for substitution based on the metric distance between the keys.

The Unix command diff uses a Levenshtein-distance-like algorithm to produce edit scripts. The typical use of diff is to apply patches to source code files. This is done by applications such as CVS. Instead of packaging the whole new file, the CVS server only sends a small diff file which instructs the CVS client which lines to remove, add or modify in the source code.

The Smith Waterman algorithm is a nice variant which compares all substrings and returns the two with the highest similarity score and runs in O(n2). Thanks to ariels for that addition

Closing words...

Levenshtein distance algorithm isn't the best to compute the distance or find the shortest sequence of operations needed to transform one string into another. However it is a good starting point since it is really easy to implement.

The algorithm described above also gives the edit sequences. Follow a path from the lower right corner to the upper left one, only moving north, west and northwest (insertion, deletion and substitution) with one constraint : the sequence of numbers you land on must be decreasing.


Sources :
(1) V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163(4) p845-848, 1965
http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/ - a very nice web page
http://www.merriampark.com/ld.htm - has a Java applet that implements the algorithm, C++ and VB source code as well

Log in or register to write something here or to contact authors.