The distance separating two character strings can be seen as the minimum number of elementary alterations to apply to a string in order to obtain a second one (cf. Wagner et Fisher 1974, and Algorithm 1). Each alteration is assigned a cost, and the sequence of alterations that gives the lower cost is sought. The elementary alterations are:

  • suppression (S)
  • insertion (I)
  • change (C)

Algorithm 1: Edition distance from the X string to the Y string (Wagner-Fisher)

Required: D (0,0)=0
Required: n = length of the X string
Required: m = length of the Y string
Required: C(a,b) = cost of the alteration from a to b
Required: lambda = void
Assure : D (m,n) = edition distance from X to Y
  For i = 0...n Do
    For j = 0...m Do
      If i != j != 0 Then
        D (i,j) = Min (D (i-1, j-1) + C (Xi, Yj), D (Xi, lambda), D (i, j-1) + C (lambda, Yi))
      End If
    End For
  End For

Milgram Milgram 1993 gives the example of the transformation of aabcb into ababd. Figure 1 gives some of the pathes allowing to go from aabcb to ababd. Given a cost of 1 for each of the alterations, the edition distance from aabcb to ababd is three. It would be different if a suppression cost was 0.5, the cost of an insertion 0.75, and that of change 0.25. Its value would then be 1.5.

Figure 1: Some of the shortest pathes from aabcb to ababd.

                                   aabcb
                                  /  |   \
                                 S   I    C
                               /     |      \
                            abcb   aabcbd   aabab
                              |       \       |
                              |         \     |
                              C           C   I
                              |             \ |
                            abab           aababd
                                \          /
                                  I      S
                                   \    /
                                   ababd

For the BAsCET's application on bibliographic references recognition, each transformation cost was given a 0.6 value, except in some change case (C). When a capital changes to its lowercase, or the opposite, its alteration cost is given a value of 0.1. It is then considered that these are almost the same characters. Indeed, it happens that a symbol beginning with a capital in the references database was translated into a symbol beginning with a lowercase. When a letter has to be changed to another letter (not to a punctuation character, or other...), 0.9 is assigned to this cost. When a digit a has to be changed to another digit b, 6 |a-b| / 100 is assigned to the cost. Thus, the cost varies between 0 (when a = b) and 0.54 depending on the mathematical distance of the two digits. It's useful when a user made a key-mistake (what's the real word? /msg me, please, if you know), but mostly when an agent looks for a number with the same length, but with a different value. A typical example is the search for a year: it is possible that the reference's year does not exist in Concept Network (in the database), but it is nevertheless useful that this year was recognized as a year, that is to say as a near neighbor of a year in the Concept Network.

The distance d between two strings a and b allows to easily compute the similarity rate s(a,b), depending also on their lengths (l(a) and l(b)): s(a,b) = ( 1 - d(a,b) ) / MAX( l(a), l(b) ) (see search for similar sub-strings).


Bibliography

Wagner et Fisher1974
R. A. Wagner et M. J. Fisher.
The string to string correction problem.
Communications of the ACM, 20(10), May 1974.
Milgram 1993
M. Milgram.
Reconnaissance des formes -- Méthodes numériques et connexionnistes.
Armand Colin, 1993.