The distance separating two character strings can be seen as the minimum number of elementary alterations to apply to a string in order to obtain a second one (cf. Wagner et Fisher 1974, and Algorithm 1). Each alteration is assigned a cost, and the sequence of alterations that gives the lower cost is sought. The elementary alterations are:
- suppression (S)
- insertion (I)
- change (C)
Algorithm 1: Edition distance from the X string to the Y string (Wagner-Fisher)
Required: D (0,0)=0
Required: n = length of the X string
Required: m = length of the Y string
Required: C(a,b) = cost of the alteration from a to b
Required: lambda = void
Assure : D (m,n) = edition distance from X to Y
For i = 0...n Do
For j = 0...m Do
If i != j != 0 Then
D (i,j) = Min (D (i-1, j-1) + C (Xi, Yj), D (Xi, lambda), D (i, j-1) + C (lambda, Yi))
End If
End For
End For
Milgram Milgram 1993 gives the example of the transformation of aabcb into ababd. Figure 1 gives some of the pathes allowing to go from aabcb to ababd. Given a cost of 1 for each of the alterations, the edition distance from aabcb to ababd is three. It would be different if a suppression cost was 0.5, the cost of an insertion 0.75, and that of change 0.25. Its value would then be 1.5.
Figure 1: Some of the shortest pathes from aabcb to ababd.
aabcb
/ | \
S I C
/ | \
abcb aabcbd aabab
| \ |
| \ |
C C I
| \ |
abab aababd
\ /
I S
\ /
ababd
For the BAsCET's application on bibliographic references recognition, each transformation cost was given a 0.6 value, except in some change case (C). When a capital changes to its lowercase, or the opposite, its alteration cost is given a value of 0.1. It is then considered that these are almost the same characters. Indeed, it happens that a symbol beginning with a capital in the references database was translated into a symbol beginning with a lowercase. When a letter has to be changed to another letter (not to a punctuation character, or other...), 0.9 is assigned to this cost. When a digit a has to be changed to another digit b, 6 |a-b| / 100 is assigned to the cost. Thus, the cost varies between 0 (when a = b) and 0.54 depending on the mathematical distance of the two digits. It's useful when a user made a key-mistake (what's the real word? /msg me, please, if you know), but mostly when an agent looks for a number with the same length, but with a different value. A typical example is the search for a year: it is possible that the reference's year does not exist in Concept Network (in the database), but it is nevertheless useful that this year was recognized as a year, that is to say as a near neighbor of a year in the Concept Network.
The distance d between two strings a and b allows to easily compute the similarity rate s(a,b), depending also on their lengths (l(a) and l(b)): s(a,b) = ( 1 - d(a,b) ) / MAX( l(a), l(b) ) (see search for similar sub-strings).
Bibliography
- Wagner et Fisher1974
-
R. A. Wagner et M. J. Fisher.
The string to string correction problem.
Communications of the ACM, 20(10), May 1974.
- Milgram 1993
-
M. Milgram.
Reconnaissance des formes -- Méthodes numériques et connexionnistes.
Armand Colin, 1993.