The Problem

Consider two short files with the following content:

File 1
Mary had a little lamb whose fleece was white as snow
File 2
Mary had a dog and Mary had a little lamb with fleece as white as snow
The exercise is to find the minimal set of changes to go from one file to the other. This is often needed, for example, to create patches for programs. A patch is an update and it should be as small as possible so as to accurately reflect the extent of a change.

Common Solutions

Many file comparison programs match file streams as early as possible and then use a limited-size window to try to re-synchronize. A possible output from such a strategy, for my example, would be:

3 words match
(Mary had a)
discard 5 words
(little lamb whose fleece was)
insert 10 words
(dog and Mary had a little lamb with fleece as)
3 words match
(white as snow)
This is correct but obviously far from optimal.

My Strategy

  1. Take the entire texts to be compared and "slide" them against each other (i.e. start comparing with different offsets into the files) until a relative offset is found that has more matches than any other (or is tied for first place).

    For the example, if we start 5 words into the second file, we get a match like this (the matching words line up in columns):

                       Mary had a little lamb whose fleece was white as snow
    Mary had a dog and Mary had a little lamb with  fleece as  white as snow
    (that's 9 matching words, the maximum possible for this sample).

  2. Find the longest consecutive run of matching items. This may well be in the middle of everything, so there will usually be text in one or both files before it and after it.

    In our example, Mary had a little lamb wins hands down.

  3. Identify the chunks of text from both files before this match and the chunks of text from both files after this match. Apply this same strategy to each set, starting with (1).

    In our example, the "before" set consists of

    –nothing–
    against
    Mary had a dog and
    That's easy: it's a 5 word insert, and we're done. The "after" set is
    whose fleece was white as snow
    against
    with fleece as white as snow
    which will need another two rounds or so of this algorithm.

Assuming we do our bookkeeping right, we end up with 5 inserted words and two replacements for the example, which is the optimum result.
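
For the curious, here is the whole strategy in miniature as a Python sketch. Python is not what my real program is written in, and the names (best_offset, longest_run, diff) and the word-list representation are made up purely for illustration; the real program works on lines, and the search-order refinement described under Implementation Details below is left out. A replacement shows up as a delete/insert pair.

    def best_offset(a, b):
        # Slide the sequences against each other: a[i] is compared with
        # b[i - off], so off = -5 means we start 5 items into b, exactly
        # as in the worked example.  Return the offset with the most
        # matches (the first one wins a tie) and that match count.
        best_off, best_matches = 0, -1
        for off in range(-(len(b) - 1), len(a)):
            lo, hi = max(0, off), min(len(a), len(b) + off)
            matches = sum(1 for i in range(lo, hi) if a[i] == b[i - off])
            if matches > best_matches:
                best_off, best_matches = off, matches
        return best_off, best_matches

    def longest_run(a, b, off):
        # Longest consecutive run of equal items at this offset.
        # Returns (start in a, start in b, length of the run).
        best_len, best_start, run = 0, 0, 0
        lo, hi = max(0, off), min(len(a), len(b) + off)
        for i in range(lo, hi):
            run = run + 1 if a[i] == b[i - off] else 0
            if run > best_len:
                best_len, best_start = run, i - run + 1
        return best_start, best_start - off, best_len

    def diff(a, b):
        # Recursively build a list of ("same" | "delete" | "insert", items)
        # chunks; a replacement appears as a delete/insert pair.
        if not a or not b:
            return ([("delete", a)] if a else []) + ([("insert", b)] if b else [])
        off, matches = best_offset(a, b)
        if matches == 0:
            return [("delete", a), ("insert", b)]
        i, j, n = longest_run(a, b, off)
        return (diff(a[:i], b[:j])
                + [("same", a[i:i + n])]
                + diff(a[i + n:], b[j + n:]))

    file1 = "Mary had a little lamb whose fleece was white as snow".split()
    file2 = "Mary had a dog and Mary had a little lamb with fleece as white as snow".split()
    for op, items in diff(file1, file2):
        print(op, " ".join(items))
    # Prints the five-word insert, the two replaced words as delete/insert
    # pairs, and the three matching runs -- the optimum for the example.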

Discussion

This solution is admittedly brute force. Many computer scientists will scream at this. I'm not sure, but I think the worst-case behavior for this algorithm is O(n⁴). That's grounds for defenestrating one of their number from the highest ivory tower. For shame!

My response is: I know, and I don't care. This is a typical case of know your problem domain: this algorithm is intended to work on program source files, and source files tend to have a maximum size well within this algorithm's ability to complete in a tolerable time frame on modern hardware.

Implementation Details

  • My example deals with individual words, but program patches deal in lines. This algorithm calls for many comparisons, and we certainly don't want to compare entire lines against one another, character by character. So my program computes a CRC value for each line. Thus, each line is represented by a single 32-bit number. With a probability of billions to one, each number corresponds uniquely to a line with a certain content. When all the shuffling is done, the program compares matched lines by content, just to avoid those off-chance false positives. During the matching phase, working with numbers is of course much faster than working with lines of text. (A short sketch of this hashing idea appears after this list.)
  • I make casual mention of "sliding" the files against one another. This wants to be done right! A careful choice of strategy here will go a long way toward preventing the program from displaying its worst-case behavior. Just as the A* algorithm for shortest path finding starts out by looking in the direction of the target, my approach starts with mutual file offsets of 0. A single run-through will identify identical files, and I'm done. Failing this, I increase the offset one by one, alternating between files. So I compare:
    all of file 1 against file 2 starting at line 2;
    all of file 2 against file 1 starting at line 2;
    all of file 1 against file 2 starting at line 3;
    all of file 2 against file 1 starting at line 3;
    and so on. As soon as all this sliding leaves me comparing fewer lines than the maximum number that matched so far, I know it will never get any better, so I can stop. (The second sketch after this list shows this search order and its cutoff.)
  • Sooner or later, someone will try to use a program for a job it's not intended for. There is no source file at my workplace longer than about 10,000 lines, and I believe that any programmer who writes a module longer than this needs to be taken out and shot. However, there are various data files around which are much bigger. Therefore, to keep people from abusing my program, I've hardwired a limit of 15,000 lines into the program; it refuses to process files larger than this, although it could. This is a design decision based on the program's expected problem domain.
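
For illustration, here is roughly what the per-line CRC idea looks like in Python, using the standard zlib.crc32; my program naturally uses whatever the 2200 offers, and the function names below (line_keys, confirm_matches) are invented for this sketch.

    import zlib

    def line_keys(lines):
        # Represent every line by its 32-bit CRC so the matching phase
        # shuffles single integers instead of whole lines of text.
        return [zlib.crc32(line.encode("utf-8")) for line in lines]

    def confirm_matches(lines_a, lines_b, matched_pairs):
        # Once the matching is done, compare matched lines by content to
        # weed out the off-chance false positive from a CRC collision.
        return [(i, j) for (i, j) in matched_pairs if lines_a[i] == lines_b[j]]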
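
And here, under the same offset convention as the sketch in My Strategy (a[i] compared against b[i - off]), is the search order with its cutoff: offset 0 first, then distances 1, 2, 3, ... alternating between the two files, stopping as soon as the shrinking overlap can no longer beat the best match count found so far. Again an illustrative sketch (best_offset_with_cutoff is a made-up name), not my production code.

    def best_offset_with_cutoff(a, b):
        def count(off):
            # Number of matching items when a[i] is compared with b[i - off].
            lo, hi = max(0, off), min(len(a), len(b) + off)
            return sum(1 for i in range(lo, hi) if a[i] == b[i - off])

        # Offset 0 first: identical files are fully matched on this pass.
        best_off, best_matches = 0, count(0)
        for k in range(1, max(len(a), len(b))):
            # Sliding farther only shrinks the number of lines still being
            # compared, so once neither offset at distance k overlaps by
            # more than the best match count so far, we can stop.
            candidates = [off for off in (-k, k)   # -k: file 2 starts k lines in;
                                                   # +k: file 1 starts k lines in
                          if min(len(a), len(b) + off) - max(0, off) > best_matches]
            if not candidates:
                break
            for off in candidates:
                matches = count(off)
                if matches > best_matches:
                    best_off, best_matches = off, matches
        return best_off, best_matches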

Results

My program, based on this algorithm, has been in use for about five years at my workplace and has largely replaced its predecessor, a systems utility developed at the University of Maryland. It runs on a powerful UNISYS 2200 mainframe and usually yields results in less than 5 seconds, rarely requiring more than 20. The older program was known for occasional catastrophic failures, while mine consistently produces results in agreement with "common sense".


Ariels has graciously pointed me at some prior art from the "real" computer scientists, currently finding extensive use in bioinformatics:

  • There's the Needleman-Wunsch algorithm, which tallies up match scores in an m x n matrix. The algorithm is elegant; I envy the authors for their smarts. Alas, my creaky mainframe doesn't have virtual memory, so Needleman and Wunsch would have crashed and burned long before reaching my arbitrary limit of 15,000 lines. Also, it looks to me like the amount of work done is O(n³). (A textbook sketch of its scoring table appears after this list.)
  • The Smith-Waterman algorithm looks like a cross between Needleman-Wunsch and mine: It pinpoints local matches but allows much finer weighting of penalties for differences. Again, at the heart of this algorithm is an array that only a RAM chip manufacturer could love.
  • Nowadays, a popular "standard" differencing tool is GNU diff, which by default runs an algorithm by Eugene Myers. At the time I wrote this program, I was aware of diff but not the origin of its algorithm. I was then and still am loath to try to analyze diff's source code. However, diff's documentation warns that diff will by default try to cut corners by ignoring the initial and final set of matching lines, which may yield a less than optimum result. Of course, there is an option to prevent this behavior. Also, for files too large to fit in memory, diff will delegate the job to an older, multi-pass algorithm. It seems like the authors of diff have thought of everything, but that's precisely the kind of headache my brute force approach blissfully sidesteps.

    Ironically, they finally got a C compiler working on the 2200 last year. This means that it is now an option to port the actual diff code.
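
For anyone curious, here is roughly what the heart of Needleman-Wunsch looks like as another illustrative Python sketch, in its common textbook form with a simple linear gap penalty; the unit match/mismatch/gap scores are placeholders I picked, not anything from a real bioinformatics package. The full (m+1) x (n+1) score table is the part my machine cannot afford: at my 15,000-line limit that's over 200 million cells before a single alignment is traced back.

    def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
        # Fill the full (m+1) x (n+1) score table for a global alignment.
        # The table is the memory hog: it normally has to be kept so the
        # alignment itself can be traced back from the bottom-right cell.
        m, n = len(a), len(b)
        score = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            score[i][0] = i * gap
        for j in range(1, n + 1):
            score[0][j] = j * gap
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                step = match if a[i - 1] == b[j - 1] else mismatch
                score[i][j] = max(score[i - 1][j - 1] + step,  # align the two items
                                  score[i - 1][j] + gap,       # gap in b
                                  score[i][j - 1] + gap)       # gap in a
        return score[m][n]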

Alas, all this new information won't inspire me to write a new version of my cheerfully ponderous sub-optimal optimal difference utility: I have other projects keeping me busy and I heed Alf's Axiom: "If it's not broken, don't kick it."
