longest common subsequence (thing) by Tom Rook

The longest common subsequence problem (LCS) appears frequently in introductions to dynamic programming, because a nice DP algorithm falls out of it naturally. Let's first do a quick review of what the LCS means.

A subsequence of a string is any string that can be obtained by removing characters from it, without changing the order of any of the characters left in it. For example, the string abc has eight subsequences: abc, ab, bc, ac, a, b, c, and the empty string. The LCS of two strings A and B is the longest string that is a subsequence of both A and B. The LCS is not necessarily unique; witness The Alchemist's example above, where the longest common subsequences of apple and pale are ale and ple. The length of the LCS is unique; though. (This stands to reason, since two strings cannot both be the longest if they have different lengths!) When we say "the LCS", then, we normally mean "any longest common subsequence," with the understanding that you and I may obtain different answers and yet both be right.

To find the LCS of non-empty two strings A and B, we will consider the final characters of each string. Let us denote the final character of the first string as az, and the rest of the string as a. Similarly, we will separate the second string into b and bz, where bz is its final character and b is everything before that. Now:

If az = bz, we can conclude that the LCS ends in that character az. (We can prove this by contradiction: suppose LCS(A, B) does not end in az. Then we could add az to the end of it, creating a longer common subsequence. So our original LCS was a fake.) The rest of the LCS then comes from the LCS of a and b. The above example illustrates this case; since apple and pie both end in e, LCS(apple, pale) is given by LCS(appl, pal) + e. In turn, LCS(appl, pal) = LCS(app, pa) + l.

If az != bz, then we know that we can remove the last character of A or B, and LCS(A, B) will stay the same. (Again, this can be proved by contradiction: if we could not remove either of the final characters without changing the LCS, then both az and bz must be the final character of the LCS. This violates the assumption that az != bz.) The question is, which string is it whose final character we can remove -- A or B? We cannot easily tell; so the best we can say is that LCS(A, B) is whichever of LCS(A, b) and LCS(B, a) is longer. Continuing the example above, LCS(app, pa) is the longer of LCS(app, p) and LCS(pa, ap).

We have set up a nice recursion here! All that's left is to specify what happens when one of the strings is empty (since an empty string has no "final character" to speak of). This is easy: if A or B is empty, LCS(A, B) is the empty string. So in summary, we have the following definition for LCS: Let last(X) denote the last character of X, and rest(X) denote everything before the last character. Then

LCS(A, B) = "" if A or B is empty;
otherwise, LCS(rest(A), rest(B)) + last(A) if A and B end in the same character;
otherwise, the longer of LCS(A, rest(B)) and LCS(B, rest(A)).

This definition suggests the following Perl implementation. We use memoization (provided by the Memoize module from CPAN) to avoid an exponential running time.

# lcs.pl
use strict;
use Memoize;

sub longerOf {
    my ($x, $y) = @_;
    return (length $x > length $y) ? $x : $y;
}

memoize('lcs');
sub lcs {
    my ($a, $b) = @_;

    if ($a eq "" || $b eq ""){
        return "";
    }

    my ($az, $bz) = (chop $a, chop $b);
    if ($az eq $bz){
        return lcs($a, $b) . $az;
    } else {
        return longerOf(
                  lcs($a . $az, $b),
                  lcs($b . $bz, $a));
    }
}

while (1){
    print "1: "; my $a = <>; chomp $a;
    print "2: "; my $b = <>; chomp $b;
    print "LCS: ", lcs($a, $b), "\n\n";
}

This program loops, taking two strings and printing their LCS. A sample run looks like this, where program output is in bold and user input is non-bold:

1: apple
2: pale
LCS: ple

Hit Ctrl-C to end the loop.

Our algorithm has worst-case running time O(mn), where m and n are the lengths of the strings. This is pretty good, and close to the optimal.

The LCS algorithm pops up now and again in unexpected places. One example is the following problem: Given a sequence of numbers, find the longest non-decreasing subsequence contained in it. For example, the longest non-decreasing subsequence of 1, 4, 3, 5, 8, 6, 7 is 1, 3, 5, 6, 7 (or 1, 4, 5, 6, 7). Can you think of a way to use the LCS algorithm to solve this? The answer is given below, in ROT13.

Yrg'f pnyy gur bevtvany neenl N. Znxr n pbcl bs vg pnyyrq O, gura fbeg O. Eha gur ypf nytbevguz ba N naq O. Rira gubhtu jr ner ab ybatre qrnyvat jvgu fgevatf bs punenpgref ohg jvgu neenlf bs ahzoref, gur nytbevguz jbexf gur fnzr.

A good treatment of the LCS algorithm is given in Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein. I got my information from that book and from a class I took (which used that for its text).

dynamic programming	Proof that there exist irrational a,b for which a^b is rational	Proof by contradiction	Computers and Intractability: A Guide to the Theory of NP-Completeness
Memoization	Subsequence	Bioinformatics	EDB reads his Message Inbox
EDB is my Everything2 Deity	Euclid's algorithm	exponential	Now and Again
heuristic	loop	output	input
unique	String	algorithm	Perl