## What's the problem?

You want to know the value of a string. You had the string, in fact you had several copies, but someone's smashed them all into bits. What you have now is fragments of the original string with overlaps, but no clue to where they overlap, or to the order they were in. What you want to do, of course, is to recreate the original string and find whatever interesting messages are lurking within.

## Is this the sort of problem I'm likely to have?

That depends. Are you interested in assembling the human genome? If so, you're in luck. The technique used to read huge amounts of genes (such as the 3,000,000,000 in the human genome) is known as shotgun sequencing. Basically, the DNA is smashed to bits and then read; leaving exactly the sort of problem outlined above.

## So how do I do it?

What you're trying to do is find the shortest common superstring of all these substrings. Unfortunately, this problem is NP-Hard, which is basically Computer Science speak for 'bloody time-consuming'. It is difficult to get the correct solution, but sometimes a very good solution is good enough, and there's a quicker way to a near-as-dammit answer.

These things are always easiest to explain with an example, so let's get one of those, shall we?

## Example Problem

We have the following fragment of a string floating about and want to recreate the original:

aswhit		hite		tlela
little		lelam		tsfle
ecewas		leecew

Now, I'm sure the slightly perceptive of you can probably guess what the original string was, but imagine thousands of strings, each thousands of characters long, containing only the characters a, c, g and t. Not so smug now, are you?

### Step One: Build a graph

Now, ASCII art doesn't lend itself well to this sort of thing, so you're going to have to use your imaginations (or a pen and paper). Ready? Good. For each fragment draw an arrow from it to any other fragment it overlaps with marked a numberm, which is the length of the overlap. For example, for the fragment 'maryha' the overlaps look like this*:

maryha
aswhit
^------ 1

maryha
^^^^------ 4

maryha
^^^------ 3

So you draw lines from 'maryha' to each of those other fragments, marked with the correct overlap size. Now draw links to the fragments that overlap with the three other fragments and then link other fragments linked to other fragments and so on, until the page is a mess of fragments, lines, arrowheads and numbers.

### Step 2: Sort edges by length

Look round the graph and pick out all the edges (the links) and put them in a list sorted by the length (the value).

leecew	-> ecewas  (4)
aswhit	-> hite	   (3)
lelam	-> lambit  (3)
hite	-> teass   (2)
teass	-> ssnow   (2)
tlela	-> lambit  (2)
ecewas	-> aswhit  (2)
tsfle	-> leecew  (2)
little	-> leecew  (2)
little	-> lelam   (2)
maryha	-> aswhit  (1)
aswhit	-> tlela   (1)
lambit  -> teass   (1)
lambit	-> tsfle   (1)
tsfle	-> ecewas  (1)

### Step 3: Iteratively choose largest edge

For each iteration, take the largest (highest weighted) edge with the following properties:

• It does not start or finish the same as the finish of a previously chosen edge (or it would form a cycle)
• It does not start the same as the start of a previously chosen edge (or we'd build a tree)

### Step 3.1

We choose leecew -> ecewas, since it's the highest edge.

### Step 3.3

Let's take some more edges out of our list:

leecew	-> ecewas  (4)
aswhit	-> hite	   (3)
lelam	-> lambit  (3)
hite	-> teass   (2)
teass	-> ssnow   (2)
tlela	-> lambit  (2)
ecewas	-> aswhit  (2)
tsfle	-> leecew  (2)
little	-> leecew  (2)
little	-> lelam   (2)
maryha	-> aswhit  (1)
aswhit	-> tlela   (1)
lambit  -> teass   (1)
lambit	-> tsfle   (1)
tsfle	-> ecewas  (1)

### Steps 3.4 - 3.9

The following steps all follow the same pattern as the previous ones. I think it would make a pretty boring node if I showed all the steps so we'll skip them and go on to the last step - I promise you can trust me that this is indeed how it works out.

### Step 4: Putting it all together

So let's have a look at all the edges we've chosen:

leecew -> ecewas
aswhit -> hite
lelam  -> lambit
hite   -> teass
teass  -> ssnow
lambit -> tsfle

Then we can put these together in an order using full overlaps, as follows:

lelam > lambit lambit > tsfle
leecew > ecewas
aswhit > hite hite > teass teass > ssnow

Or, to show it better: