contig - Everything2.com

A contig is an unambiguously assembled length of DNA sequence that does not overlap any other contig. Contigs start and end where there is not enough sequence data to resolve the ambiguity between them.

To really understand what contigs are, though, we need to take a wider look at how genomes are sequenced.

Let's start off by looking at what exactly DNA is. DNA is a large molecule that encodes the fundamental "recipe" for an organism. Proteins and other parts of a cell read the DNA and convert the information it contains into a new protein, enzyme, or other useful molecule. Repeat this a few trillion times and you have a complex organism: a human being, for one.

DNA itself encodes information, and a lot of it. To pack this huge amount of information into something so small, it has to be very clever about how it stores that information. So, each little piece of information is stored as one of four distinct nucleotide bases: guanine, adenine, thymine, and cytosine. These four "letters" make up our genetic alphabet. By translating these letters, very complex molecules can be made, and given enough complex molecules, we have a complex organism.

The four "letters" (guanine, adenine, thymine, and cytosine) are often recorded by scientists in their data as four very sensible letters: G, A, T, and C, respectively.

Now, how can we figure out what these letters are? How can we know what words and phrases they are spelling out; what recipes they are describing? Welcome to the wonderful world of molecular biology: biochemistry and genetics.

Actually retrieving the order of the "letters" is a very difficult task, but our scientists are up to the challenge. Over the last fifty years, scientists have developed techniques that can retrieve random pieces of the DNA sequence, each about 500 "letters" in length. (Well, random for the purposes of this description; there are some methods to reduce the randomness, but this is not a primer on biochemistry.) It works this way because we basically shatter the DNA and then pick up whatever pieces we can. We have no idea where in the sequence these pieces come from; we just know that these little sequences of letters are correct and that they are in fact found somewhere in the DNA sequence.

Now, the human genome measures roughly 3 billion base pairs ("letters") in length. Since we can only retrieve little pieces, we have a mighty big puzzle to solve. We solve this puzzle by shattering the DNA many times and sequencing many, many little pieces. There is a great deal of overlap, of course, but it is this overlap that allows us to figure out the order of the sequence.
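To get a feel for the size of the puzzle, here is a rough back-of-the-envelope sketch. The genome length and piece length are the figures quoted above; the function name reads_needed and the "8 times over" figure are assumptions made up purely for illustration, not values from any real sequencing project.

```python
import math

GENOME_LENGTH = 3_000_000_000  # base pairs ("letters"), figure quoted in the text
READ_LENGTH = 500              # "letters" per retrieved piece, figure quoted in the text

def reads_needed(genome_length, read_length, coverage):
    """Pieces needed so that the total letters sequenced equal
    coverage * genome_length (hypothetical helper for illustration)."""
    return math.ceil(coverage * genome_length / read_length)

# Covering every letter exactly once, with no overlap at all:
print(reads_needed(GENOME_LENGTH, READ_LENGTH, 1))  # 6000000

# Covering every letter 8 times on average (8 is an assumed value):
print(reads_needed(GENOME_LENGTH, READ_LENGTH, 8))  # 48000000
```

Even the no-overlap case needs six million pieces, and without overlap there would be nothing to tell us how the pieces fit together.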

The basic unit of the overall sequence is the contig. A contig, as was stated at the start, is an unambiguously assembled length of DNA sequence that does not overlap any other contig. Taking all of these little pieces and assembling them logically into contigs is an extremely large problem, and one that makes up a big part of the emerging field of bioinformatics. Let's take a simple look at this problem.

Let's say we have a sequence that we don't know anything about. Here is this sequence:

It is the opinion of the entire staff that Dexter is criminally insane.

Our sequencing technique is pretty poor: we can only get pieces of length sixteen. So we shatter the "sequence" and start retrieving the little pieces. After a while, these are the pieces that we come up with:

Sequence 1:  "It is the opinio"
Sequence 2:  "iminally insane."
Sequence 3:  "n of the entire "
Sequence 4:  "ff that Dexter i"
Sequence 5:  "the opinion of t"
Sequence 6:  "is criminally in"
Sequence 7:  "staff that Dexte"
Sequence 8:  "that Dexter is c"
Sequence 9:  "pinion of the en"
Sequence 10: "Dexter is crimin"
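The shattering step can be imitated in a few lines. This is only a toy sketch: the helper name shatter, the fixed seed, and the choice of ten pieces are assumptions for illustration, and the pieces it draws will generally differ from the ten listed above.

```python
import random

def shatter(sequence, read_length=16, n_reads=10, seed=0):
    """Pick n_reads random pieces of a fixed length from the sequence.
    Hypothetical helper: real sequencing is far messier than this."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same pieces
    last_start = len(sequence) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    # The starting positions are thrown away, just as in the real experiment.
    return [sequence[start:start + read_length] for start in starts]

text = "It is the opinion of the entire staff that Dexter is criminally insane."
reads = shatter(text)
for number, read in enumerate(reads, start=1):
    print(f'Sequence {number}: "{read}"')
```

Each piece is guaranteed to occur somewhere in the original sentence, but nothing in the output says where.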

What we do with these is start trying to line them up as best we can. There are myriad ways of doing this, and it is a very hot field of scientific research at the moment: dealing with millions of short sequences and putting them together into larger ones, all pieces of something billions of characters in length, is a very large problem indeed. (The sequences are stored on the computer as the letters A, C, G, and T, as mentioned above.) Let's assemble these together to see what we get:

Sequence 1:  "It is the opinio"
Sequence 2:                                                         "iminally insane."
Sequence 3:                  "n of the entire "
Sequence 4:                                     "ff that Dexter i"
Sequence 5:        "the opinion of t"
Sequence 6:                                                    "is criminally in"
Sequence 7:                                  "staff that Dexte"
Sequence 8:                                        "that Dexter is c"
Sequence 9:             "pinion of the en"
Sequence 10:                                            "Dexter is crimin"
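The line-up above can be checked mechanically by asking how much of the end of one piece matches the start of another. The sketch below is one simple way to do that; the function name overlap and the minimum length of 3 are assumptions chosen for this toy example.

```python
def overlap(left, right, min_length=3):
    """Length of the longest end of `left` that is also the start of `right`,
    or 0 if the match is shorter than min_length (hypothetical helper)."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

seq1 = "It is the opinio"
seq5 = "the opinion of t"
seq3 = "n of the entire "
seq7 = "staff that Dexte"

print(overlap(seq1, seq5))  # 10, the shared text is "the opinio"
print(overlap(seq3, seq7))  # 0, nothing ties these two pieces together
```

Trying the longest candidate first means the first match found is the longest one. Requiring a minimum length keeps one-letter coincidences from counting as evidence.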

It's clear that sequences 1, 3, 5, and 9 line up into one group, and the rest line up in another. There are a lot of ways to verify this line-up and grouping; solving that problem in an optimal way is a major part of the computational problem of contig assembly. From the sequences above, we can get the following contigs:

Sequence 1:  "It is the opinio"
                    ||||||||||
Sequence 5:        "the opinion of t"
                         |||||||||||
Sequence 9:             "pinion of the en"
                              |||||||||||
Sequence 3:                  "n of the entire "
-----------------------------------------------
Contig 1:    "It is the opinion of the entire "


Sequence 7:  "staff that Dexte"
                 |||||||||||||
Sequence 4:     "ff that Dexter i"
                    |||||||||||||
Sequence 8:        "that Dexter is c"
                         |||||||||||
Sequence 10:            "Dexter is crimin"
                                |||||||||
Sequence 6:                    "is criminally in"
                                     |||||||||||
Sequence 2:                         "iminally insane."
------------------------------------------------------
Contig 2:    "staff that Dexter is criminally insane."
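The assembly above can be reproduced with a simple greedy strategy: keep merging whichever two pieces share the longest overlap until no pair overlaps any more. This is a minimal sketch for the toy sentence only. The names overlap and greedy_assemble and the minimum overlap of 3 are assumptions for illustration, and real assemblers must also cope with errors and repeats, which this one ignores.

```python
from itertools import permutations

def overlap(left, right, min_length):
    """Longest end of `left` that is also the start of `right` (0 if too short)."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair again and again; what is left are contigs."""
    contigs = list(reads)
    while True:
        best_size, best_pair = 0, None
        for a, b in permutations(range(len(contigs)), 2):
            size = overlap(contigs[a], contigs[b], min_length)
            if size > best_size:
                best_size, best_pair = size, (a, b)
        if best_pair is None:  # no remaining pair overlaps, so we are done
            return contigs
        a, b = best_pair
        merged = contigs[a] + contigs[b][best_size:]
        contigs = [c for i, c in enumerate(contigs) if i not in (a, b)]
        contigs.append(merged)

reads = [
    "It is the opinio", "iminally insane.", "n of the entire ",
    "ff that Dexter i", "the opinion of t", "is criminally in",
    "staff that Dexte", "that Dexter is c", "pinion of the en",
    "Dexter is crimin",
]
for contig in sorted(greedy_assemble(reads)):
    print(f'"{contig}"')
```

With these ten pieces the loop stops at two contigs, because no piece spans the gap between "entire " and "staff", so nothing joins the two halves.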

And that's how contigs are assembled. Consider the enormous length of the sequence being assembled, the biological fact that some parts of a DNA sequence are harder to break up than others (meaning the pieces aren't randomly distributed), and the errors that inevitably creep in when dealing with such a large amount of information. It becomes clear that the assembly of contigs from the genetic data that we do have is a difficult but vital topic of research.
