I can only offer you this. The human genome is ~3Gbp long, or 6Gbits. Luckily, I have a copy (HGP output!) stashed away somewhere...

% cat ./NCBI/DNAinput*[^x] | gzip -c > /home/ariels/abc.gz
% gzip -l /home/ariels/abc.gz
compressed  uncompr. ratio uncompressed_name
875526472 2940172494  70.2% ../../abc

Which is really pathetic. Even accounting for various housekeeping information we store in the files (apart from the genomic sequence), I'd expect to get down to 25% of the original size at least, i.e. a ratio of 75% as gzip reports it (the amount for straightforward 2-bit per bp scrunching).

bzip2 doesn't do much better -- it gives a compressed size of 803381365 bytes, for a ratio of ~72.7% -- still less than the minimal decent ratio of 75%!
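For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the ratios from the byte counts quoted above. The 2-bit baseline assumes, hypothetically, that every byte of the input is one base stored as one character; the real files also hold housekeeping text, so treat that figure as a rough yardstick only.

```python
# Byte counts copied from the gzip -l output and the bzip2 figure quoted above.
UNCOMPRESSED = 2_940_172_494
GZIP_SIZE = 875_526_472
BZIP2_SIZE = 803_381_365


def savings_percent(compressed: int, original: int) -> float:
    """Space saved in percent, which is how `gzip -l` reports its ratio."""
    return 100.0 * (1.0 - compressed / original)


# Hypothetical baseline: each input byte is one base, repacked into 2 bits.
two_bit_size = UNCOMPRESSED * 2 // 8

print(f"gzip : {savings_percent(GZIP_SIZE, UNCOMPRESSED):.1f}% saved")
print(f"bzip2: {savings_percent(BZIP2_SIZE, UNCOMPRESSED):.1f}% saved")
print(f"2-bit: {savings_percent(two_bit_size, UNCOMPRESSED):.1f}% saved "
      f"({two_bit_size} bytes)")
```

Both general-purpose compressors land above the naive 2-bit size, which is the whole complaint.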

The compression metaphor above is formalized by information entropy. Without entering equation territory, the question we ask has two steps:

  • For a given word size (two pairs, three pairs, ...), how unique is a given word in your genome, and
  • What's the "total uniqueness"?

Information entropy answers those things in a beautiful way that's both simple and intricate. It's also closely connected to the compression answer.
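One way to make the two steps concrete is a minimal Python sketch. It reads "how unique is a word" as the surprisal -log2(p) of that word and "total uniqueness" as the frequency-weighted average of the surprisals, which is the Shannon entropy. The function name `kmer_entropy` and the use of overlapping words are my own assumptions for illustration.

```python
import math
from collections import Counter


def kmer_entropy(seq: str, k: int) -> float:
    """Shannon entropy, in bits per word, of the overlapping k-letter words in seq."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(words)
    total = len(words)
    # Each word contributes its probability times its surprisal, -log2(p).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    toy = "ACGT" * 100  # a toy sequence, not real genomic data
    for k in (1, 2, 3):
        print(k, kmer_entropy(toy, k))
```

For single letters the toy sequence scores the full 2 bits, because all four bases are equally common. For longer words it stays near 2 bits instead of growing towards 2k, because the repeat makes each word predictable from the one before. That gap is what a compressor gets to exploit.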

If we know something about structure, then we might have to trot out Kolmogorov complexity. For instance, the digits of pi provide high-entropy noise that won't compress well, but the knowledge that it's pi we're talking about gives us well-behaved series for generating the data, and those fit in a really small space. But if we don't, that's the question you should be asking: not how many bits of data, but how many bits of entropy?
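The pi example can be tried out directly. The Python sketch below uses Gibbons' unbounded spigot algorithm (my choice of generator, not something the write-up specifies) to produce digits, then hands them to zlib. A general-purpose compressor can shrink ASCII digits towards the floor of about 3.32 bits per digit, but it has no way of discovering that a dozen lines of code would reproduce them all.

```python
import zlib
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            # Not settled yet: fold in one more term of the series.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)


digits = "".join(str(d) for d in islice(pi_digits(), 1000))
packed = zlib.compress(digits.encode("ascii"), 9)

print(digits[:10])  # 3141592653
print(len(digits), "digits ->", len(packed), "bytes after zlib")
```

The generator never gets any longer, however many digits you ask for, while the zlib output keeps growing in proportion to the digit count.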

So, there's a simple answer, and then a series of more complicated ones:

The easy option

You can download the most recent assembly of the human genome from here. It's in minimal 2bit format (i.e. each base pair is encoded as two bits), so it's pretty close to pure genetic code, and comes in at 778 MB, or about 6526337024 bits.
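To show what "two bits per base pair" means in practice, here is a minimal Python sketch. The A/C/G/T to 0/1/2/3 mapping and the function names are assumptions for illustration, not the layout of the downloadable file, which also has to carry headers and bookkeeping that this ignores. The last lines redo the size arithmetic, reading "MB" as 2^20 bytes, which is what reproduces the quoted bit count.

```python
# Assumed mapping for illustration only; a real file format may order bases differently.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Recover the first `length` bases from bytes produced by pack_2bit."""
    return "".join(BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3]
                   for i in range(length))


seq = "GATTACA"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack_2bit(packed, len(seq)))

# Size arithmetic for the quoted download, taking 1 MB = 2**20 bytes.
size_bits = 778 * 2**20 * 8
print(size_bits, "bits, room for", size_bits // 2, "bases")
```

The length has to be stored separately, because the last byte may be only partly filled.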

Problem solved

But it's not that simple

Of course, that would make for a very boring write-up: the truth is that it's rather more complicated. For a start, the human genome assembly is not actually complete: there are large chunks of highly repetitive DNA sequence that it doesn't include, such as the telomere sequences at the ends of chromosomes, because it's hard to sequence repetitive sequences accurately and we know what the basic repeat pattern is. So our original figure is probably on the low side.

And it gets worse

And then there's the question of epigenetics. As well as the base DNA sequence, there's a whole host of chemical modifications that can be made either to the DNA or to the histones, the proteins which pack up DNA and store it in a compact state so that it can fit inside the cell. Well-known modifications include methylation of cytosines at CpG sites (a C followed by a G) and methylation and phosphorylation of particular histone proteins, all of which have been shown to have a direct effect on the expression of particular genes. Then there are higher levels of genome organisation, such as the higher-level structure of DNA and its packaging proteins, collectively known as chromatin. There's even a higher-level argument that since proteins are inherited between cells, the level of a particular protein could be a kind of heritable information.

How important these epigenetic factors are is a serious topic of debate at the moment: some theorise that they are entirely determined by the genetic code, while others claim they represent a separate source of heritable information. For the moment, they seem to be important, but mostly determined by the genome: a few years back, it was discovered that by expressing a combination of four different genes, you could "reset" most cells into stem cells, which is a pretty strong argument that the genome is the main mechanism for information storage. For the purposes of this discussion, I'll cheat, and say that the question posed by the node refers to the human genome, so we can safely ignore the epigenome.

Everybody's Special

Then there's the question of variation: you may have noticed, but people tend to be different to each other. This is reflected in the genome: there is no single "Human Genome", just lots of individual genomes. Common differences include variable numbers of repeats at specific loci in the genome (these are often used for forensic DNA evidence), or variations in single letters of the genome (known as Single Nucleotide Polymorphisms, or SNPs): a full description of "The" human genome should include data on all these variations and their relative frequencies. Obviously, this would dramatically increase our original estimate.

Signal or Noise?

Of course, if we ignore the epigenetics, the variations, and the bits of the genome we can't sequence, we can look the other way, and ask how much of the human genome is really necessary to make a human. Let's consider the sequence biologically first: it's pretty obvious that a lot of the genome is not actually biologically necessary for making a human: large chunks of it are useless repetitive sequence, or self-replicating sequences such as transposons. The problem is that, apart from a few areas, it's very hard to work out what's actually useful. For a while it was thought that most of the genome was unnecessary because it didn't produce any proteins, but now we know that a lot of it performs functions that aren't immediately obvious. Beyond saying that it's less than the whole genome, the question of how much of the genome is biologically relevant is still pretty open.

Then, even having our theoretical, biologically minimal genome, which contains exactly the DNA required to make a human, and no more, there is one final factor we could consider to reduce the bits in the genome, and that is mathematical compression, which should reduce the size of the genome considerably. If you were really hardcore, I imagine that you could write your own compression algorithm to compress the information, as genetic code has some specific idiosyncrasies that would be very amenable to compression. Sadly, I'm not a sufficiently good coder to have ever given this a go, so I can't tell you what the result would be.

In summary

Ultimately the question is too complicated, and contains too many unknown variables, for an accurate answer at the moment. Most scientists stick with the basic genome assembly as a reasonable approximation, for which it works pretty well.
