The compression metaphor above is formalized by information entropy. Without entering equation territory, the question we ask has two steps:
- For a given word size (two pairs, three pairs, ...): how unique is a given word in your genome?
- And across all the words: what's the "total uniqueness"?
Information entropy answers both questions in a way that's at once simple and intricate. It's also closely connected to the compression answer.
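As a rough sketch of the idea, here's a small Python function (the name `kmer_entropy` and the toy sequences are mine, not from any genomics library) that counts the words of a given size in a sequence and computes the Shannon entropy of their frequency distribution:

```python
import math
from collections import Counter

def kmer_entropy(seq: str, k: int) -> float:
    """Shannon entropy, in bits, of the k-letter word distribution in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A repetitive sequence has few distinct words, so little surprise per word:
print(kmer_entropy("ATATATATATAT", 2))  # low: only "AT" and "TA" occur
# A varied sequence approaches log2(number of distinct words):
print(kmer_entropy("ACGTTGCAAGCT", 2))  # higher: almost every 2-mer is unique
```

The per-word "uniqueness" is the surprise of each word (rare words carry more bits), and the entropy is the average surprise over the whole distribution, which is one reasonable reading of "total uniqueness."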
If we know something about the structure, we might have to trot out Kolmogorov complexity instead. The digits of pi, for instance, look like high-entropy noise and won't compress well, yet the knowledge that it's pi we're talking about gives us well-behaved series that regenerate the data from a really small program. But if we don't know the structure, that's the question you should be asking: not how many bits of data, but how many bits of entropy?
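A crude way to see the data-versus-entropy gap in practice is to use an off-the-shelf compressor as an entropy probe. This sketch (using Python's standard `zlib`; the helper name `ratio` is mine) compares a highly structured 100 kB input with 100 kB of random bytes:

```python
import os
import zlib

def ratio(data: bytes) -> float:
    """Compressed size as a fraction of original size: a crude entropy probe."""
    return len(zlib.compress(data, 9)) / len(data)

structured = b"ACGT" * 25_000   # 100 kB, highly repetitive: low entropy
noisy = os.urandom(100_000)     # 100 kB of random bytes: maximal entropy

print(ratio(structured))  # tiny: the compressor finds the structure
print(ratio(noisy))       # about 1.0 or slightly above: nothing to squeeze
```

Both inputs are the same number of bits of data, but they hold very different numbers of bits of entropy, and a general-purpose compressor can only exploit the difference it can see; pi's digits would land in the "noisy" bucket here, even though their Kolmogorov complexity is tiny.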