A database designed to capture the relationships between the genes of
different genomes.
Due to the explosion in data from sequencing projects, it has become
essential to relate genes in one organism to genes in another. Therefore,
homologs are grouped into families of genes such that those from different
species are orthologs and those from within a species are paralogs.
Although this might seem like circular logic, given the definitions of
these two terms, there is a difference between the protein and
the gene that codes for it. The genetic code is degenerate, in
that several triplets (codons) can code for the same amino acid, so
different gene sequences can encode the same protein.
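
The degeneracy mentioned above can be illustrated with a minimal Python sketch. It assumes the standard genetic code; the names CODON_TABLE, translate and codons_for, and the two example sequences, are hypothetical and chosen only for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in the order
# TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence codon by codon (no stop handling)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def codons_for(amino_acid):
    """Return every codon that codes for the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


# Two made-up genes that differ at the nucleotide level ...
gene_1 = "ATGCTTTCA"
gene_2 = "ATGTTGAGC"
# ... but encode the same short peptide.
print(translate(gene_1), translate(gene_2))
print(codons_for("L"))
```

Comparing the two translations shows why a comparison at the protein level can agree where a comparison at the gene level does not.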
The minimal COG is a triangle of so-called 'best hits' between orthologs or orthologous groups of paralogs. So if 'A', 'B' and 'C' are orthologs from three different species, and 'A' and 'a' are paralogs, then ABC is a triangle (A-B, A-C and B-C) of orthologs and
(Aa)(Bb)(Cc) is a triangle of pairs of paralogs.
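
The triangle idea can be sketched in Python as follows. This is not the database's actual procedure: it assumes, as a simplification, that two genes are linked only when each is the other's best hit, and it treats every gene as a single node (groups of paralogs such as (Aa) would have to be collapsed into one node beforehand). The function find_triangles and the dictionaries genome_of and best_hits are hypothetical names with made-up data.

```python
from itertools import combinations


def find_triangles(best_hits, genome_of):
    """Return trios of genes from three genomes joined by best-hit edges.

    best_hits[gene][genome] is the best hit of `gene` in `genome`.
    genome_of[gene] is the genome that `gene` belongs to.
    """
    def linked(x, y):
        # Assumption: an edge requires the best hit to hold in both directions.
        return (best_hits.get(x, {}).get(genome_of[y]) == y
                and best_hits.get(y, {}).get(genome_of[x]) == x)

    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        from_three_genomes = len({genome_of[g] for g in trio}) == 3
        if from_three_genomes and all(linked(x, y)
                                      for x, y in combinations(trio, 2)):
            triangles.append(trio)
    return triangles


# Toy data: A, B and C sit in three genomes; D is a second gene in genome g3
# whose best hits are not returned by A or B.
genome_of = {"A": "g1", "B": "g2", "C": "g3", "D": "g3"}
best_hits = {
    "A": {"g2": "B", "g3": "C"},
    "B": {"g1": "A", "g3": "C"},
    "C": {"g1": "A", "g2": "B"},
    "D": {"g1": "A", "g2": "B"},
}
print(find_triangles(best_hits, genome_of))
```

In the toy data only A, B and C close a triangle; D points at A and B, but neither points back at D, so it is left out.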
This is most useful when most of the genes in a group code for known
proteins, as the unknown members of the group can be assigned a (tentative)
function. Since a large fraction of the genes in sequenced genomes have no
known function, this field of Structural Genomics could be useful.
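
As a rough illustration of assigning a tentative function, the Python sketch below copies the most common known annotation in a group to its unannotated members. The majority rule, the function name propose_functions and the example annotations are all assumptions made for this sketch, not something the database is stated to do.

```python
from collections import Counter


def propose_functions(group):
    """Suggest a tentative function for each unannotated gene in a group.

    `group` maps gene names to a known function, or to None when unknown.
    Returns only the newly proposed (tentative) assignments.
    """
    known = [function for function in group.values() if function is not None]
    if not known:
        return {}  # nothing to transfer from
    consensus, _count = Counter(known).most_common(1)[0]
    return {gene: consensus for gene, function in group.items()
            if function is None}


# Made-up group: two annotated members and one unknown.
example_group = {"A": "DNA ligase", "B": "DNA ligase", "C": None}
print(propose_functions(example_group))
```

Any function proposed this way remains tentative until it is confirmed by other evidence.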