A database designed to capture the relationships between genomes.
Due to the explosion in data from sequencing projects, it has become
essential to relate genes in one organism to genes in another.
Therefore, homologs are grouped into families of genes such that
those from different species are orthologs and those from within a
species are paralogs.
Although this might seem like circular logic, given the definitions of
these two terms, there is a difference between the protein and
the gene that codes for it. The genetic code is degenerate, in
that several triplets (codons) can code for one amino acid, so two
genes with different nucleotide sequences can encode the same protein.
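
The degeneracy can be illustrated with a small sketch. The two gene sequences below are made up for illustration, and the codon table is only the part of the standard genetic code needed for them; the function names are likewise hypothetical.

```python
# Partial codon table from the standard genetic code.
# Only the codons used in this example are listed.
CODON_TABLE = {
    "ATG": "M",
    "CTT": "L", "TTA": "L",
    "GGT": "G", "GGG": "G",
    "AAA": "K", "AAG": "K",
}

def translate(dna):
    """Translate a DNA string three bases at a time using the partial table."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing incomplete codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Two invented gene fragments that differ at the nucleotide level.
gene_1 = "ATGCTTGGTAAA"
gene_2 = "ATGTTAGGGAAG"

print(translate(gene_1), translate(gene_2))
print(count_differences(gene_1, gene_2), "nucleotide differences")
```

Both fragments translate to the same four amino acids even though they differ at several nucleotide positions, which is why a comparison at the gene level and one at the protein level are not the same thing.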
The minimal COG is a triangle of so-called 'best hits' between orthologs or orthologous groups of paralogs. So if 'A', 'B' and 'C' are orthologs from three different species, and 'A' and 'a' are paralogs, then ABC is a triangle (A-B, A-C and B-C) of orthologs and
(Aa)(Bb)(Cc) is a triangle of pairs of paralogs.
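
A minimal sketch of how such a triangle might be detected is given below. It assumes that a 'best hit' edge means a mutual (reciprocal) best hit, which is an assumption of this example and not something stated above. The gene names, genome labels and the BEST_HIT table are invented toy data, and handling of paralog pairs such as (Aa) is left out.

```python
from itertools import combinations

# Hypothetical toy data: gene -> genome it belongs to.
GENOME_OF = {"A": "g1", "B": "g2", "C": "g3", "D": "g3"}

# Hypothetical best hits: (query gene, target genome) -> best-scoring gene there.
BEST_HIT = {
    ("A", "g2"): "B", ("A", "g3"): "C",
    ("B", "g1"): "A", ("B", "g3"): "C",
    ("C", "g1"): "A", ("C", "g2"): "B",
    # D finds A and B, but neither A nor B finds D in return.
    ("D", "g1"): "A", ("D", "g2"): "B",
}

def is_mutual_best_hit(x, y):
    """True if x's best hit in y's genome is y, and vice versa."""
    return (BEST_HIT.get((x, GENOME_OF[y])) == y
            and BEST_HIT.get((y, GENOME_OF[x])) == x)

def find_triangles():
    """Return gene trios from three different genomes linked pairwise by best hits."""
    triangles = []
    for trio in combinations(sorted(GENOME_OF), 3):
        genomes = {GENOME_OF[g] for g in trio}
        if len(genomes) == 3 and all(
            is_mutual_best_hit(x, y) for x, y in combinations(trio, 2)
        ):
            triangles.append(trio)
    return triangles

print(find_triangles())
```

With this toy data only A, B and C form a triangle: D is excluded because its hits are not returned. Checking every trio is fine for a handful of genes but would be too slow for whole genomes.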
This is most useful when most of the genes in a group code for known
proteins, as the unknown members of the group can be assigned a
(tentative) function. Since a large fraction of genes in sequenced
genomes have no known function, this field of Structural Genomics
could be useful.
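
The idea of assigning a tentative function can be sketched as follows. The rule used here, giving each unannotated member the most common annotation among the annotated members, is an assumption made for illustration; the group contents and the function name tentative_functions are hypothetical.

```python
from collections import Counter

def tentative_functions(group):
    """Suggest a function for unannotated genes in one group.

    `group` maps gene name -> known function, or None if unknown.
    Returns a dict of gene -> suggested function for the unknown genes only.
    """
    known = [function for function in group.values() if function is not None]
    if not known:
        return {}  # nothing to transfer if no member is annotated
    consensus, _count = Counter(known).most_common(1)[0]
    return {gene: consensus for gene, function in group.items() if function is None}

# Invented example group: two annotated members, one unknown.
cog = {"A": "DNA ligase", "B": "DNA ligase", "C": None}
print(tentative_functions(cog))
```

Any such assignment remains tentative until it is confirmed experimentally.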