A database designed to capture the relationships between genes across genomes. With the explosion of data from sequencing projects, it has become essential to relate genes in one organism to genes in another. Homologs are therefore grouped into families of genes such that members from different species are orthologs and members from within the same species are paralogs.

Although this might seem like circular logic, given the definitions of these two terms, there is a difference between the gene and the protein it codes for. The genetic code is degenerate, in that several triplets (codons) code for one amino acid, so two genes can differ in nucleotide sequence yet encode the same protein. The minimal COG is a triangle of so-called 'best hits' between orthologs or orthologous groups of paralogs. So if 'A', 'B' and 'C' are orthologs from three different genomes, and 'A' and 'a' are paralogs within one genome, then ABC is a triangle (A-B, A-C and B-C) of orthologs and (Aa)(Bb)(Cc) is a triangle of pairs of paralogs.
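The triangle idea can be illustrated with a minimal sketch. It assumes that best hits have already been computed and are supplied as unordered gene pairs, which is a simplification; the function name `find_triangles`, the gene labels and the genome labels are all hypothetical and are not taken from the database itself.

```python
from itertools import combinations

def find_triangles(genome_of, best_hits):
    """Return gene triples from three different genomes in which
    every pair of genes is linked by a best hit.

    genome_of: dict mapping gene name -> genome name (hypothetical labels)
    best_hits: iterable of (gene, gene) pairs, treated as unordered (assumption)
    """
    linked = {frozenset(pair) for pair in best_hits}
    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        # All three genes must come from different genomes.
        if len({genome_of[g] for g in trio}) != 3:
            continue
        # All three edges (A-B, A-C, B-C) must be present.
        if all(frozenset(edge) in linked for edge in combinations(trio, 2)):
            triangles.append(trio)
    return triangles

# Toy data: A, B and C form a triangle; D hits A but not B, so it is left out.
genome_of = {"A": "g1", "B": "g2", "C": "g3", "D": "g3"}
best_hits = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
triangles = find_triangles(genome_of, best_hits)
print(triangles)
```

The sketch checks every triple by brute force, which is fine for a toy example but would not scale to whole genomes. It also does not merge paralogs into groups such as (Aa) before looking for triangles.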

The grouping is most useful when most of the genes in a group code for proteins of known function, as the uncharacterized members of the group can then be assigned a (tentative) function. Since a large fraction of the genes in sequenced genomes have no known function, this field of Structural Genomics could be useful.
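As a rough illustration of tentative function assignment, the sketch below proposes, for each unannotated gene, the functions already known for other members of its group. The function name `propose_functions`, the group identifiers and the annotations are hypothetical, and the proposals are candidates to be checked rather than established functions.

```python
def propose_functions(cog_members, known_function):
    """Map each unannotated gene to the sorted list of functions
    known for other members of the same group.

    cog_members: dict mapping group id -> list of gene names (hypothetical)
    known_function: dict mapping gene name -> function label (hypothetical)
    """
    proposals = {}
    for members in cog_members.values():
        # Collect the distinct functions already known within this group.
        functions = sorted({known_function[g] for g in members if g in known_function})
        for gene in members:
            # Only genes without an annotation receive a tentative proposal.
            if gene not in known_function and functions:
                proposals[gene] = functions
    return proposals

# Toy data: C inherits a candidate function; group Y has nothing to transfer.
cog_members = {"COG_X": ["A", "B", "C"], "COG_Y": ["P", "Q", "R"]}
known_function = {"A": "kinase", "B": "kinase"}
proposals = propose_functions(cog_members, known_function)
print(proposals)
```

A group with no annotated member yields no proposals, which matches the point that the grouping helps most when most members are already characterized.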