An interesting pair of proteins (trpA and trpB), these form one of the best-known obligate enzyme complexes. trpA cleaves indole-glycerol phosphate (IGP) to release indole as an intermediate. This is then channeled to the active site of its partner protein (trpB), which combines it with serine to make tryptophan. Tryptophan is an essential amino acid, which the cell cannot afford to make too much or too little of. The trp operon regulates expression of all the enzymes responsible for its biosynthesis.

Oddly enough, the chemical translation of the primary sequence of this protein has the distinct and distinctly dubious honour of being the 'longest word' in the English language. There are a number of problems with this.

Firstly, there is no reason for this 'word'. It starts with "methionylglutaminylarginyltyrosyl" and continues for another 1880 letters. In fact, if you really needed to refer to this protein in conversation or in print you would simply say "tryptophan synthetase". This is not just because of the sheer ponderous length of a 2000-letter word, but also because the long form does not convey any meaningful information about the protein.

Of course, there is meaning in the name. To get at it, you have to understand that proteins are chains of amino acids, and that each of these 20 or so organic molecules has a common name (such as alanine, methionine, or tryptophan) but also a three-letter code ('ala', 'met', and 'trp') and a one-letter code ('A', 'M', and 'W').

The 2000 or so letter 'word' that is tryptophan synthetase's 'chemical name' is really just a huge concatenation of the common names (or, rather, the radicals, like 'methionyl' rather than 'methionine') for each of the protein's amino acids. A far better way to write this information would use the one-letter code. So "methionylglutaminylarginyltyrosyl" becomes simply "MQRY", since "methionyl" refers to methionine which has the code "M", "glutaminyl" refers to glutamine which has the code "Q", and so on. Tyrosine is represented by a 'Y' because Threonine is 'T'.
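To make the translation concrete, here is a minimal sketch that goes the other way, from one-letter code to 'chemical name'. The table and the function name `chemical_name` are my own for illustration. It assumes the usual convention that every residue except the last is written as a radical, while the last keeps its full name (which, for aspartic and glutamic acid, includes a space).

```python
# One-letter code -> (radical used inside the chain, full name used for the last residue).
AMINO_ACIDS = {
    "A": ("alanyl", "alanine"),
    "R": ("arginyl", "arginine"),
    "N": ("asparaginyl", "asparagine"),
    "D": ("aspartyl", "aspartic acid"),
    "C": ("cysteinyl", "cysteine"),
    "Q": ("glutaminyl", "glutamine"),
    "E": ("glutamyl", "glutamic acid"),
    "G": ("glycyl", "glycine"),
    "H": ("histidyl", "histidine"),
    "I": ("isoleucyl", "isoleucine"),
    "L": ("leucyl", "leucine"),
    "K": ("lysyl", "lysine"),
    "M": ("methionyl", "methionine"),
    "F": ("phenylalanyl", "phenylalanine"),
    "P": ("prolyl", "proline"),
    "S": ("seryl", "serine"),
    "T": ("threonyl", "threonine"),
    "W": ("tryptophyl", "tryptophan"),
    "Y": ("tyrosyl", "tyrosine"),
    "V": ("valyl", "valine"),
}


def chemical_name(sequence):
    """Join radicals for all residues but the last, which keeps its full name."""
    sequence = sequence.upper()
    if not sequence:
        return ""
    radicals = [AMINO_ACIDS[residue][0] for residue in sequence[:-1]]
    return "".join(radicals) + AMINO_ACIDS[sequence[-1]][1]


# Methionine, glutamine, arginine, tyrosine as a stand-alone four-residue peptide.
print(chemical_name("MQRY"))
```

Because the peptide ends at tyrosine here, the output finishes in "tyrosine" rather than "tyrosyl"; in the full protein the chain carries on, so the radical form is used.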

Very simple, really. So rather than print out the 1913-letter name you would instead use the 267-letter 'primary sequence', which is the one-letter code referred to before. There are entire databases of these sequences, just as there are of DNA, RNA, and oligosaccharides. Which brings me to my next point: even if we accepted that the 'full' name was a real word, with some valid use, it certainly wouldn't be unique. More importantly, it definitely wouldn't be the longest.

The longest known protein is undoubtedly the muscle protein called titin. Hair is made of protein, of course, and that's much, much longer, but also very simple. Of course, this is all fairly academic - the point is that there are longer primary sequences than that of tryptophan synthetase. Titin is a monstrous 33,423 amino acids long, and this translates to a chemical 'name' of an astounding 237,205 letters. Clearly this beats the measly 2000 letters of tryptophan synthetase by a factor of more than 100!

If we are really being picky, there are DNA sequences that could be transformed in a similar way. That calculation is left as an exercise for the reader...


If you want to check this yourself, I used the sequence of an isoform of titin with Entrez accession number NP_596869. Any titin would do, I suppose, although it does exhibit sequence variability. The Python script used is as follows:

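# s: letters in each full amino acid name (used for the final residue only).
s = {'a':7,'c':8,'e':13,'d':13,'g':7,'f':13,'i':10,'h':9,'k':6,'m':10,'l':7,'n':10,'q':9,'p':7,'s':6,'r':8,'t':9,'w':10,'v':6,'y':8}
# r: letters in each radical ('methionyl' = 9, and so on), used for every other residue.
# Caution: 'alanyl' has 6 letters and 'glutaminyl' has 10, so the 'a' and 'q' entries look miscounted; the total quoted above was computed with these values.
r = {'a':8,'c':9,'e':8,'d':8,'g':6,'f':12,'i':9,'h':8,'k':5,'m':9,'l':6,'n':11,'q':8,'p':6,'s':5,'r':7,'t':8,'w':10,'v':5,'y':7}
# Each line of titin.seq is assumed to be a position number, then space-separated blocks of lowercase residues.
with open("titin.seq") as seq:
    onelettername = "".join("".join(line.rstrip("\n").split(" ")[1:]) for line in seq)
print(sum(r[letter] for letter in onelettername[:-1]) + s[onelettername[-1]])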
It is pretty obfuscated, simply because I like generator expressions, not for any good reason. Anyone who wants the name can contact me, and I will /msg it to them...
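For anyone who finds the one-liner hard to follow, the parsing step can be sketched more plainly. This is only an illustration: the function name `read_numbered_sequence` is made up, the sample lines are invented rather than taken from titin, and it assumes a file layout in which each line may start with a position number followed by whitespace-separated blocks of residues.

```python
def read_numbered_sequence(lines):
    """Return the one-letter sequence from numbered, block-formatted lines."""
    residues = []
    for line in lines:
        fields = line.split()  # splits on any whitespace, ignoring leading spaces
        if fields and fields[0].isdigit():
            fields = fields[1:]  # drop the leading position number
        residues.extend(fields)
    return "".join(residues).lower()


# Invented sample lines in the assumed layout (not real titin data).
sample = [
    "  1 mqry acde",
    "  9 fghi",
]
print(read_numbered_sequence(sample))
```

Splitting on any whitespace, rather than on single spaces, means lines with leading padding or blank lines do not break the lookup in the letter-count tables.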
