In computational linguistics, one area of study is the production of methods for automatic grouping and classification of texts. This may involve identifying the author, the body of knowledge, or the actual language that the text should be identified with. As it turns out, the humble unix compression utility, gzip, can provide us with an effective tool in this endeavour.

When we look at two texts, we have an intuitive notion of what we could call the "semantic distance" between them - two works by Shakespeare are obviously (to us) more closely related than, say a work by Shakespeare and one by Theonomist. We use subtle (or not so subtle) clues in the language to make this judgement.

So an interesting problem for computational linguistics is to find ways for a computer to make similar judgements. To do this, we need to have an algorithmic method for evaluating semantic distance. Often, word and phrase counting is used for this purpose, and can be quite effective, because different authors, different bodies of knowledge, different languages, all have their own statistical signatures, unique (ish) patterns in the relative frequencies of different textual elements.

In a recent paper1, the authors use gzip to create a computable measure of semantic distance, based on what the authors call the relative entropy between two corpora (collections of texts, that is), roughly as follows.

Suppose we have two corpora, A and B. The standard notation for the length of A is |A|. We use gzip to compress A, producing the file we'll call z(A). Since gzip is a compression tool, it's very likely that |A| > | z(A) |, which is just to say that the length of A is greater than the length of z(A), the compressed version.

Now we take our other collection, B, and take one text from it. We'll call this 'b'. We stick b on the end of A, call this A+b, and compress that, giving us z(A+b).

Now we can look at the difference in length between z(A) and z(A+b). That's the simple sum | z(A+b) | - | z(A) |, which we can notate as DAb.

So far so good.

Now do the same for B and B+b, giving us another simple sum: | z(B+b) | - | z(B) |, which we can call DBb.

We can now use the difference between these two figures to estimate how "semantically close" b is to A.

Here's why: suppose A is a collection of Latin scholarship, and B is the text of all the nodes by theonomist. When we use gzip to compress A, it works roughly by building a dictionary of the most common textual elements in the texts comprising A, and then outputting this dictionary and a set of references into this dictionary which when resolved give us back the original text. Because the texts in A have a basic similarity, the dictionary created will tend to become 'optimal' for this kind of text. But the nodes of E2 scholar theonomist will tend to use different textual elements, so when we use the dictionary appropriate to the Latin scholars to compress theo's text, we won't get such efficient compression. In other words, we'd expect DAb to be larger than DBb.

So we can use the sum DAb - DBb to estimate the overall similarity of the patterns of textual elements in b to those in A. Since the actual figure will depend also on the length of the texts we start with, we can just divide the whole thing by DBb, to even things out, giving the formula (DAb - DBb) / DBb.

We can also do the opposite sum, (DBa - DAa) / AAa, using a text 'a' from A, to estimate its closeness to theo's nodes, comprising B.

Adding these two together, we get a generic formula for the 'relative entropy' between our two corpora, A and B, which is written SAB:

SAB   =   (DAb - DBb) / DBb   +   (DBa - DAa) / DAa

Using these formulae, the authors of the paper referenced above achieved some pretty startling results. Using texts by great Italian authors from past centuries, they attempted to identify the author by looking for the lowest distance (DAb - DBb) / DBb between the text in question (b) and the body of the author's other work (A), producing the following table:

```
Author            Texts       Identified correctly

Alighieri           8               8

D'Annunzio          4               4

Deledda            15              15

Fogazzaro           5               4

Guicciardini        6               5

Machiavelli        12              12

Manzoni             4               3

Pirandello         11              11

Salgari            11              10

Svevo               5               5

Verga               9               7

---------------------------------------

totals:            90              84

```
In all but one case, wherever the author was not identified correctly, taking the next closest 'distance' produced the correct result.

Not satisfied with this, they applied their other formula (for SAB) to over 50 Euro-Asiatic translations of the Universal Declaration of Human Rights (possibly the most widely translated document in the world) to compile a "distance matrix", containing all the distances SAB for each distinct pair of translations.

Feeding this matrix into an algorithm normally used for constructing phylogenetic trees, they were able to produce a 'language tree' that exhibited the usual groupings made by linguists: Romance, Celtic, Germanic, Finno-Ugrian, Slavic, Baltic and Altaic. In line with orthodoxy, their method isolated the Basque language and the Afro-Asiatic language Maltese (the latter isolated, presumably, since they were considering no other African languages).

And all done with gzip!

1. Language Trees and Zipping
Dario Benedetto, Emanuele Caglioti, Vittorio Loreto
http://xxx.lanl.gov/abs/cond-mat/0108530

Regards the writeup below, perhaps I should clarify the issues as I see them..

Firstly, I don't think it is seriously in doubt that the technique of Benedetto et al. can produce the results described above. I feel that this alone is enough to absolve me for bringing the results to the attention of the noding public. (In case it has escaped attention, the title of the writeup places the emphasis on gzip, not on "semantic distance", etc.)

Secondly, it is in error to expect a close fit between the significance of scientific work as seen from the journalistic and scientific points of view. Everyone knows the press are generally hopeless at science (I hope!)

Finally, I am forced to admit chronic amusement at the fact that (the no doubt eminent) Joshua Goodman's letter, complaining (at length, though without benefit of the adventurous typography seen below) about the propriety of publication, was itself rejected.

In my view, Benedetto et al.'s result, while it may not be a significant advance in its field, has some intrinsic interest, is worth publishing, and seems not to have been published before. Perhaps if those readers who were upset by its publication did not attach so much importance to it themselves, the unfortunate controversy described below would not have occured. Even our (now anonymous) critic below admits his ire is inspired more by the authors' response to Goodman's outraged critique than by the actual subject of my writeup. This outrage, it seems, is directly proportional to the professional vanity and self-importance of the outragee. Dare we say boundary posturing?

STOP! THE ABOVE WAS A TRICK WRITEUP!
JerboaKolinowski, you should know better than to bait them with that tripe...

Before proceeding with showing you how the above research is lower than a "cool hack" and closer to "poisonous disinformation", let me explain the origin of my venom. While I could stomach Benedetto et al's "Language Trees and Zipping",1 reading it as physicists poorly but earnestly playing computer scientists, I could take no more after reading Benedetto et al's facile and, moreover, smug reply to Joshua Goodman's open letter on their faulty metholodogy, where they write: "the concept of entropy now belongs to Computer Science (sic !)" [sic]. (Yes, you didn't misread that, they actually call it "Computer Science (sic !)")

Let me translate that for you, as the last three words embody the tone of their "research" letter: "Computer science is an actual science? You mean we cannot ignore the fundamentals of experimental research design and get away with our childish fucking antics? Says you!"

How dare you impugn the validity of computer science as a scientific disclipine?

You, you miserable worms, who cannot even conduct proper and meaningful research!

For that, I will tear your arms off, starting ... now:

The most important flaw in their research can be inferred from the above writeup by even those with little background in the field of NLP. Examine the experimental design, as well as the results presented. Benedetto et al would like you to think that their algorithm did "well." Well compared to what? Apparently, compared to their intuitive notion of what "well" is on a data set of their choosing.

The accuracy of an algorithm for any sort of classification makes no sense by itself on one specific domain, especially a simple domain with an absurdly low number of test values. You cannot say an algorithm in a vacuum is "good". "Good" compared to what? It derives its validity from contextualization, namely by comparison to a baseline algorithm. A baseline algorithm is a simple, straightforward algorithm to solve the given problem. An algorithm is "good" if it is competitive with other algorithms for the same problem or, if the algorithm solves a novel problem, at least if the algorithm is competitive with the baseline algorithm.

Here is how the experiment would have proceeded if Benedetto et al knew even the slightest thing about computer science:

• Implement the compression kludge algorithm.
• Implement the baseline algorithm, say Naive Bayesian Classification.
• Test both algorithms on your made up data set with twenty test instances.
• Obtain 95-99% accuracy with kludge algorithm. Obtain 99% (or perhaps even 100%) accuracy with baseline algorithm.
• Discard made up data set for difficult, large data set used by other people researching competitive algorithms.
• Disappear in a cloud of smoke from trying to reconcile your fixed notions about computer science with the recognition of the existence of serious research in computer science.

Joshua Goodman, a respected, established researcher in NLP who works at Microsoft Research, (which is strong despite Microsoft's generally poor software development), wrote "Comment on Language Trees and Zipping"2, a reply to Benedetto et al's letter. Goodman's incisive, eloquent, and restrained letter was rejected by Physical Review Letters.
"I first point out the inappropriateness of publishing a Letter unrelated to physics. Next, I give experimental results showing that the technique used in the Letter is 3 times worse and 17 times slower than a simple baseline, Naive Bayes. And finally, I review the literature, showing that the ideas of the Letter are not novel. I conclude by suggesting that Physical Review Letters should not publish Letters unrelated to physics."

I highly recommend anyone interested in the scientific method and experimental research design read the extended version3 of his comment.

More excerpts of the salient portions of this comment:
(emphasis mine)

Physics journals should not be publishing articles that have nothing to do with physics. It is ... completely reasonable to publish the use of non-physics techniques applied to physics in a physics journal. But this paper applies computer science techniques (gzip!) to computer science problems. This seems extremely inappropriate for a physics publication.

...

Given this argument, I don't think it is really matters whether or not the paper is a good paper--the point is, quite simply, that a good paper that has nothing to do with physics does not belong in a physics journal. As it happens, the paper is a bad paper. First of all, there are many many well known techniques in Computer Science for solving problems of this sort, some complex and some very simple. I decided to try the simplest, easiest to implement, most straightforward algorithm I could think of, an algorithm known as Naive Bayes. I implemented both the zipping algorithm and a Naive Bayes algorithm, and applied them to a similar, but standard problem, 20 Newsgroups, the goal of which is to determine the newsgroup from which a posting came. The zipping procedure is not even that much simpler: it takes about 70 lines of perl script to write the code to split the data into test and training, and an additional 50 to use the zipping method, versus an additional 70 to implement Naive Bayes (a difference of 20 lines in trivial.) The zipping method was 17 times slower and had 3 times as many errors as this simplest standard computer science algorithm.

...

Furthermore, the ideas of this paper are very well known in areas of computer science such as machine learning and statistical natural language processing. ... Even this compression idea dates back (at least) to 1995, when Ken Lang and Rich Caruana tried out the idea for doing newsgroup classification. In the end though, they ended up using Naive Bayes classifiers. They didn't bother to publish the compression idea because they thought it was better viewed as an interesting thought than as a serious classification method. Still, the idea of using compress got around a bit ... Admittedly, however, this technique is not that widely known, because computer scientists don't typically try anything that crude -- in a couple of hours (or less), we can build from scratch tools that work better than this.

Of course, this explains another reason why phyics journals should not be publishing computer science papers: they don't know the field, or the reviews, and so cannt distinguish the good from the bad, this paper being a case in point.

Why am I so bothered by this paper? Well, a big part of it has to do with the amount of press coverage it has received. The authors sent out a press release that got published in Nature Science Update (see http://www.nature.com/nsu/020121/020121-2.html) and Wired Magazine (see http://www.wired.com/news/technology/0,1282,50192,00.html) and picked up by people such as the ACM (Association for Computing Machinery) (see http://www.acm.org/technews/articles/2002-4/0208f.html#item14), who perhaps should have known better than to trust stuff from a physics journal, but made the mistake of assuming that physicists were competent to review the paper. When reputable publications ignore the existence of computer science, and assume that those without computer science training are well qualified to do research and review publications in the area, it hurts the field by allowing outdated or wrong information to be circulated ... It is also insulting.

Benedetto et al could have slunk off and saved some face, but instead penned a retort to J. Goodman's letter, entitled "On J. Goodman's Comment to [sic] Language Trees and Zipping." 4

Their reply is either a humorous serious response or a terrible joke. Its only genius is its eerie ability to simultaneously appear to be an academic troll and mere dead earnest stupidity, and its illustration of Abraham Lincoln's adage: "It is better to keep your mouth shut and to be thought a fool than to open it and to remove all doubt."

Briefly:

• Benedetto et al begin by saying that "it is beyond [their] judging possibilities" to decide whether a paper like theirs should be published in a physics journal. "If then one considers that publication on [sic] Physics Review Letters is subject to a tough peer review probably we could not be blamed for having attempted this submission."
• "Except for learning that the concept of entropy now belongs to Computer Science (sic !), [sic] there is one important point to stress:" They then proceed to describe a general situation in which their paper's treatment of entropy may be loosely applicable to physics.
• They set out to debunk the claim that their algorithm performs poorly. They download the data set used by J. Goodman in his comment and then call into question J. Goodman's ability to duplicate their results in a different (but totally congruent) domain, as they "have never published how this technique works in this case." (Numnuts, it doesn't take a supra-genius to figure out how to apply the algorithm to a similar problem!) They then describe what seems5 to be the exact method of the original paper.
• "It is important to stress that Goodman is criticizing our paper in a very odd way. He claims that our technique fails on an experiment (the subject classification) that was not described in our paper. ... It is worth to stress [sic] how on much more controlled and clean corpora, the efficiency of our method raises considerably." (You puerile, tiny souls, listen again: Most naïve algorithms can accurately classify easy data sets, even one as naïve as yours. You can't just test on a toy domain of your choosing and declare yourselves smart, it doesn't work like that.)
• They vaguely describe their experimental method, which does not include implementing Naive Bayes and testing it themselves to ensure homogenity of testing methods. From the little one can gather from their numerical results, Benedetto et al set up their experiment differently from Goodman's (including but not limited to treating the training data as the test data) and yet compared their calculated accuracy to J. Goodman's calculated accuracy for Naive Bayes in his differently structured tests. Not surprisingly, Benedetto et al's compression method, tried with each of three different compression routines, slightly outperformed Naive Bayes.
• "It would be interesting for instance to know how Naive Bayesian methods perform in the set of problems we have considered in [On Language Trees and Zipping]. ... In this case we can guess that the method would suffer the lack of long training texts and this could also be true for the problem of authorship recognition where the number of texts available per author is very limited. (Instead of wondering or guessing how Naive Bayes performs on small data sets, why don't you implement it and find out? Hint: It works better than zipping.)
By far, the most exciting part of their reply is a pseudo-scandalous republishing of an errant J. Goodman draft. This gem is the most compelling argument against Benedetto et al's form of blasé hand-waving research, but considering that his staid and specific second letter went unpublished, it is little wonder Goodman realized that Benedetto et al and the Physical Review Letters cronies would not get it. Ever more ironic is that Benedetto et al consider this a great outing of Goodman.

From the appendix to their reply:

(The letter appeared only for a few days on the site: http://research.microsoft.com/~joshuago/ and then mysteriously disappeared, substituted by a more serious comment sent to the Editors of Physical Review Letters as well as to cond-mat archives.)
```

Date: Tue, 12 Feb 2002 15:17:12 -0800
From: Joshua Goodman
To: benedetto@mat.uniroma1.it, caglioti@mat.uniroma1.it, loreto@roma1.infn.it
Subject: Open Letter to the editors of Physical Review Letters

An open letter to the editors of Physical Review Letters

Dears Sirs,

I wish to commend you on your open-minded interdisciplinary approach

to science. In particular, I was extremely impressed that you

published an article on Computational Linguistics and Machine Learning

in a journal ostensibly devoted to physics ("Language Trees and

Zipping" by Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto,

28 January 2002, (available at

http://babbage.sissa.it/abs/cond-mat/0108530) with the following

Abstract:

In this Letter we present a very general method for extracting

information from a generic string of characters, e.g., a text, a DNA

sequence, or a time series. Based on data-compression techniques, its

key point is the computation of a suitable measure of the remoteness

of two bodies of knowledge. We present the implementation of the

method to linguistic motivated problems, featuring highly accurate

results for language recognition, authorship attribution, and language

classification.

I myself attempted to publish the following letter on Physics in

several journals, including Computational Linguistics, the Machine

Learning Journal, and Computer Speech and Language:

"Measurement of Gravitational Constant Using Household Items" by

Joshua Goodman, unpublished manuscript.

In this Letter we present a method for measuring the gravitational

constant ("g") using household items, such as old shoes and a

microwave oven. Using the timer on the microwave oven, we were able

to measure the time for the old shoe to reach the ground from several

different heights. We were able to determine that "g" equals 34 feet

per second per second, plus or minus 3. This is within 10% of the

current state of the art in physics. We noticed that our curves were

not quite parabolic, and we speculate that this is evidence of a

wormhole in the region of the experiment.

One of the reviews I received was very positive ("I commend you on

this excellent research. Some might argue that it is irrelevant to a

journal such as ours, but physics, and 'g' in particular, effect us

each and every day, for instance, as we walk or pour coffee.

Furthermore, since almost all readers of this journal have had a high

school (or even college) physics class, and all of us have and use

household items such as these, it is sure to be of interest to the

readership.") Unfortunately, the other reviewers were far more

critical. One reviewer wrote "While it is clear that physicists

should be writing papers about natural language processing and

statistical modeling, this is because they are so much smarter than we

computer scientists are. The reverse is obviously not true." Another

reviewer was even too lazy to read the paper, writing "As everyone

knows, it takes dozens (or even hundreds) of authors to do any good

work in physics. I find it difficult to believe that a single

authored letter is worth reading." And another wrote "I remember from

How come neither of these are cited in this paper?" With comments

such as those, it seems that reviewers in our field are not even

competent to review physics papers, while clearly yours know enough to

review computer science papers.

Obviously, those in my field are not nearly as broad minded or well

rounded as those in yours. I commend you on your openness, and hope

that you will soon expand to cover other areas of interest, such as

economics, psychology, and even literature.

Sincerely yours,

Joshua Goodman

```

Responses to potential apologia...

Q: Why shouldn't a physics journal publish results on computer science? What if physicists find the results interesting and potentially useful to their research?
A: First of all, if physicists find research on classic problems in ENLP interesting, they should be reading the proceedings of the ACL.6 Secondly, a paper describing ENLP techniques useful to physicists could potentially appear in a physics publication, but it should be a survey paper written by someone knowledgeable in the field on ENLP, not a bad "novel" research paper written by clueless nimwits without even the simplest degree of competence in the domain they are studying.

Q: Fine, so this gzip hack is not as good as a simple method. But it works, right? Does that mean it is totally useless?
A: Let me explain why Benedetto et al's results are misleading. Although their description of their experimental design is sketchy at best and muddled by poor English, it seems that they are testing their program on the same data they are training it on. This means that they can make no meaningful statement about its predictive accuracy on novel data.

Normally, (for example, as in J. Goodman's comment) one would reserve 10% of the data set for testing, train on 90% of the data set and then test on the remaining 10%. This design would allow one to state with some confidence the expected performance of the algorithm on novel data.

Let me put it another way: Assume you were doing research and had to classify some data in a domain where you do not know the answers in advance. (For example, you were asked to write a language identification engine for Google.) Would you really trust an algorithm that is 3 times less accurate than the naïve method?

Q: But isn't something about the paper creative or interesting?
A: I would concede that the method is an intriguing one that, like their writing style or research methods, is charmingly outré. Their method is similarly borne out by practice to be crude.

If anything, there is an actual salient (but negative) result proven by Benedetto et al's work. If one considers that their compression method compares entropy on a character-level granularity, about the lowest-level granularity possible, whereas Naive Bayes can be intuitively understood as exploiting entropy at word-level granularity. So, their research shows that the author identification problem is better approached by looking at words than at characters. But you already knew that, when the answer is stated that way. One intuitively assumes that what is unique and identifying about an author is higher level than the probability distribution of character choice, even higher level than usual diction.

Although it would be natural to assume that high-level (syntatic or even semantic, if it were possible) analysis of a text will be more accurate for author identification, results of testing a well thought-out high-level indentification method would be interesting to read. If the method is competitive (results are positive), it would be interesting to know, since it be effective for future problems; If the method is not competitive (results are negative), it would be interesting to know, since this clashes with what one would strongly suspect to be true. Benedetto et al's work was worth doing but not worth publishing, since negative results are only notable if they are unexpected.

You want to see creative and interesting? Then stop believing the hype that AI is a dead field. Rapid, serious progress has been made in the field during the last decade. You just do not hear about it in Wired Magazine unless its a cute hack or an even cuter robot.

Which brings us to an interesting point: Why did Wired Magazine publish the results? If it is because they found the claimed results to be interesting scientific advances (albeit, bogus ones), I suppose they would be shocked to know how effective current AI state-of-the-art actually is.

Second generation algorithms in machine learning are making mincework of tasks once thought of as difficult. These data-driven approaches which are more interested in obtaining highly accurate methods than maintaining vestigial ties to reasoning in a "human-like" way (or, more precisely, the way humans think that humans think). AI's yesteryear reputation for effete fluffiness with soft, squishy results is still pervasive, but we, the ruthless number-driven mercenaries of current cutting-edge AI research could care less when you figure out that AI nowadays can kick your ass and steal your marbles.

It is an exciting time to be involved because there are all these amazing, effective algorithms, which are becoming deeply understood by researchers and quietly and slowly being deployed in the real world by professionals (and, potentially, amateurs). The current AI field is a lush palimpsest that makes Benedetto et al's work pale in comparison.

If you actually go read a serious paper in ENLP (aka EMNLP, Empirical Natural Language Processing, Statistical NLP, i.e. NLP based upon not a linguistic approach but instead treating language as a large data problem), you'll find that it is a fascinating and rich field with many of the algorithms much simpler, more straightforward and elegant than Benedetto et al's compression kludge. ENLP researchers spend their time asking the right questions and structuring their domain so that the answers are straightforward.

1Language Trees and Zipping
Abstract: http://babbage.sissa.it/abs/cond-mat/0108530

2Comment on Language Trees and Zipping
Postrscript: http://www.research.microsoft.com/~joshuago/physicscomment.ps

3Comment on Language Trees and Zipping (Extended version)
Postscript: http://research.microsoft.com/~joshuago/physicslongcomment.ps. or
PDF: http://arxiv.org/pdf/cond-mat/0202383

4On J. Goodman's comment to "Language Trees and Zipping"
Postscript: http://babbage.sissa.it/abs/cond-mat/0203275 or
PDF: http://babbage.sissa.it/pdf/cond-mat/0203275

5Benedetto et al's terminology is a charmingly outré but ultimately colorless and ambiguous idiolect. This, combined with their general attitude of cavalier glossing, makes for confusing prose that clouds the methodology employed by the authors.

6ACL Anthology: http://acl.ldc.upenn.edu/