STOP! THE ABOVE WAS A TRICK WRITEUP!
JerboaKolinowski, you should know better than to bait them with that tripe...

Before proceeding to show how the above research is less a "cool hack" than "poisonous disinformation", let me explain the origin of my venom. While I could stomach Benedetto et al's "Language Trees and Zipping",1 reading it as physicists poorly but earnestly playing computer scientists, I could take no more after reading their facile and, moreover, smug reply to Joshua Goodman's open letter on their faulty methodology, where they write: "the concept of entropy now belongs to Computer Science (sic !)" [sic]. (Yes, you didn't misread that; they actually call it "Computer Science (sic !)")

Let me translate that for you, as the last three words embody the tone of their "research" letter: "Computer science is an actual science? You mean we cannot ignore the fundamentals of experimental research design and get away with our childish fucking antics? Says you!"

How dare you impugn the validity of computer science as a scientific discipline?

You, you miserable worms, who cannot even conduct proper and meaningful research!

For that, I will tear your arms off, starting ... now:

The most important flaw in their research can be inferred from the above writeup even by those with little background in NLP. Examine the experimental design, as well as the results presented. Benedetto et al would like you to think that their algorithm did "well." Well, compared to what? Apparently, compared to their intuitive notion of what "well" is, on a data set of their choosing.

The accuracy of a classification algorithm means nothing by itself on one specific domain, especially a simple domain with an absurdly low number of test instances. You cannot say an algorithm in a vacuum is "good". "Good" compared to what? The claim derives its validity from context, namely from comparison to a baseline algorithm: a simple, straightforward algorithm for the given problem. An algorithm is "good" if it is competitive with other algorithms for the same problem or, if it solves a novel problem, at least competitive with the baseline.

Here is how the experiment would have proceeded if Benedetto et al knew even the slightest thing about computer science:

  • Implement the compression kludge algorithm.
  • Implement the baseline algorithm, say a Naive Bayes classifier (a sketch of both appears after this list).
  • Test both algorithms on your made-up data set with twenty test instances.
  • Obtain 95-99% accuracy with the kludge algorithm. Obtain 99% (or perhaps even 100%) accuracy with the baseline algorithm.
  • Discard the made-up data set in favor of the difficult, large data sets used by other people researching competitive algorithms.
  • Disappear in a cloud of smoke while trying to reconcile your fixed notions about computer science with the existence of serious research in the field.
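
For concreteness, here is a minimal sketch in Python of the kind of head-to-head comparison meant above. The zip_distance function follows the general "compress the test text against each candidate's training text" idea behind the zipping method rather than Benedetto et al's exact recipe, the Naive Bayes is the plainest unigram version with add-one smoothing, and the toy corpus is purely hypothetical; a real comparison would use a standard data set such as 20 Newsgroups with a proper held-out test set.

    import gzip
    import math
    from collections import Counter

    def gzip_size(text):
        # Length in bytes of the gzip-compressed text: a crude entropy estimate.
        return len(gzip.compress(text.encode("utf-8")))

    def zip_distance(train_text, test_text):
        # Extra bytes needed to compress the test text when appended to the
        # training text -- the general "relative entropy by concatenation" idea.
        return gzip_size(train_text + "\n" + test_text) - gzip_size(train_text)

    def classify_by_zipping(train_texts, test_text):
        # Assign the label whose training text "explains" the test text best.
        return min(train_texts,
                   key=lambda label: zip_distance(train_texts[label], test_text))

    def train_naive_bayes(train_texts):
        # Unigram counts per label, for add-one-smoothed Naive Bayes.
        counts = {label: Counter(text.split()) for label, text in train_texts.items()}
        vocab = {w for c in counts.values() for w in c}
        return counts, vocab

    def classify_naive_bayes(model, test_text):
        counts, vocab = model
        def log_prob(label):
            c, total = counts[label], sum(counts[label].values())
            return sum(math.log((c[w] + 1) / (total + len(vocab)))
                       for w in test_text.split())
        return max(counts, key=log_prob)

    # Hypothetical toy corpus; a real evaluation needs a large, standard data set.
    train_texts = {
        "english": "the cat sat on the mat and looked at the dog",
        "italian": "il gatto era seduto sul tappeto e guardava il cane",
    }
    test_text = "the dog sat on the mat"
    print(classify_by_zipping(train_texts, test_text))                      # likely "english"
    print(classify_naive_bayes(train_naive_bayes(train_texts), test_text))  # likely "english"

The point of the baseline is not that Naive Bayes is the last word; it is that any claimed accuracy only means something when set against a number like this one, obtained on the same data under the same test conditions.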

Joshua Goodman, a respected, established researcher in NLP who works at Microsoft Research (which is strong despite Microsoft's generally poor software development), wrote "Comment on Language Trees and Zipping"2, a reply to Benedetto et al's letter. Goodman's incisive, eloquent, and restrained letter was rejected by Physical Review Letters. From its abstract:
"I first point out the inappropriateness of publishing a Letter unrelated to physics. Next, I give experimental results showing that the technique used in the Letter is 3 times worse and 17 times slower than a simple baseline, Naive Bayes. And finally, I review the literature, showing that the ideas of the Letter are not novel. I conclude by suggesting that Physical Review Letters should not publish Letters unrelated to physics."

I highly recommend that anyone interested in the scientific method and experimental research design read the extended version3 of his comment.

More excerpts of the salient portions of this comment:
(emphasis mine)

Physics journals should not be publishing articles that have nothing to do with physics. It is ... completely reasonable to publish the use of non-physics techniques applied to physics in a physics journal. But this paper applies computer science techniques (gzip!) to computer science problems. This seems extremely inappropriate for a physics publication.

...

Given this argument, I don't think it really matters whether or not the paper is a good paper--the point is, quite simply, that a good paper that has nothing to do with physics does not belong in a physics journal. As it happens, the paper is a bad paper. First of all, there are many many well known techniques in Computer Science for solving problems of this sort, some complex and some very simple. I decided to try the simplest, easiest to implement, most straightforward algorithm I could think of, an algorithm known as Naive Bayes. I implemented both the zipping algorithm and a Naive Bayes algorithm, and applied them to a similar, but standard problem, 20 Newsgroups, the goal of which is to determine the newsgroup from which a posting came. The zipping procedure is not even that much simpler: it takes about 70 lines of perl script to write the code to split the data into test and training, and an additional 50 to use the zipping method, versus an additional 70 to implement Naive Bayes (a difference of 20 lines is trivial). The zipping method was 17 times slower and had 3 times as many errors as this simplest standard computer science algorithm.

...

Furthermore, the ideas of this paper are very well known in areas of computer science such as machine learning and statistical natural language processing. ... Even this compression idea dates back (at least) to 1995, when Ken Lang and Rich Caruana tried out the idea for doing newsgroup classification. In the end though, they ended up using Naive Bayes classifiers. They didn't bother to publish the compression idea because they thought it was better viewed as an interesting thought than as a serious classification method. Still, the idea of using compress got around a bit ... Admittedly, however, this technique is not that widely known, because computer scientists don't typically try anything that crude -- in a couple of hours (or less), we can build from scratch tools that work better than this.

Of course, this explains another reason why physics journals should not be publishing computer science papers: they don't know the field, or the reviewers, and so cannot distinguish the good from the bad, this paper being a case in point.

Why am I so bothered by this paper? Well, a big part of it has to do with the amount of press coverage it has received. The authors sent out a press release that got published in Nature Science Update (see http://www.nature.com/nsu/020121/020121-2.html) and Wired Magazine (see http://www.wired.com/news/technology/0,1282,50192,00.html) and picked up by people such as the ACM (Association for Computing Machinery) (see http://www.acm.org/technews/articles/2002-4/0208f.html#item14), who perhaps should have known better than to trust stuff from a physics journal, but made the mistake of assuming that physicists were competent to review the paper. When reputable publications ignore the existence of computer science, and assume that those without computer science training are well qualified to do research and review publications in the area, it hurts the field by allowing outdated or wrong information to be circulated ... It is also insulting.


Benedetto et al could have slunk off and saved some face, but instead penned a retort to J. Goodman's letter, entitled "On J. Goodman's Comment to [sic] Language Trees and Zipping."4

Their reply is either an unintentionally humorous serious response or a terrible joke. Its only genius is its eerie ability to appear simultaneously as an academic troll and as dead-earnest stupidity, and its illustration of Abraham Lincoln's adage: "It is better to keep your mouth shut and to be thought a fool than to open it and to remove all doubt."

Briefly:

  • Benedetto et al begin by saying that "it is beyond [their] judging possibilities" to decide whether a paper like theirs should be published in a physics journal. "If then one considers that publication on [sic] Physics Review Letters is subject to a tough peer review probably we could not be blamed for having attempted this submission."
  • "Except for learning that the concept of entropy now belongs to Computer Science (sic !), [sic] there is one important point to stress:" They then proceed to describe a general situation in which their paper's treatment of entropy may be loosely applicable to physics.
  • They set out to debunk the claim that their algorithm performs poorly. They download the data set used by J. Goodman in his comment and then call into question J. Goodman's ability to duplicate their results in a different (but totally congruent) domain, as they "have never published how this technique works in this case." (Numnuts, it doesn't take a supra-genius to figure out how to apply the algorithm to a similar problem!) They then describe what seems5 to be the exact method of the original paper.
  • "It is important to stress that Goodman is criticizing our paper in a very odd way. He claims that our technique fails on an experiment (the subject classification) that was not described in our paper. ... It is worth to stress [sic] how on much more controlled and clean corpora, the efficiency of our method raises considerably." (You puerile, tiny souls, listen again: Most naïve algorithms can accurately classify easy data sets, even one as naïve as yours. You can't just test on a toy domain of your choosing and declare yourselves smart, it doesn't work like that.)
  • They vaguely describe their experimental method, which does not include implementing Naive Bayes and testing it themselves to ensure homogeneity of testing methods. From the little one can gather from their numerical results, Benedetto et al set up their experiment differently from Goodman's (including, but not limited to, treating the training data as the test data) and yet compared their calculated accuracy to J. Goodman's calculated accuracy for Naive Bayes in his differently structured tests. Not surprisingly, Benedetto et al's compression method, tried with each of three different compression routines, slightly outperformed Naive Bayes.
  • "It would be interesting for instance to know how Naive Bayesian methods perform in the set of problems we have considered in [On Language Trees and Zipping]. ... In this case we can guess that the method would suffer the lack of long training texts and this could also be true for the problem of authorship recognition where the number of texts available per author is very limited. (Instead of wondering or guessing how Naive Bayes performs on small data sets, why don't you implement it and find out? Hint: It works better than zipping.)

By far the most exciting part of their reply is a pseudo-scandalous republishing of an errant J. Goodman draft. This gem is the most compelling argument against Benedetto et al's brand of blasé hand-waving research, but considering that his staid and specific second letter went unpublished, it is little wonder Goodman realized that Benedetto et al and the Physical Review Letters cronies would not get it. Even more ironic is that Benedetto et al consider this a great outing of Goodman.

From the appendix to their reply:

(The letter appeared only for a few days on the site: http://research.microsoft.com/~joshuago/ and then mysteriously disappeared, substituted by a more serious comment sent to the Editors of Physical Review Letters as well as to cond-mat archives.)


Date: Tue, 12 Feb 2002 15:17:12 -0800
From: Joshua Goodman 
To: benedetto@mat.uniroma1.it, caglioti@mat.uniroma1.it, loreto@roma1.infn.it
Subject: Open Letter to the editors of Physical Review Letters


An open letter to the editors of Physical Review Letters

Dears Sirs,

I wish to commend you on your open-minded interdisciplinary approach to science. In particular, I was extremely impressed that you published an article on Computational Linguistics and Machine Learning in a journal ostensibly devoted to physics ("Language Trees and Zipping" by Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto, 28 January 2002, available at http://babbage.sissa.it/abs/cond-mat/0108530) with the following Abstract:

 In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the implementation of the method to linguistic motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification.

I myself attempted to publish the following letter on Physics in several journals, including Computational Linguistics, the Machine Learning Journal, and Computer Speech and Language:

"Measurement of Gravitational Constant Using Household Items" by Joshua Goodman, unpublished manuscript.

 In this Letter we present a method for measuring the gravitational constant ("g") using household items, such as old shoes and a microwave oven. Using the timer on the microwave oven, we were able to measure the time for the old shoe to reach the ground from several different heights. We were able to determine that "g" equals 34 feet per second per second, plus or minus 3. This is within 10% of the current state of the art in physics. We noticed that our curves were not quite parabolic, and we speculate that this is evidence of a wormhole in the region of the experiment.

One of the reviews I received was very positive ("I commend you on this excellent research. Some might argue that it is irrelevant to a journal such as ours, but physics, and 'g' in particular, effect us each and every day, for instance, as we walk or pour coffee. Furthermore, since almost all readers of this journal have had a high school (or even college) physics class, and all of us have and use household items such as these, it is sure to be of interest to the readership.") Unfortunately, the other reviewers were far more critical. One reviewer wrote "While it is clear that physicists should be writing papers about natural language processing and statistical modeling, this is because they are so much smarter than we computer scientists are. The reverse is obviously not true." Another reviewer was even too lazy to read the paper, writing "As everyone knows, it takes dozens (or even hundreds) of authors to do any good work in physics. I find it difficult to believe that a single authored letter is worth reading." And another wrote "I remember from my coursework something about Newton's Method and Gradient Descent. How come neither of these are cited in this paper?" With comments such as those, it seems that reviewers in our field are not even competent to review physics papers, while clearly yours know enough to review computer science papers.

Obviously, those in my field are not nearly as broad minded or well rounded as those in yours. I commend you on your openness, and hope that you will soon expand to cover other areas of interest, such as economics, psychology, and even literature.

Sincerely yours,

Joshua Goodman

Responses to potential apologia...

Q: Why shouldn't a physics journal publish results on computer science? What if physicists find the results interesting and potentially useful to their research?
A: First of all, if physicists find research on classic problems in ENLP interesting, they should be reading the proceedings of the ACL.6 Secondly, a paper describing ENLP techniques useful to physicists could potentially appear in a physics publication, but it should be a survey paper written by someone knowledgeable in the field of ENLP, not a bad "novel" research paper written by clueless nitwits without even the simplest degree of competence in the domain they are studying.

Q: Fine, so this gzip hack is not as good as a simple method. But it works, right? Does that mean it is totally useless?
A: Let me explain why Benedetto et al's results are misleading. Although their description of their experimental design is sketchy at best and muddled by poor English, it seems that they are testing their program on the same data they are training it on. This means that they can make no meaningful statement about its predictive accuracy on novel data.

Normally (as in J. Goodman's comment, for example), one would reserve 10% of the data set for testing, train on the other 90%, and then evaluate on the held-out 10%. This design allows one to state with some confidence the expected performance of the algorithm on novel data.
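
In code, this discipline amounts to one shuffle and one slice. The sketch below assumes a list of (label, text) pairs and placeholder train() and classify() functions, since the details of the classifier being evaluated do not matter here.

    import random

    def split_train_test(documents, test_fraction=0.1, seed=0):
        # Hold out a fraction of the labelled documents; train only on the rest,
        # so the reported accuracy says something about performance on novel data.
        docs = list(documents)
        random.Random(seed).shuffle(docs)
        cut = int(len(docs) * (1 - test_fraction))
        return docs[:cut], docs[cut:]

    # Hypothetical usage, with train() and classify() standing in for whatever
    # method is under evaluation:
    #   train_docs, test_docs = split_train_test(documents)
    #   model = train(train_docs)
    #   accuracy = sum(classify(model, text) == label
    #                  for label, text in test_docs) / len(test_docs)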

Let me put it another way: Assume you were doing research and had to classify some data in a domain where you do not know the answers in advance. (For example, you were asked to write a language identification engine for Google.) Would you really trust an algorithm that is 3 times less accurate than the naïve method?

Q: But isn't something about the paper creative or interesting?
A: I would concede that the method is an intriguing one that, like their writing style and research methods, is charmingly outré. Like them, it is also borne out by practice to be crude.

If anything, there is one actual salient (but negative) result established by Benedetto et al's work. Their compression method compares entropy at character-level granularity, about the lowest-level granularity possible, whereas Naive Bayes can be intuitively understood as exploiting entropy at word-level granularity. So their research shows that the author identification problem is better approached by looking at words than at characters. But you already knew that, once the answer is stated that way. One intuitively assumes that what is unique and identifying about an author sits at a higher level than the probability distribution of character choice, higher even than usual diction.
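
To make the granularity point concrete, here is a trivial sketch of the two views of the same sentence: the character distribution that a byte-oriented compressor like gzip effectively works from, and the word distribution that a unigram Naive Bayes model works from. The example sentence is arbitrary.

    from collections import Counter

    sentence = "It was the best of times, it was the worst of times."

    # Character-level view: roughly the granularity a general-purpose compressor sees.
    char_counts = Counter(sentence.lower())

    # Word-level view: the granularity a unigram Naive Bayes classifier exploits.
    word_counts = Counter(sentence.lower().split())

    print(char_counts.most_common(3))   # spaces and common letters dominate
    print(word_counts.most_common(3))   # e.g. [('it', 2), ('was', 2), ('the', 2)]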

Although it would be natural to assume that high-level (syntactic or even semantic, were it possible) analysis of a text would be more accurate for author identification, the results of testing a well thought-out high-level identification method would be interesting to read. If the method is competitive (the results are positive), that is worth knowing, since it could be effective for future problems; if it is not competitive (the results are negative), that is also worth knowing, since it clashes with what one would strongly suspect to be true. Benedetto et al's work, by contrast, was worth doing but not worth publishing, since negative results are only notable if they are unexpected.

You want to see creative and interesting? Then stop believing the hype that AI is a dead field. Rapid, serious progress has been made in the field during the last decade. You just do not hear about it in Wired Magazine unless it's a cute hack or an even cuter robot.

Which brings us to an interesting point: Why did Wired Magazine publish the results? If it is because they found the claimed results to be interesting scientific advances (albeit bogus ones), I suppose they would be shocked to learn how effective the current AI state of the art actually is.

Second-generation algorithms in machine learning are making mincemeat of tasks once thought of as difficult. These data-driven approaches are more interested in obtaining highly accurate methods than in maintaining vestigial ties to reasoning in a "human-like" way (or, more precisely, the way humans think that humans think). AI's yesteryear reputation for effete fluffiness with soft, squishy results is still pervasive, but we, the ruthless number-driven mercenaries of current cutting-edge AI research, couldn't care less when you figure out that AI nowadays can kick your ass and steal your marbles.

It is an exciting time to be involved because there are all these amazing, effective algorithms, which are becoming deeply understood by researchers and quietly and slowly being deployed in the real world by professionals (and, potentially, amateurs). The current AI field is a lush palimpsest that makes Benedetto et al's work pale in comparison.

If you actually go read a serious paper in ENLP (aka EMNLP, Empirical Natural Language Processing, or Statistical NLP, i.e., NLP based not on a linguistic approach but on treating language as a large data problem), you'll find that it is a fascinating and rich field, with many algorithms far simpler, more straightforward, and more elegant than Benedetto et al's compression kludge. ENLP researchers spend their time asking the right questions and structuring their domain so that the answers are straightforward.


1Language Trees and Zipping
Abstract: http://babbage.sissa.it/abs/cond-mat/0108530

2Comment on Language Trees and Zipping
Postscript: http://www.research.microsoft.com/~joshuago/physicscomment.ps

3Comment on Language Trees and Zipping (Extended version)
Postscript: http://research.microsoft.com/~joshuago/physicslongcomment.ps or
PDF: http://arxiv.org/pdf/cond-mat/0202383

4On J. Goodman's comment to "Language Trees and Zipping"
Postscript: http://babbage.sissa.it/abs/cond-mat/0203275 or
PDF: http://babbage.sissa.it/pdf/cond-mat/0203275

5Benedetto et al's terminology is a charmingly outré but ultimately colorless and ambiguous idiolect. This, combined with their general attitude of cavalier glossing, makes for confusing prose that clouds the methodology employed by the authors.

6ACL Anthology: http://acl.ldc.upenn.edu/