In computational linguistics, one area of study is the development of methods for automatically grouping and classifying texts. This may involve identifying the author, the body of knowledge, or the actual language that a text should be identified with. As it turns out, the humble Unix compression utility gzip can provide us with an effective tool in this endeavour.

When we look at two texts, we have an intuitive notion of what we could call the "semantic distance" between them - two works by Shakespeare are obviously (to us) more closely related than, say, a work by Shakespeare and one by Theonomist. We use subtle (or not so subtle) clues in the language to make this judgement.

So an interesting problem for computational linguistics is to find ways for a computer to make similar judgements. To do this, we need an algorithmic method for evaluating semantic distance. Often, word and phrase counting is used for this purpose, and can be quite effective, because different authors, different bodies of knowledge, and different languages all have their own statistical signatures: unique(ish) patterns in the relative frequencies of different textual elements.
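
For instance, the crudest possible signature is just a table of word frequencies. A minimal sketch in Python (the sample sentence is purely illustrative):

    from collections import Counter

    def signature(text, n=20):
        """A crude statistical signature: the n most frequent words."""
        return Counter(text.lower().split()).most_common(n)

    # Two texts with similar signatures are, on this crude measure, "close".
    print(signature("to be or not to be that is the question", n=3))
    # [('to', 2), ('be', 2), ('or', 1)]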

In a recent paper [1], the authors use gzip to create a computable measure of semantic distance, based on what they call the relative entropy between two corpora (that is, collections of texts), roughly as follows.

Suppose we have two corpora, A and B. The standard notation for the length of A is |A|. We use gzip to compress A, producing the file we'll call z(A). Since gzip is a compression tool, it's very likely that |A| > |z(A)|, which is just to say that the length of A is greater than the length of z(A), the compressed version.
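
In Python this is a one-liner (the corpus filename here is hypothetical):

    import gzip

    A = open("latin_corpus.txt", "rb").read()   # hypothetical corpus file
    zA = gzip.compress(A)                       # z(A)
    print(len(A), len(zA))                      # |A| > |z(A)| for ordinary text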

Now we take our other collection, B, and take one text from it. We'll call this 'b'. We stick b on the end of A, call this A+b, and compress that, giving us z(A+b).

Now we can look at the difference in length between z(A) and z(A+b). That's the simple difference |z(A+b)| - |z(A)|, which we can write as D_Ab.
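
A sketch of D_Ab in Python, assuming the corpus and text are available as bytes:

    import gzip

    def zlen(data):
        """|z(X)|: the length in bytes of the gzip-compressed data."""
        return len(gzip.compress(data))

    def D(corpus, text):
        """D_Ab = |z(A+b)| - |z(A)|: the extra compressed bytes that
        appending text b to corpus A costs."""
        return zlen(corpus + text) - zlen(corpus)

(One practical caveat: gzip's DEFLATE format uses a window of only 32 KB, so for a large corpus A only its tail actually primes the compressor, and the test text b should be small compared to that window.)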

So far so good.

Now do the same for B and B+b, giving us another simple difference: |z(B+b)| - |z(B)|, which we can call D_Bb.

We can now use the difference between these two figures to estimate how "semantically close" b is to A.

Here's why: suppose A is a collection of Latin scholarship, and B is the text of all the nodes by theonomist. When we use gzip to compress A, it works roughly by building a dictionary of the most common textual elements in the texts comprising A, and then outputting this dictionary along with a set of references into it which, when resolved, give us back the original text. Because the texts in A have a basic similarity, the dictionary created will tend to become 'optimal' for this kind of text. But the nodes of E2 scholar theonomist will tend to use different textual elements, so when we use the dictionary appropriate to the Latin scholars to compress theo's text, we won't get such efficient compression. In other words, we'd expect D_Ab to be larger than D_Bb.
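
We can see this effect directly with zlib's preset-dictionary feature, which lets us prime the compressor with a chosen 'dictionary' (the sample strings below are invented purely for illustration):

    import zlib

    latin = b"lorem ipsum dolor sit amet consectetur adipiscing elit " * 50
    english = b"the quick brown fox jumps over the lazy dog once more " * 50
    sample = b"the quick brown fox jumps over the lazy dog"

    def size_with_dict(data, dictionary):
        """Compressed size of data when the compressor is primed
        with the given preset dictionary."""
        c = zlib.compressobj(zdict=dictionary)
        return len(c.compress(data) + c.flush())

    print(size_with_dict(sample, english))  # small: the dictionary 'fits'
    print(size_with_dict(sample, latin))    # larger: wrong dictionary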

So we can use the difference D_Ab - D_Bb to estimate the overall similarity of the patterns of textual elements in b to those in A. Since the actual figure will also depend on the length of the texts we start with, we can divide the whole thing by D_Bb to even things out, giving the formula (D_Ab - D_Bb) / D_Bb.

We can also compute the mirror-image quantity, (D_Ba - D_Aa) / D_Aa, using a text 'a' from A, to estimate its closeness to theo's nodes, which comprise B.

Adding these two together, we get a generic formula for the 'relative entropy' between our two corpora, A and B, which is written S_AB:

S_AB  =  (D_Ab - D_Bb) / D_Bb  +  (D_Ba - D_Aa) / D_Aa
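
As a sketch, here is the whole measure in Python (the helpers from earlier are redefined so the block stands alone):

    import gzip

    def zlen(x):
        """|z(X)|: length in bytes of the gzip-compressed data."""
        return len(gzip.compress(x))

    def D(corpus, text):
        """D_Ab = |z(A+b)| - |z(A)|."""
        return zlen(corpus + text) - zlen(corpus)

    def S(A, B, a, b):
        """S_AB as above: a is a text drawn from corpus A, b one from B."""
        return (D(A, b) - D(B, b)) / D(B, b) + (D(B, a) - D(A, a)) / D(A, a)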

Using these formulae, the authors of the paper referenced above achieved some pretty startling results. Taking texts by great Italian authors from past centuries, they attempted to identify the author by looking for the lowest distance (D_Ab - D_Bb) / D_Bb between the text in question (b) and the body of each candidate author's other work (A), producing the following table:


    Author          Texts   Identified correctly
    --------------------------------------------
    Alighieri           8           8
    D'Annunzio          4           4
    Deledda            15          15
    Fogazzaro           5           4
    Guicciardini        6           5
    Machiavelli        12          12
    Manzoni             4           3
    Pirandello         11          11
    Salgari            11          10
    Svevo               5           5
    Verga               9           7
    --------------------------------------------
    Totals:            90          84

In all but one case, wherever the author was not identified correctly, taking the next closest 'distance' produced the correct result.
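
A minimal sketch of the attribution scheme: since D_Bb is fixed for a given text b, ranking the candidate corpora by the normalized ratio is equivalent to ranking them by the raw D_Ab, which is what this does (the author names and corpora are supplied by the caller):

    import gzip

    def zlen(x):
        """|z(X)|: length in bytes of the gzip-compressed data."""
        return len(gzip.compress(x))

    def D(corpus, text):
        """D_Ab = |z(A+b)| - |z(A)|."""
        return zlen(corpus + text) - zlen(corpus)

    def attribute(text, corpora):
        """Given a mapping of author name -> corpus bytes, return the
        author whose corpus 'extends' most cheaply to cover the text."""
        return min(corpora, key=lambda name: D(corpora[name], text))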

Not satisfied with this, they applied their other formula (for S_AB) to over 50 Euro-Asiatic translations of the Universal Declaration of Human Rights (possibly the most widely translated document in the world) to compile a "distance matrix", containing all the distances S_AB for each distinct pair of translations.

Feeding this matrix into an algorithm normally used for constructing phylogenetic trees, they were able to produce a 'language tree' that exhibited the usual groupings made by linguists: Romance, Celtic, Germanic, Finno-Ugrian, Slavic, Baltic and Altaic. In line with orthodoxy, their method isolated the Basque language and the Afro-Asiatic language Maltese (the latter isolated, presumably, since they were considering no other African languages).
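
Here's a sketch of that whole pipeline, with scipy's off-the-shelf agglomerative clustering standing in for the phylogenetic-tree algorithm the authors actually used, a hypothetical file layout for the translations, and each whole translation serving as its own sample text (a simplification):

    import gzip
    from itertools import combinations
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    def zlen(x):
        return len(gzip.compress(x))

    def D(corpus, text):
        return zlen(corpus + text) - zlen(corpus)

    def S(A, B, a, b):
        return (D(A, b) - D(B, b)) / D(B, b) + (D(B, a) - D(A, a)) / D(A, a)

    # Hypothetical file layout: one UDHR translation per language.
    names = ["english", "french", "german", "dutch", "italian"]
    texts = {n: open(f"udhr/{n}.txt", "rb").read() for n in names}

    # Condensed distance vector over all distinct pairs of translations.
    dists = [S(texts[x], texts[y], texts[x], texts[y])
             for x, y in combinations(names, 2)]

    # Cluster the distance matrix and draw the resulting 'language tree'.
    dendrogram(linkage(dists, method="average"), labels=names)
    plt.show()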

And all done with gzip!


1. Dario Benedetto, Emanuele Caglioti, Vittorio Loreto, "Language Trees and Zipping", http://xxx.lanl.gov/abs/cond-mat/0108530


Regarding the writeup below, perhaps I should clarify the issues as I see them.

Firstly, I don't think it is seriously in doubt that the technique of Benedetto et al. can produce the results described above. I feel that this alone is enough to absolve me of blame for bringing the results to the attention of the noding public. (In case it has escaped attention, the title of the writeup places the emphasis on gzip, not on "semantic distance", etc.)

Secondly, it is a mistake to expect a close fit between the significance of scientific work as seen from the journalistic point of view and as seen from the scientific one. Everyone knows the press are generally hopeless at science (I hope!)

Finally, I am forced to admit chronic amusement at the fact that (the no doubt eminent) Joshua Goodman's letter, complaining (at length, though without benefit of the adventurous typography seen below) about the propriety of publication, was itself rejected.

In my view, Benedetto et al.'s result, while it may not be a significant advance in its field, has some intrinsic interest, is worth publishing, and seems not to have been published before. Perhaps if those readers who were upset by its publication did not attach so much importance to it themselves, the unfortunate controversy described below would not have occurred. Even our (now anonymous) critic below admits his ire is inspired more by the authors' response to Goodman's outraged critique than by the actual subject of my writeup. This outrage, it seems, is directly proportional to the professional vanity and self-importance of the outragee. Dare we say boundary posturing?