Or, the Pointless Pseudonym

Did Shakespeare really write Shakespeare? Or did Marlowe? Or how about Bacon? And those suspicious letters to your great-great-grandfather, signed "kiss-kiss, Queen Victoria?" Priceless epistolary artifact, or fraudulent rubbish?

The question of authorial attribution is often a difficult one to answer. The world has been duped before, with Hitler diaries, false speeches, and other forgeries. Likewise, the occasional unsigned postscript or essay pops up, supposed to have been penned by a Milton or a Swift, and in the 19th Century a significant portion of journalism went without a by-line. The quest of authorial attribution specialists--some call themselves forensic linguists--is to devise ways of examining the works of an unknown author and either matching them to a name, or disproving the supposed connection.

It's extraordinarily difficult. There are a number of methods and practices, usually best employed in combination. What follows is a brief explanation of just a few of them, perhaps more interesting for their approach than the accuracy of their results.

But first--throw out any ideas of carbon dating the paper and such. This isn't the Shroud of Turin we're talking about. Such tests can tell you whether the timing is right, but they mean nothing in terms of the words written. All you have is the text and a few surrounding details: publication date, publisher, journal, and so on. Most of the time, you're obviously not going to have the original, anyway.

External Evidence

A luxury in this field. If a researcher is lucky--and willing to spend outrageous amounts of time hunting through reams of documentation in order to make that luck--provenance will come in the form of a credible outside reference.

In the instance of an essay thought to be by Milton, one might find it referenced by him in another of his works, already accepted as authentic. As with the following made-up example:

Dear Andrew: You dull-witted clod, you've gone and sent my essay on the value of hitting Catholics with bricks to the printer's before I could sign it. I fear you shall never live up to your name.

~John Milton

This is platinum to a text authenticator. An author claiming ownership always lends worth to an argument.

Alternately, the work may be attributed to the author in question by a contemporary or scholar to whom the benefit of the doubt might just be given--as in this also obviously false example:

March the 4th-sent Milton's essay on hitting Catholics with bricks to printer without signature. Blind old fool wrote me nasty note. Tomorrow, will slightly rearrange his furniture without telling him.

~Andrew Marvell

Would that it were always so easy. The corroborating source may not be so close to the subject (Marvell was Milton's assistant at one point), and then of course one must consider the authenticity of the reference itself. If the first instance is platinum, this is gold. But the precious metals analogy typically breaks down the further away in place, person, and time you get from the target.

Linguistic Analysis

If it looks like Milton, and smells like Milton, it's Milton...right? Wrong. Or rather, maybe. The intuitive aspect to authorial attribution is not to be discounted--it's often a good enough place to start--but it certainly won't hold up in court. Any modern student of Milton's style, provided he could get his hands on some three-hundred-year-old paper and ink, might make a decent forgery that would pass a first or second glance. It takes stronger stuff.

But this is where things start getting very tricky indeed, and an in-depth explanation requires a knowledge of statistics your humble narrator does not possess. For my own sake, I'm going to keep it simple.

Theorists suggest that an author, as an individual mind with an individual idiolect, will consequently have an individual style, and what forensic linguists call a wordprint. These same linguists have developed several strategies of textual analysis to lift that print--and word frequency is of huge importance. All methods of counting involve taking samples of the author's known works, to start with. The samples and sample sizes (usually given in number of words) differ depending on the textual format; taking both essays and poems will likely skew the results, and the size should be consistent text to text. In addition, a sample should be small enough to remain manageable, and large enough to be representative. One might also decide that sticking to narration is the best course, discounting dialogue (in the case of novels, for example), since dialogue usually deviates sharply in style from the overall authorial voice. As I said, it's complicated.
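To make that sampling step concrete, here's a minimal sketch in Python. The crude tokenizer and the 2,000-word sample size are arbitrary choices of mine, not a standard:

    import re

    def word_samples(text, size=2000):
        # Split a text into equal-sized word samples. Real studies
        # argue at length over what counts as a word; this is crude.
        words = re.findall(r"[a-z']+", text.lower())
        # Drop the ragged tail so every sample is the same size,
        # keeping samples comparable text to text.
        return [words[i:i + size]
                for i in range(0, len(words) - size + 1, size)]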

Methods of counting include (but are not limited to) the following--a rough code sketch of all three comes after the list:

  1. Single words: One has the suspicion that Milton uses the word "bricks" with alarming frequency. Also "brimstone," "darkness," etc. One goes through the selections and counts them.
  2. Sequences: Instead of just "brimstone," which you're finding to be entirely too common, you use "fiery brimstone," or "oppressive darkness"--phrases that repeat within an author's work, and that you think make the author stand out.
  3. Collocation: How often words appear near each other can show you the connections an author is making in his mind, and so might be an excellent indicator of a wordprint. If "bricks" and "Catholics" keep showing up within a few words of each other, you're in business. There are a lot of caveats here: a fifty-word span probably reveals a lot less than a four-word span, but a six-word span might show a great deal more than four.
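And here is that promised sketch: toy versions of all three counts, working on the tokenized samples from the earlier snippet. The span size and the example words are placeholders of mine:

    from collections import Counter

    def single_word_counts(words):
        # Method 1: raw frequency of each single word in a sample.
        return Counter(words)

    def sequence_counts(words, n=2):
        # Method 2: frequency of n-word sequences ("fiery brimstone").
        return Counter(tuple(words[i:i + n])
                       for i in range(len(words) - n + 1))

    def collocation_count(words, a, b, span=6):
        # Method 3: how often b appears within `span` words of a.
        hits = 0
        for i, w in enumerate(words):
            if w == a and b in words[max(0, i - span):i + span + 1]:
                hits += 1
        return hits

    # e.g. collocation_count(sample, "bricks", "catholics", span=6)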

Once you have done your counting of known texts, and feel you have a fairly strong wordprint, you can get started. If you plan on comparing a text to an author's body of work, you must first be sure you can tell that author's body of work from everyone else's. You're going to have to run the same counting tests across the board and make sure the other authors aren't running around in Milton-masks. If your tests don't distinguish Milton from Donne, there's little point in using them to attribute our unclaimed essay. This is something like a police lineup--you gather the usual suspects, and try to pick out the one you're looking for.

Right. You've taken three works from each of five authors, and run the numbers. In an ideal universe, the works would all "cluster" together on a chart or graph according to author. Milton's stuff would be grouped close together, and distinguishably far away from Donne's. An IDEAL universe. Bear in mind ours is anything but.

If Fate is on your side, you have something you can use as a wordprint. Time to throw the unknown essay into the mix. Run the same tests, and if the essay clusters with Milton, you have linguistic evidence of its origin. Highly contestable evidence, but more than you had before you started.
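A toy version of that lineup, assuming each text has already been reduced to a list of relative frequencies over some shared word list. Real studies use far fancier multivariate statistics than this nearest-pile comparison:

    from collections import Counter

    def profile(words, vocabulary):
        # Relative frequency of each vocabulary word in one text.
        counts = Counter(words)
        return [counts[w] / len(words) for w in vocabulary]

    def distance(p, q):
        # Manhattan distance between two profiles; one choice of many.
        return sum(abs(x - y) for x, y in zip(p, q))

    def nearest_author(unknown, known):
        # `known` maps author name -> list of profiles of known texts.
        # Attribute the unknown text to whichever author's known texts
        # sit closest on average--the "cluster" it falls into.
        def avg_dist(a):
            return sum(distance(unknown, p) for p in known[a]) / len(known[a])
        return min(known, key=avg_dist)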

Linguistic Analysis +

In case you're wondering who on earth has time to tally all this, the answer is a lot more people than had time thirty years ago, before computers were set to the task with specially designed software. Now, sifting through 100,000 words and generating collocation parameters and cluster graphs is remarkably less painful. Moreover, while you aren't going to count the frequency of every word in not just one book but dozens, a computer will.

And this is where things get really interesting.

The computer will give you a list of the most frequently used words overall--not words you told it to look for as a result of foregrounding, intuition, or however else your own brain has muddled things--but just what's really in the text. Not surprisingly, you typically get a lot of function words at the top, such as:

  • the
  • at
  • that
  • and

Etc., etc. Fairly innocuous stuff that no one in their right mind would even bother with. However, the linguist John Burrows did bother, and, eerily enough, research indicates that these words actually WORK. You can get a wordprint based on function words--the theory being (roughly stated) that an author's use of such things is so deeply wired and habitual that the author has little or no stylistic control over their usage. It's done virtually automatically.
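Burrows went on to formalize this idea in his "Delta" measure: turn each common word's frequency into a z-score across the whole comparison corpus, then take the mean absolute difference in z-scores between the unknown text and each candidate author. A rough sketch, with the alignment of the frequency lists assumed on my part:

    from statistics import mean, stdev

    def delta(test_freqs, author_freqs, corpus_freqs):
        # A rough, Burrows-style Delta. All arguments are relative
        # frequencies of the same most-frequent words, in the same
        # order; corpus_freqs holds one frequency list per text in
        # the comparison corpus (assumed: at least two texts, and
        # no word with zero variance).
        deltas = []
        for j in range(len(test_freqs)):
            column = [text[j] for text in corpus_freqs]
            mu, sigma = mean(column), stdev(column)
            z_test = (test_freqs[j] - mu) / sigma
            z_author = (author_freqs[j] - mu) / sigma
            deltas.append(abs(z_test - z_author))
        # Smaller mean difference = more alike; compare this score
        # across candidate authors and pick the lowest.
        return mean(deltas)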

In a million years, I would never have thought of this, or guessed that usage of such common words could differ enough from one writer to the next to make a difference. What are the implications? That the system works regardless of what, and for whom, you are writing. Letter to your Mom or Harlequin Romance novel, they might all end up in the same cluster, or damn close. You cannot hide.

The rest is for computational linguistics and Noam Chomsky to decide--but on the surface at least it appears our brains are, as usual, less under our conscious control than we would like. Right down to the letter.

Accuracy and Effectiveness

Another sticky wicket. From my very limited experience, it seems more time is spent trying to figure out whether the analytical methods described above work at all than is spent on applying them. The number of conditions that have to be met in order to proceed is dauntingly high, and the testing has to be done in broad strokes. If a four-word collocation span doesn't cluster authors properly but an eight-word span does, you have to try both and everything in between, and ultimately decide when things have gotten too broad to be meaningful. The top ten most frequent words? The top 800? You run tests to decide which tests to run, and it won't work for everybody. It's nowhere near an exact science.
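One way to run those tests-to-decide-which-tests, reusing the profile and nearest_author helpers sketched earlier: hide each known text in turn, try to re-attribute it, and see which parameter setting gets the most right. This leave-one-out scheme is my own illustration, and it assumes at least two known texts per author:

    from collections import Counter
    from itertools import chain

    def top_n_vocab(all_texts, n):
        # The n most frequent words across every known text, pooled.
        pooled = Counter(chain.from_iterable(all_texts))
        return [w for w, _ in pooled.most_common(n)]

    def leave_one_out_accuracy(texts_by_author, n):
        # Hide each known text in turn, attribute it using the rest,
        # and report the fraction attributed correctly for this n.
        all_texts = [t for ts in texts_by_author.values() for t in ts]
        vocab = top_n_vocab(all_texts, n)
        correct = total = 0
        for author, texts in texts_by_author.items():
            for i, text in enumerate(texts):
                known = {a: [profile(t, vocab) for j, t in enumerate(ts)
                             if not (a == author and j == i)]
                         for a, ts in texts_by_author.items()}
                correct += nearest_author(profile(text, vocab), known) == author
                total += 1
        return correct / total

    # Try n = 10, 100, 800... and keep whatever best separates the knowns.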

That being said, sometimes you get success. In 1913, LaSalle Corbell Pickett published a book alleged to contain letters from her husband, Confederate Army General George Pickett, of Pickett's Charge fame. The letters entered the American Civil War canon, becoming part of serious study. Nonetheless, suspicions lingered, and when put to the tests above, the letters didn't group with material we know to have come from General Pickett. Indeed, they sit a lot closer to his wife's other works. The letters included in her book are now largely thought of as her own compositions.

So who wrote Shakespeare? I have no idea.


Proof:
Holmes, David I., et al. "A Widow and her Soldier: Stylometry and the American Civil War." Literary and Linguistic Computing 16.4 (2001): 403-420. Oxford: Oxford University Press.
Hoover, David. "Statistical Stylistics and Authorship Attribution: an Empirical Investigation." Literary and Linguistic Computing 16.4 (2001): 421-444. Oxford: Oxford University Press.
And notes from David Hoover's lecture of April 9th, 2003.
