A practical methodology for tag based search engines

This is the story of five people on the Internet who are bored.

But first, a digression.

As anyone who has spent 20 minutes on the Internet can tell you, it is often harder than it looks to simply find genuinely amusing and entertaining things, yet when you aren't looking for anything in particular, you come across all sorts of them. Nobody can explain this phenomenon. It's very Heisenbergian in nature, but more or less it comes down to this: the web is dominated by hierarchy.

Go to Google, type in "humor" and hit I'm Feeling Lucky. You get collegehumor.com, an admittedly humorous site ... for college age people, who by and large dominate and populate the Internet of 2005. Type in the word "fun" and you get FunBrain.com, an edutainment site for young children. "Entertainment" leads to the E! channel's online portal. "Amusing" goes to AmusingFacts.com. And perhaps the most on-the-nose result is "funny", which takes one straight to Funny.com, a hideously-rendered HTML-based site replete with animated GIFs, an outdated copyright statement, and a smorgasbord of jokes ranging from the truly funny to the groaniest of puns.

None of these sites are inappropriate to the search - certainly you'll find humor at College Humor, and (some slight) semblance of entertainment at E!'s website - but is that what you were really looking for? What's more important is that Google's claim is that these sites are the most relevant to your (admittedly generic) search, that these will somehow fulfill your search and make you happy. Google has IPOed a lot of money based on this claim.

Is there a better way?

Our Five Bored People

Alan - A 33 year old computer programmer from London. Watches "Star Trek" religiously. Quotes Tolkien at parties. Has Joss Whedon's office on speed dial. Personal hygiene: unexemplary.
Betty Lou - A 64 year old grandmother of five. Loves knitting sweaters, baking cakes, and the American flag. Votes for whichever candidate looks most like Clark Gable.
Che - A 22 year old political science graduate candidate at UC-Berkeley. Loves silkscreening T-shirts, getting baked, and burning the American flag. Rarely goes a day without triggering Godwin's Law.
Ditsy - A 16 year old high school sophomore from Miami. Has self-generated over 30 different personalities from Cosmo quizzes. Uses the word "like", like, all the time. Major crush on Josh Hartnett. IQ: deprecated.
Evan - A 28 year old patent attorney. Sidelines in an art-punk band on the weekends. Has live-in model girlfriend named Bianca. Reads Pynchon. Gets the jokes on Frasier.

(Any similarity to people you know is really, really sad, unless they're an Evan, in which case please post some more Bianca pics to Flickr.)

Our five people sit down at del.icio.us or Furl or some social bookmark tagging taxonomically-endowed mechanism™. They are told to visit all of their favorite websites except the ones which are humorous in nature, and bookmark them accordingly. Tags fly, left and right. You can probably guess what many of these tags are, although they will certainly surprise you some of the time. Eventually they are finished, and we tell them to type in "humor" at our new tag based search engine and hit I'm Feeling Lucky.

What will happen?

What Should Happen

In a perfect world, our computer will be able to know our favorite things and filter the Internet through those things (without compromising our privacy or security). The tag based search engine is a first step in that direction. It will give an "Alan" flavor to Alan's web search.

So, let's start with what should happen. If you or I, the practical and omniscient search engine, were going to pick out the best website for "humor" for each of these people, what would it be? Let's imagine, in fact, that there are only 5 humorous websites on the web (some might argue that's a stretch in itself). They are:

College Humor (http://www.collegehumor.com/)
Get Your War On (http://www.mnftiu.cc/)
HaLife (http://www.halife.com/)
SlashNOT (http://www.slashnot.com/)
The Onion (http://www.theonion.com/)

Now let's also assume that we asked 1,000 visitors of these sites to "tag" these sites for others to find. In addition to every visitor tagging their respective site as "humor", we saw these trends emerge:

SlashNOT - 209 tagged it "nerd", 86 tagged it "geek", and 43 tagged it "science." No other tag got more than 20 mentions.
HaLife - 416 tagged it "family", 282 tagged it "inspirational", and 61 tagged it "clean." No other tag got more than 20 mentions.
Get Your War On - 292 tagged it "comic", 263 tagged it "anti-war", 114 tagged it "political", and 53 tagged it "obscene."
College Humor - 703 tagged it, predictably, "college", 513 tagged it "stupid", and 112 tagged it "pictures."
The Onion - 491 tagged it "news", 318 tagged it "satire", 203 tagged it "writing", and 63 tagged it "parody."

Words such as funny, fun, humorous, and entertainment are omitted from our list for the time being. So our final list of potential tags, in alphabetical order, is:

anti-war, clean, college, comic, family, geek, inspirational, nerd, news, obscene, parody, pictures, political, satire, science, stupid, writing
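One way to hold the survey results is a plain mapping from site to tag counts. The sketch below is only an illustration of that idea: the name SITE_TAGS and the dictionary layout are assumptions, while the numbers are the ones listed above. It also shows that the alphabetical tag list falls straight out of the data.

```python
# Survey counts per site, copied from the list above. The variable name and
# the dict-of-dicts layout are illustrative assumptions, not a fixed schema.
SITE_TAGS = {
    "SlashNOT": {"nerd": 209, "geek": 86, "science": 43},
    "HaLife": {"family": 416, "inspirational": 282, "clean": 61},
    "Get Your War On": {"comic": 292, "anti-war": 263, "political": 114, "obscene": 53},
    "College Humor": {"college": 703, "stupid": 513, "pictures": 112},
    "The Onion": {"news": 491, "satire": 318, "writing": 203, "parody": 63},
}

# Collect every tag used on any site and sort alphabetically.
all_tags = sorted({tag for counts in SITE_TAGS.values() for tag in counts})

print(len(all_tags))        # 17
print(", ".join(all_tags))
```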

Now back to our 5 bored people.

When a tag has met a certain threshold in a user's personal tag collection - when it is considered a dominating tag - it will be considered by the search engine for relevance. For our sake, we'll say this threshold is 5. Thus, for Alan, his dominating tags are "star_trek" (15), "computers" (10), and "nerd" (8).
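Picking out the dominating tags could look something like the sketch below. The function name and the flat list of tag uses are assumptions made for illustration, and the "tolkien" tag is a made-up below-threshold example; only the three large counts come from the text.

```python
from collections import Counter

DOMINATING_THRESHOLD = 5  # the threshold chosen in the text


def dominating_tags(personal_tags, threshold=DOMINATING_THRESHOLD):
    """Return (tag, count) pairs at or above the threshold, most used first."""
    counts = Counter(personal_tags)
    return [(tag, n) for tag, n in counts.most_common() if n >= threshold]


# Alan's bookmarks, flattened to one entry per tag use. "tolkien" (3 uses) is a
# HYPOTHETICAL tag added only to show something falling below the threshold.
alan_tags = ["star_trek"] * 15 + ["computers"] * 10 + ["nerd"] * 8 + ["tolkien"] * 3

print(dominating_tags(alan_tags))
# [('star_trek', 15), ('computers', 10), ('nerd', 8)]
```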

To perform the search, the computer actually performs a number of searches. It begins with the most dominating tag, appends the user search to it, and searches under that rubric. Then it continues down the line, until it runs out of dominating tags, and then just searches under the main search itself. It throws out all duplicate sites in the process.
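A rough sketch of that cascade, assuming a site matches a search when it carries every term as a tag. The function name cascade and the set-based SITE_TAGS table are illustrative assumptions; the tags themselves are the survey tags listed earlier.

```python
# Tags each site carries in the toy survey (counts are left out here, since
# only membership decides which sites a search returns).
SITE_TAGS = {
    "SlashNOT": {"humor", "nerd", "geek", "science"},
    "HaLife": {"humor", "family", "inspirational", "clean"},
    "Get Your War On": {"humor", "comic", "anti-war", "political", "obscene"},
    "College Humor": {"humor", "college", "stupid", "pictures"},
    "The Onion": {"humor", "news", "satire", "writing", "parody"},
}


def cascade(query, dominating):
    """Search once per dominating tag, then on the bare query, skipping duplicates."""
    seen = set()
    steps = []
    for tag in list(dominating) + [None]:
        terms = {query} if tag is None else {tag, query}
        label = query if tag is None else f"{tag} {query}"
        # A site is a hit if it has all the terms and has not been returned yet.
        hits = [s for s, tags in SITE_TAGS.items() if terms <= tags and s not in seen]
        seen.update(hits)
        steps.append((label, hits))
    return steps


for label, hits in cascade("humor", ["star_trek", "computers", "nerd"]):
    print(f'"{label}" returns {len(hits)} result(s)')
```

Run with Alan's three dominating tags, the sketch makes four searches in total: one per tag, then the bare query.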

"star_trek humor" returns 0 results.
"computers humor" returns 0 results.
"nerd humor" returns 1 result.
"humor" returns 4 results.

As the system retrieves results, it assigns them a relevance quotient. This quotient is determined mathematically with the formula

a(TAG_user)(TAGMATCH_site) + b(TAGSEARCH_site)

Where a and b are weighting constants yet to be determined, TAG_user is the number of times the user has used the dominating tag, TAGMATCH_site is the number of times that site has been tagged with the dominating tag by others, and TAGSEARCH_site is the number of times that site has been tagged with the search query itself. For now, we will assume a and b to be 1.

In this case, when the search engine comes to "nerd humor", it applies the formula, and gets

(1)(8)(209) + (1)(1000) = 2672

for SlashNOT's relevancy quotient to Alan's query.

When it comes to "humor" it gets

(1)(1)(0) + (1)(1000) = 1000

for each of the other sites.
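The arithmetic is easy to check in a few lines. This is just the formula written as a function; the function and parameter names are assumptions chosen to mirror the symbols above.

```python
def relevance_quotient(tag_user, tagmatch_site, tagsearch_site, a=1, b=1):
    """a(TAG_user)(TAGMATCH_site) + b(TAGSEARCH_site), with a = b = 1 by default."""
    return a * tag_user * tagmatch_site + b * tagsearch_site


# "nerd humor": Alan used "nerd" 8 times, SlashNOT has 209 "nerd" tags and
# 1000 "humor" tags.
print(relevance_quotient(8, 209, 1000))  # 2672

# Bare "humor": no dominating tag matched, so the first term is zero.
print(relevance_quotient(1, 0, 1000))    # 1000
```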

It then sorts the sites by relevancy quotient, and displays them for Alan, who sees SlashNOT on top, which we could probably assume would be the best choice of the 5 for Alan.

For Betty Lou, her dominating tag of "family" gives her HaLife; for Ditsy, "pictures" gets her College Humor; for Che, "political" gets Get Your War On; for Evan, "parody" gets him The Onion.
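Putting the pieces together, a minimal end-to-end sketch might rank the five sites for each of the five people. Everything structural here (the names, the layout, the rule that the first matching dominating tag wins) is an assumption for illustration. Alan's tag counts are from the text; the count of 5 for everyone else is a hypothetical placeholder, since the text names their tags but not how often they used them.

```python
HUMOR_TAGS = 1000  # every surveyed visitor tagged their site "humor"

SITE_TAGS = {
    "SlashNOT": {"nerd": 209, "geek": 86, "science": 43},
    "HaLife": {"family": 416, "inspirational": 282, "clean": 61},
    "Get Your War On": {"comic": 292, "anti-war": 263, "political": 114, "obscene": 53},
    "College Humor": {"college": 703, "stupid": 513, "pictures": 112},
    "The Onion": {"news": 491, "satire": 318, "writing": 203, "parody": 63},
}

# Dominating tags per person, most used first. Alan's counts are from the text;
# the 5s are HYPOTHETICAL placeholders sitting right at the threshold.
PEOPLE = {
    "Alan": [("star_trek", 15), ("computers", 10), ("nerd", 8)],
    "Betty Lou": [("family", 5)],
    "Che": [("political", 5)],
    "Ditsy": [("pictures", 5)],
    "Evan": [("parody", 5)],
}


def rank(dominating, a=1, b=1):
    """Score every site for a 'humor' search and sort by relevance quotient."""
    scored = []
    for site, tags in SITE_TAGS.items():
        # The first dominating tag the site carries wins; later matches would be
        # duplicates. With no match, the first term of the formula is zero.
        tag_user, tagmatch = next(
            ((n, tags[t]) for t, n in dominating if t in tags), (1, 0)
        )
        scored.append((a * tag_user * tagmatch + b * HUMOR_TAGS, site))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


for person, dominating in PEOPLE.items():
    score, site = rank(dominating)[0]
    print(f"{person}: {site} ({score})")
```

Only Alan's score of 2672 can be checked against the text. The other scores depend on the placeholder counts, though the top site for each person stays the same for any count of 5 or more.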

Conclusion

In the new folksonomy, a number of advances will help push forward the practice of letting users tag websites for others. Perhaps the most useful aspect of this new search engine will be the ability to generate personalized RSS feeds. Then as new websites emerge and make their way up the ladder, they'll appear on the RSS feed for digesting. In addition, there could be sections on the website or within the feed for "New Sites" which could highlight sites or blog posts which have relevance to the query, but not necessarily a high relevancy quotient due to their newness or obscurity.

Obviously, this system is just a methodology - the numbers aren't sticky. Raise the threshold to 10, or, better yet, let the user set their own threshold - or ignore their personal tags altogether. You can alter the relevancy quotient formula by squaring any of the parts of it. Perhaps the most interesting idea that could be applied is using the actual data itself to alter the values of a or b, or to generate exponents for the formula. Creating a formula that redefines itself based on the data would mean the search engine could be left to run indefinitely - no maintainers or tweakers.
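To make those knobs concrete, here is one hedged way to parameterise the formula and the threshold. The exponent parameters p, q and r and the threshold=None convention are inventions for this sketch, not part of the method as described.

```python
def relevance_quotient(tag_user, tagmatch_site, tagsearch_site,
                       a=1, b=1, p=1, q=1, r=1):
    """Generalised form: a * TAG_user^p * TAGMATCH_site^q + b * TAGSEARCH_site^r."""
    return a * tag_user**p * tagmatch_site**q + b * tagsearch_site**r


def dominating(tag_counts, threshold=5):
    """Tags at or above the threshold; threshold=None ignores personal tags."""
    if threshold is None:
        return []
    return [tag for tag, n in tag_counts.items() if n >= threshold]


alan = {"star_trek": 15, "computers": 10, "nerd": 8}

print(relevance_quotient(8, 209, 1000))       # 2672, the original formula
print(relevance_quotient(8, 209, 1000, p=2))  # 14376, with TAG_user squared
print(dominating(alan, threshold=10))         # ['star_trek', 'computers']
```

Note the side effect of raising the threshold to 10: "nerd" stops being a dominating tag for Alan, so SlashNOT would score 1000 like every other site.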

All in all, this concept could be part of the new Semantic Web in a meaningful broadcasting way. Links could be tagged to be user-specific, or clique-specific, or even website-specific - all things E2, such as the Catbox Archives, the node trackers, and the node backup utility, could be tagged "everything2", or everything I write here could be tagged with "kthejoker", all in order to make things easier to find by myself and by others. Imagine an RSS feed of all your writeups and you'll begin to see the possibility of a tag based search engine.
