edev: E2 index

by getha

Tue Sep 04 2001 at 10:26:30

This is my first venture into the edev realm...
This idea is a ramble in progress, meaning that it's not all that
clear in my head yet. Subject to change and comments are
very much appreciated...

edev: Everything₂ indexing - a higher level power?

A couple of times I have run into noders lamenting the absence of the ability to search on wu text itself. I have - on occasion - also lamented that same. The problem is of course obvious: huge load on the server/database. The implementation of a text search on wu text will probably never see the light. And rightly so, because it would kill E2 dead in its tracks.

I have lately been playing with an idea, though, that could possibly add greatly to the searching capabilities, while still keeping server and database load at an acceptable level (I think¹).

This idea came forth when researching for a regular node. A book I was using for the research covered
a very much larger topic than what I wanted to node about and on top of that covered its ground in large
portions of uninterrupted text. Very hard to find parts relevant to my subject of interest. At first I
thought the index at the back would be of help, but that was so poorly executed that I soon gave up on
using that book at all (and - temporarily - on the node).

The idea behind an index is very smart, and - if executed correctly - can greatly enhance the usability of a reference work. This possibly also goes for E2.

Assumptions...

If I understand correctly, a search on E2 is done by doing an SQL query on node titles. In fact, it seems the engine does two if a node by the exact search string does not exist (the second doing a 'fuzzy' search to find closely matching node titles²).

(As an aside: I can't find the dbtable where the actual wu texts would reside. None of the dbtables, like 'e2node', 'writeup' or even 'node' have an entry where it could reside, I think, like 'heaven' does with the field 'data' that has type 'text'. Or am I looking in the wrong place?)

The reason that searching on node title is acceptable and searching on wu text is not, would presumably be simple logistics: the title is limited to a maximum of 240 chars, but wu text can be much longer. Besides, a node can contain multiple writeups, so this adds an extra factor to a search on wu text.

An index

First, let me explain what I understand to be an index in the context of this wu. An index in this sense is a list of certain words in a reference work (usually a book) that, either:

records each instance (per page, or even down to the paragraph or line number) of these words in the work
records each instance of a section (chapter, paragraph) of the work that deals - more or less - with the word in question

The purpose of this is of course ease of use. If one wants to find information about a certain topic in a large reference work, one could look up words that are connected to the topic in the index and find all the pages that contain these word or all sections that in some way deal with these words, depending on the implementation used.

Something similar could be a great asset to E2, also.

Acceptable load

But now we come to the tricky part. If we want this in E2, we want this to be implemented in a way that does not add too much strain on the server and the database. It will add to the load. No matter what. That should not be mistaken.

The question therefore is this:

is this at all possible to implement in such a way that the extra load remains within acceptable limits?

I'm not sure. But I think it possibly is.

Implementation

What would the end user see? There are a number of possibilities.

One would be to create a separate search mode just for searching the index. The results then appear on a new 'index search results'-superdoc.

Another would be to add an extra checkbox for including the index in the search. In this case the results would be appended to the usual 'search results'-superdoc.

The results themselves would be presented more or less like they would appear in a normal hardcopy reference work. Which means a list of linked node titles for each search term.

A new dbtable

On a database level this would be implemented by adding an extra dbtable to the database, somewhat similar to a simplified 'links' dbtable, I would imagine. It looks something like this:

index_term - of type char(240)
to_node - of type int(11)
possibly some extra maintenance info

Workings

This would work something like this:

user creates a wu in normal way
before submitting user sets index terms (see below for mechanism and extra comments on this)
on submit for each index term a record is added to dbtable 'e2index'
- index_term = index term
- to-node = node_id of current node (or wu? not sure)
gods and editors can add to or delete from this dbtable directly
if user edits the wu: on re-submit the old index records with its node_id are deleted and the records are added as if new, thus ensuring consistency
if the wu gets killed, so do all the index records with its node_id

Restrictions on searching on index

To minimize the extra load, searching on index terms should only be done on complete index terms, so that the query can be of the form ".. WHERE index_term = 'search string'". This means that index terms need to be chosen intelligently, which ties in with the 'Divide and conquer' heading below.

The mechanism for adding index terms

Easiest to implement would probably be an extra number of inputboxes for adding the index terms. The number would tie in with the allowed number of index terms for the user's level, of course (see below).

Another way is to extend the hardlinking code to handle index terms. Something of the form:

[ flag that this is an index term | index term to use | visible text ] This could then also function as a regular hardlink. This implementation is harder I think, also because the limit on number of index terms has to be checked and caught if not adhered to.

Update: Just remembered. The big drawback (possibly) to the use of seperate inputboxes is that this lends itself to possible abuse (for instance, adding index terms that are unrelated but thought to be often searched for - somewhat like the small sized list of terms on those pr0n sites)

Divide and conquer

This extra feature lends itself for beefing up level powers, as the searching capabilities and index term creation can be graded. For example, the following setup could be used:

Level 2:
- Gain the power to search on index terms using a separate search interface (search is done only on the complete search string)
Level 3 (or some higher level):
- Gain the power to search on index terms using an extra checkbox (only on the complete search string)
- Gain the power to add a restricted number of index terms to each wu
Level (some higher level):
- Gain the power to search on index terms using an extra checkbox (where the search is done using all the separate words in the search string³ in addition to a search on the complete search string)
- Gain the power to add more index terms to each wu

Update September 6, 2001:

I want information about the person Wilhelm Maybach, so I do a regular search. Result: no hits that have anything to do with the man
So then I turn to an index search on the same term. Now imagine that I had been a noder with the powers to add index terms to my wu's. In that case I would certainly have added them for 'Wilhelm Maybach' in both my DMG and Mercedes-Benz wu's. Result: at least two that are very close to the mark

Update October 19, 2001:

JayBonci

¹ That's what this doc's for... Tell me if I'm right or wrong
² Which I think creates a much larger load on the server, too, because this involves using a ".. WHERE title LIKE 'search string'"-type structure in the query instead of the much faster ".. WHERE title = 'search string'"
³ The string is chomped into separate words and the query is done for every (excluding common words like 'and', 'I', 'the' and others) word separately

September 6, 2001

I like it!

(idea)

by yerricde

Thu Sep 06 2001 at 15:30:58

Another possibility: Outsourcing

Perhaps we could enlist the aid of Google in this full-text indexing. Currently, http://everything2.com/robots.txt reads:

User-agent: *
Disallow: /

The administration used to allow Google to spider the site (until E2 started lagging like a bitch). If we were to allow some robots and detect appearance of bot names in the HTTP request's User-Agent field and serve a special theme (similar to printable version except with soft links and a sleep()) that displayed no nodelets or ads, we could let some spiders roam the site without incurring excessive load.

<aside>Would it ease the load on the user dbtable if we deleted fled noders with no writeups?</aside>

I like it!

(idea)

by jaybonci

Thu Sep 13 2001 at 3:36:56

Okay, here's the thing, I'll respond to both of your writeups, since this is too big for a message:

Getha:
This is a good idea, in theory.

First, a technical explaination. Most objects that contain text (writeup, node, user.. for homenode info, superdocs), all inherit type of document. You can see this by going to the nodetype for each of the links above. To get the information for each one we do something like:
SELECT * FROM node LEFT JOIN document LEFT JOIN user
In terms of C++ style inheritance, think of node for the base type of all objects.

Now for a critique of the general idea...There are serveral reasons why a keyword search wouldn't be efficient.

First off, the table would get huge quickly. We have over half a million writeups, and at say three good keywords between a writeup, that's a bunch of keywords. To pull even something close out like that would not be good. We wouldn't manage that very well.
Second, it would be hard to go and back-index the old tables manually (or even run a script) to index the most common words.
Letting users keyword their own writeups wouldn't necessarily be very useful, seeing as we really can't monitor them all. Weighting certain users (the ones with the power), over other users goes kind of against the current power balance.
It would be better if we simply did this as a mySQL supported full index search on the document table. It does it in (presumably) the most optimal way for the db provider. I think we'd need to change the table type from what we currently have, and nate was tossing that idea around a while ago. Manual indexing is expensive, and would cause a lot of lag / downtime.
Playing with the link mechanism is not something we really want to do for a lot of reasons. Miainly because it's a standard ECore thing, and messing with it is bad.

Yerricde: Outsourcing keywords is alright, but that would mean every writeup would have to be accessed at some point, and then constantly hit to make sure it's updated. It's kind of an icky system to have a dumb client like that. Printable version for google has been tossed around, and it's low on the priority list. We don't NEED to index the site right now, but it would be nice. Heavily relying on external resources may not be looked upon favorably by google. Also, the site is meant to be self contained; that one reason why there are largely no external links.

All in all good ideas, but not really feasible with our current setup. We do need to change the db table types and then full index may be possible.

I like it!

robots.txt	What is edev?	Firm Link	The Eleven Satanic Rules of the Earth
pet bee	The principles of nuclear weapon safety and meeting girls are remarkably similar	Soft link	Everything2 Hubs
ecore	Hard link	Google	Scottish Parliament Elections 2007
Curtiss P-40 Warhawk	Wilhelm Maybach	firmlink	edev: Writeuptype bugs, or how nate got stuck between two parentheses
edev: long node title in Random Nodes fix	The Everything Library	printable version	outsourcing
Table Of Contents	User-Agent	Star Ocean: The Second Story	Nodelet

edev: Everything2 indexing - a higher level power?

Assumptions...

An index

Acceptable load

Implementation

A new dbtable

Workings

Restrictions on searching on index

The mechanism for adding index terms

Divide and conquer

Update September 6, 2001:

Update October 19, 2001:

Another possibility: Outsourcing

edev: Everything₂ indexing - a higher level power?