Plucene is a Perl port of Lucene, the popular Java-based document indexer. As of April 29, 2007, it is installed on the E2 servers.

Plucene was developed by Kasei as a standard port and follows Lucene's implementation fairly closely. Its primary functions are to:

  1. Index documents, including their text and metadata.
  2. Retrieve documents based on search criteria.

Currently it's at version 1.25. Kasei was also gracious enough to develop the Plucene::Simple module, which, as its name indicates, is a simplified wrapper around Plucene. The examples below use it.

How It Works

It's pretty simple, actually. It's a file-based system, so the first thing you do is create an "index", which is a directory on your webserver dedicated to holding Plucene data, like so:

my $newIndex = Plucene::Simple->open("/path/to/your/index");

Next you add documents to this index. Each document must have a unique ID, which is what searches return. You can either just add the text of a document:

$newIndex->index_document($uniqueID => $document_text);

or you can get more detailed and add metadata, like the author and such:

$newIndex->add(id_1 => {author => "kthejoker", title => "Plucene", text => "Plucene is a Perl port ..."});

text is a specifically reserved field name: it holds the primary document text when you use this more elaborate indexing method.

Once you're done adding documents, it's time to search. The search returns an array containing the unique IDs that match your search criteria. You can search just on the document text, like so:

my @results = $newIndex->search("Plucene"); #returns id_1

Or you can search on the metadata fields, using the format fieldname:query, like so:

my @results = $newIndex->search("author:kthejoker"); #returns id_1

And that's pretty much it. You can also delete documents from the index, check to see if a document has been indexed, and some other obvious functions, but that's the nitty gritty of it.
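
Putting the steps above together, a minimal end-to-end sketch might look like this. It assumes Plucene and Plucene::Simple are installed, and it uses a temporary directory (via File::Temp) in place of a real index path; the IDs and field values are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Plucene::Simple;

# Assumption: a throwaway directory stands in for "/path/to/your/index".
my $index_dir = tempdir(CLEANUP => 1);
my $index     = Plucene::Simple->open($index_dir);

# Text-only document.
$index->index_document(id_1 => "Plucene is a Perl port of Lucene");

# Document with metadata fields alongside the main text.
$index->add(
    id_2 => {
        author => "kthejoker",
        title  => "Plucene",
        text   => "Kasei wrote a simplified wrapper as well",
    },
);

# Search the document text, then a named metadata field.
my @by_text   = $index->search("port");
my @by_author = $index->search("author:kthejoker");
print "text matches:   @by_text\n";
print "author matches: @by_author\n";

# Check whether an ID has been indexed, then remove that document.
print "id_2 is indexed\n" if $index->indexed("id_2");
$index->delete_document("id_2");
my @after_delete = $index->search("author:kthejoker");
print "author matches after delete: ", scalar(@after_delete), "\n";
```

The searches are re-run after the delete rather than relying on indexed, since how quickly a deletion shows up there may depend on when the index is next optimized.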

Using Plucene at E2

Today can only be considered an unqualified success by yours truly. At the outset of the day I knew exactly zilch about CPAN, Debian, apt-get, Plucene, and command-line Perl. And yet I somehow managed to download and install Plucene and get it up and running on E2 without completely melting down the system (yet).

Installing Plucene on a Debian-based system (like E2's Ubuntu servers) is as simple as running the command:

apt-get install libplucene-perl

Plucene::Simple is just a single .pm file, so I manually uploaded it to the Plucene directories on all four E2 webheads. I ran into my first obstacle when I learned that creating an index means creating a directory and locking it, which requires sudo power, and that is unavailable from within E2. So I had to figure out how to get an instance of E2 running from the command line and then use my sudo powers to create the index and insert the first 600 writeups from E2's vaunted history into it.
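
A first batch load like that can be sketched as a standalone command-line script along these lines. The %writeups hash is a hypothetical stand-in for rows pulled from the E2 database, the node IDs and field values are made up, and a temporary directory replaces the real index path; only the add, optimize and search calls come from Plucene::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Plucene::Simple;

# Hypothetical data: node ID => fields. A real script would read these from E2.
my %writeups = (
    node_101 => { author => "alice", title => "Tea",    text => "How to brew green tea" },
    node_102 => { author => "bob",   title => "Coffee", text => "How to brew strong coffee" },
    node_103 => { author => "alice", title => "Water",  text => "Boiling water at altitude" },
);

# Assumption: a temp directory instead of the shared index on the webheads.
my $index = Plucene::Simple->open(tempdir(CLEANUP => 1));

# add() accepts any number of id => { field => value } pairs in one call.
$index->add(%writeups);

# Optimize the index once the batch is in.
$index->optimize;

my @brew = $index->search("brew");
printf "indexed %d writeups, %d mention 'brew'\n",
    scalar(keys %writeups), scalar(@brew);
```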

Luckily everything seems to be in order (so far) and you can view the results at Full Text Search Beta. It's a pretty robust searching mechanism, too: it can do wildcards ("soft*" returns things with "software" and "softener" in them), boolean logic ("Brian AND Eno" versus merely Brian Eno) and even quoting ("Brian Eno" is different from Brian Eno). It behaves pretty much like your standard search, and that's what makes it great. It works out of the box, doesn't require a whole lot of tweaking or customization, and (most importantly) it gives results. I'm sure that as we begin indexing more writeups the search will become slower, but hopefully it'll still be bearable even once the index gets large.
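
To compare those query styles side by side, a small sketch like the following can be run against a scratch index. The three documents and their IDs are invented for the comparison, and a temporary directory stands in for a real index; each search is wrapped in eval in case the installed query parser rejects a given syntax.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Plucene::Simple;

# Assumption: a scratch index in a temp directory, not the live E2 index.
my $index = Plucene::Simple->open(tempdir(CLEANUP => 1));

# Invented sample documents for comparing query styles.
$index->add(
    doc_a => { text => "Brian Eno produced ambient records" },
    doc_b => { text => "Eno once worked with a different Brian" },
    doc_c => { text => "fabric softener and free software" },
);

my @queries = ('soft*', 'Brian AND Eno', '"Brian Eno"', 'Brian Eno');

for my $query (@queries) {
    # eval keeps a parse error from stopping the whole comparison.
    my @ids = eval { $index->search($query) };
    if ($@) {
        print "$query => query failed: $@";
    }
    else {
        printf "%-15s => %s\n", $query,
            join(", ", sort @ids) || "(no matches)";
    }
}
```

Printing the matching IDs for each query makes it easy to see where the quoted phrase, the explicit AND and the bare two-word query diverge on your own data.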

In the meantime, I'll be working to integrate it with our current main search box mechanism for better results. 

 
