This is a very simple Perl module that uses XML::Parser to parse E2 nodes displayed in XML. It looks ONLY for the actual text of each writeup on the node, because I developed it for use with the Jung-Markov Collective Dream Constructor, which cares only about the content of E2 dream logs, not their authors, titles, etc.

This probably could have been done more simply or smartly, with XML::Simple or something else, but it does the job.

Also, note that E2 XML documents are technically not well-formed: they have no single root element wrapping everything else. Expat, the library that XML::Parser uses, therefore reports a well-formedness error and bails. That's why the new() method has a hack that wraps the entire input text in a <DOC> element. (Maybe I'm wrong, I'm relatively new to XML - but aren't we all?) I have informed Edev of this, but no one responded...
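
To make the wrapper hack concrete, here is a minimal sketch, separate from the module itself. The sample input and the helper name collect_doctext are made up for illustration; real E2 output will look different. It assumes only that XML::Parser is installed, and it shows the same input being rejected bare and accepted once wrapped in a single root element.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample input: two top-level elements and no single root.
# This only stands in for the kind of document described above.
my $input = '<node><doctext>first dream</doctext></node>'
          . '<node><doctext>second dream</doctext></node>';

# Illustrative helper (not part of the module): collect the character
# data found directly inside <doctext> elements.
sub collect_doctext {
    my ($xml) = @_;
    my $text = '';
    my $parser = XML::Parser->new(
        Handlers => {
            Char => sub {
                my ($expat, $data) = @_;
                $text .= $data if $expat->current_element eq 'doctext';
            },
        },
    );
    $parser->parse($xml);    # dies if the input is not well-formed
    return $text;
}

# Without a wrapper, the parser dies at the second top-level element,
# so the eval yields undef.
my $bare = eval { collect_doctext($input) };
my $bare_failed = defined $bare ? 0 : 1;

# With the wrapper, the same input parses.
my $wrapped = collect_doctext("<DOC>$input</DOC>");

print "bare input failed: $bare_failed\n";
print "wrapped text: $wrapped\n";
```

The wrapper name is arbitrary; any element name would do, as long as it encloses the whole input.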

#  this package is meant to handle XML from Everything2
#  all it does is parse a string of XML looking 
#  for <doctext> elements.
#  access the text that it finds using the doctext method.
#   steev hise, july 2001 . steev at datamassage.com
#  
#
#  copyleft 2001, gnu gpl, etc etc.
################################################

package E2::DoctextXML;
use strict;
use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS $AUTOLOAD %ok_field $VERSION $ERROR $INDEX $DOCTEXT);

use Exporter;
use XML::Parser;

$VERSION = 1.00;
@ISA = qw(Exporter);

# Authorize some attribute fields
for my $attr ( qw( doctext error ) ) { $ok_field{$attr}++; }

sub new {
    my ($class, $input) = @_;
    my $self = {};
    bless($self, $class);		
    $self->{input} = "<DOC>$input</DOC>";  # a hack to fool the parser.
    return $self;
}

# get attributes of the object. this is read only, except for ERROR
sub AUTOLOAD {
    my $self = shift;
    my $attr = $AUTOLOAD;
    $attr =~ s/.*:://;
    return unless $attr =~ /[^A-Z]/;  # skip DESTROY and all-cap methods
    unless($ok_field{$attr}) { warn "invalid attribute method: ->$attr()" };
    $self->{uc $attr} = shift if (@_ && ($attr eq 'error'));
    return $self->{uc $attr};
}


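# a method for parsing the XML.
# returns 1 on success, 0 if the parser failed (see the error method).
sub parse {
    my ($self) = @_;
    $ERROR   = 0;
    $INDEX   = 0;
    $DOCTEXT = '';   # reset, so repeated parses don't accumulate text

    my $parser = XML::Parser->new(ErrorContext => 2,
                                  NoExpand     => 1,
                                  #Style => 'Debug'
                              );
    $parser->setHandlers(Char => \&_char_handler);

    # XML::Parser dies on malformed input; trap that and keep the message
    eval { $parser->parse($self->{input}); 1 }
        or $ERROR = $@ || 'unknown parse error';

    # AUTOLOAD reads $self->{uc $attr}, so store under the upper-case key
    $self->{DOCTEXT} = $DOCTEXT;
    $self->{ERROR}   = $ERROR if $ERROR;
    return $ERROR ? 0 : 1;
}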

# all this does is append character data found inside a
# <doctext> element to $DOCTEXT. Expat may deliver one element's
# text in several chunks, hence the .= ; $INDEX counts the chunks.
sub _char_handler {
    my($p, $data) = @_;
    if($p->current_element eq 'doctext') {
	    $DOCTEXT .= $data;
	    $INDEX++;
    }
}


###### keep perl happy #####
1;
