e2link: Create everything2 links from word lists (thing) by lenz

e2link is a small (well, not so small) utility written in Perl that is useful to automatically create hard links from word lists. I originally created it quite a while ago in order to help me node How the Internet Came to Be; then I left it for a while in my home node and now I decided to release it.

The idea behind e2link is quite easy: it will parse a word list and create everything2 hard links (i.e. add brackets) in a text of yours when a match is found. It is smart enough to correct the capitalization of the word in your text file with the correct one taken from the word list and is able to handle pipe links, i.e. links that don't link their own literal string. In addiction to this, it won't put links where you already put them by hand.

I think it can be useful to node long texts full of technical jargon or group of nodes having the same subject, therefore making more hard links and with better quality. To use it, first you have to create a text file to submit to e2 (we'll call it txtfile.txt) and a word list file (we'll call it wordfile.txt). A word list file has a single word (or phrase) per line, you can add comments starting the line with '#' and leave white lines to better suit your needs. In the case you want to create a pipe link, you just enter the pipe link in the file, e.g you separate two phrases with the pipe symbol.

An example word list would look like this:

# This is a sample word list
everything|Everything2
E2|Everything2

pancakes|My secret pancake recipe

John Doe
Doe

Note how the longest match must go first, as in the case with "John Doe" vs. "Doe". If your original text file was something like

Mr John Doe likes e2 very much. 
His wife Jenna Doe likes pancakes better.

This would be the result:

Mr [John Doe] likes [Everything2|E2] very much. 
His wife Jenna [Doe] likes [My secret pancake recipe|pancakes] better.

To invoke the script, just type

perl e2link.pl txtfile.txt wordfile.txt > e2file.txt

In addition, if you want to, you can control the maximum number of times a word is linked by adding an optional third parameter stating the maximum number of istances of a word that will be linked.

perl e2link.pl txtfile.txt wordfile.txt 3 > e2file.txt

Means that any word present in wordfile will not be linked more than 3 times.

If you already linked that word, manual links will not be counted. Instances are substituted starting from the beginning of the file (thanks to KissThis and liveforever for the idea and support).

Have phun, and let me know of any bugs you should encounter.

Prerequisites

Perl 5 installed on your machine. If you are on a Linux/Unix box, it's probably already installed. If you are on a Windows machine, you can download a Perl version for free from http://www.activestate.com
A shell installed. bash via Telnet or xterm will do on a Unix machine; if you run Windows you can get a shell by typing COMMAND or CMD after selecting Start/Run...

History:

version 1.1, as of 2003-10-16: added max number of substitutions for words
version 1.0, as of 2003-10-15: initali public release.

To Do:

Sort word list so the longest matching string goes first (isogolem)

Perl source code follows.

use strict;
#
# e2link.pl
#
#  - version 1.1 - 20031016
#
# this program nodes an input text given a keyword
# and substritution file
#
# there is a shadow file, parallel to the one we're
# editing, that is used as a lowercase search pool
# and to protect already substituted text areas 
# (they appear as a string of '-'s in $shadow)
#

# use:
#  perl e2link.pl txtfile.txt wordfile.txt > e2file.txt
#
# or:
#  perl e2link.pl txtfile.txt wordfile.txt 2 > e2file.txt
# so that no word will be substituted more than two times
#

# this script is free software. originally by [lenz]

# docs @
# http://www.everything2.com/index.pl?node=
#        e2link:+Create+everything2+links+from+word+lists


my $txtfile = $ARGV[0];
my $wordfile= $ARGV[1];
my $maxsub  = $ARGV[2] || 9999999;
my $txt;
my $shadow;
my $r;


# loads textfile into $txt and initializes
# $shadow with 
open F, $txtfile or die "$! opening text file '$txtfile'\n";
while (<F>) { $txt .= $_; }
close F;
$shadow = lc $txt;


# do the subst game
# while reading word file
open F, $wordfile or die "$! opening word file '$wordfile'\n";
while ( $r = <F> ) {
    my $nodename; my $nodedest;
    next if $r =~ /^[\s*$|#]/;
        
    chomp $r;
    
    if ( $r =~ /^(.+?)\x7C(.+?)\s*$/ ) {
        ($nodename, $nodedest) = ($1, $2);
    } elsif ( $r =~ /^\s*(.+?)\s*$/ ) {
        ($nodename, $nodedest) = ($1, "");
    };
        
    subst( $nodename, $nodedest, $maxsub );

}
close F;

print "\n$txt\n";
exit;


sub subst() {
    my ($nodename, $nodedest, $maxsubst) = @_;
    
    #
    # This function substitutes $nodename (even if in [brackets])
    # with a suitable combination of $nodename and $nodedest
    #
    
    my $ln = length( $nodename );
    my $char_pre; my $char_post;
    my $alphalist = "abcdefrghijklmnopqrstuvwxyz01234567890";
    my $lcnodename = lc $nodename;
    my $donesubst = 1;    # number of substitutions 
                          # done for this word
    
    if ( $ln == 0 ) { return; };
    
    my $pos = 0;
    
    
    while ( (($pos = index( $shadow, $lcnodename, $pos )) > 0) 
            && ($donesubst <= $maxsubst) ){
        
        $ln = length( $nodename );
        
        # check boundary chars: is this match a part of a bigger word?
        $char_pre = substr( $shadow, $pos - 1, 1 );
        $char_post= substr( $shadow, $pos + $ln, 1 );
        
        if ( ( index( $alphalist, $char_pre) > -1 ) ||
             ( index( $alphalist, $char_post) > -1 ) ) {
             	$pos++;
             	next;
        }
        
        # check if already enclosed in square brackets
        
        if ( ( $char_pre  eq '[' ) &&
             ( $char_post eq ']' ) ) {
            $pos -= 1;
            $ln  += 2;
        }
        
            
        if ( length( $nodedest ) > 1 ) {
            repl( "[$nodedest|$nodename]", $pos, $ln );
        } else {
            repl( "[$nodename]", $pos, $ln );
        }
        $donesubst++;
    }
}


sub repl() {
    my ( $node, $start, $len ) = @_;
    #
    # this function replaces $len chars starting at $start with 
    # $node in the global $txt and with a sequence of ---'s
    # having the same length as $node in $shadow
    #
    
    my $it = substr( $txt, 0, $start );
    my $ft = substr( $txt, $start + $len  );
    my $is = substr( $shadow, 0, $start);
    my $fs = substr( $shadow, $start + $len  );
    
    $txt    = $it . $node . $ft;
    $shadow = $is . ( '-' x length($node) ) . $fs;
}

Small helpful scripts for noders	Notelet Nodelet Squawkbox Client	Interface design	E2 Explorer
How the Internet Came to Be	User interface design for night time	The three golden rules of user interface design	Notelet nodelet HTML tricks
I am an impediment. I am here to get in your way.	Perl	E2 Link and Logger Client	A
Software is free speech	Every day, in every way, I strive to be a danger to myself and others	Hard link	Band name generator
UL	Small Soldiers	Links on Everything2	March 16, 2006
October 15, 2003	The Definitive E2 Weekend Spectacular Hull Breeching Extravaganza!	URL escape sequences

e2link: Create everything2 links from word lists (thing)

Page category: