15 September 2002: see the below wu.

I made this for work. It's an HTML/JavaScript 'thing' designed to take all the nasty stuff out of Microsoft HTML, as generated by Microsoft Word, among other things.

It's tested to work on Internet Explorer 5.5. The assumption is that if you have Word, you will have the aforementioned browser.

All you have to do is create a file with the below contents, save it as an HTML file, open it, and follow the simple instructions on the screen. It will delete <DIV> and <SPAN> tags, comments and <o:...> tags, strip the attributes from <P>, <B>, <I> and <BR> tags, and leave in (I'm afraid) the XML-style mess created by indexes, tables of contents, and other stuff. It should be fairly easy to customize (or correct :-) if you know what you're doing. I have absolutely no idea what will happen to diagrams created in Word using the Microsoft Draw-alike.

Because I haven't figured out how to make regexps span arbitrary numbers of line-breaks, you'll still have some 'style' attributes in the HTML unless you check 'Remove line-breaks'. But if you do check it, you won't be able to edit the results without further regexp-fu. Sigh...
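On the line-break problem: in current JavaScript engines a negated character class such as [^>] matches carriage returns and newlines as well, so a tag whose attributes wrap over several lines can be matched without flattening the text first. The sketch below illustrates that under stated assumptions. The helper name stripAttributes and the sample string are made up for the example, and it has not been tried on IE 5.5.

```javascript
// Hypothetical helper for illustration; not part of the page itself.
// Replaces <tag ...attributes...> with a bare <tag>.
function stripAttributes(html, tagName) {
    // [^>] is a negated class, so it also matches \r and \n:
    // the attributes may wrap over any number of lines.
    // The [\s/] part requires whitespace or "/" right after the name,
    // so asking for "b" does not touch <br> or <body>.
    var re = new RegExp("<" + tagName + "(?:[\\s/][^>]*)?>", "gi");
    return html.replace(re, "<" + tagName + ">");
}

// Made-up sample of a Word-style tag split across two lines.
var sample = "<p class=MsoNormal\n   style='margin:0'>Hello</p>";
console.log(stripAttributes(sample, "p")); // prints <p>Hello</p>
```

If that holds in your browser, the 'Remove line-breaks' box would only be needed for tidying the layout, not for catching the leftover attributes.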

If there are any stupid bugs, please tell me. If you have any constructive criticism (i.e. other than 'your code SUCKS, fool!' (I know)), please tell me.

Enjoy. the headache


<html><head><title>DeMSHTML</title><script>

var mshtml="";

function doParse() {

    var regexpTagsToDelete = new Array( "div", "span", "!", "o:", "/o:" );
    var normalTagsToDelete = new Array( "</div>", "</span>" );
    var tagsToDeMunge = new Array ( "p","b","i", "br")

    //this is the bit that causes
    //your browser to grind to a halt.
    mshtml = document.forms['jsRep']['html'].value

    if( document.forms['jsRep']['checkLine'].checked == true ) {
        var re= /\n|\r/gi

        mshtml = mshtml.replace( re, " " )
    }

    execTagRegExp( tagsToDeMunge, false )

    execTagRegExp( regexpTagsToDelete, true )

    for( var i = 0; i != normalTagsToDelete.length; i++ ) {
        deleteStr( normalTagsToDelete[i] )
    }

    document.forms['jsRep']['html'].value = mshtml

}

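function execTagRegExp( tagsToFind, deleteTag ) {
    var re;
    for( var i = 0; i != tagsToFind.length; i++ ) {
        if( deleteTag ) {
            //prefix match on purpose: "!" catches comments,
            //"o:" catches <o:p> and friends
            re = new RegExp( "<" + tagsToFind[i] + "[^>]*>", "gi" );
            mshtml = mshtml.replace( re, "" );
        } else {
            //match the exact tag name only, otherwise "b" turns
            //<br> and <body> into <b>, and "p" eats <pre>
            re = new RegExp( "<" + tagsToFind[i] + "([\\s/][^>]*)?>", "gi" );
            mshtml = mshtml.replace( re, "<" + tagsToFind[i] + ">" );
        }
    }
}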

function deleteStr( strToDel ) {
    var lastIndex = 0;
    var nextIndex = 0;
    var strToReturn="";
    var lenStrToDel = strToDel.length;

    while( (nextIndex = mshtml.indexOf( strToDel, lastIndex ) ) != -1 ) {

        strToReturn += mshtml.substring( lastIndex, nextIndex )

        lastIndex = nextIndex + lenStrToDel;

    }

    strToReturn += mshtml.substring( lastIndex, mshtml.length );

    mshtml = strToReturn;

}


</script></head><body>

<p>Enter your text here...</p>

<form name='jsRep'>

<textarea name="html" rows="20" style="width:100%"></textarea>

<p>...and then click this button:</p>

<p><input name='goButton' value="demunge" type="button" onclick="doParse()"/></p>

<p><input name='checkLine' value="remove line-breaks" type="checkbox" /> Remove line-breaks</p>

</form></body></html>

For bigger documents, it is best to use Mozilla (tested with 0.9.3 and 1.1). IE5.x just crashes with that much input, whereas Mozilla just slows down horribly. For even bigger documents, use sed or something.

Also, this doesn't convert all the special characters like smart quotes and stuff.
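A rough sketch of how the smart quotes could be handled in the same spirit. Everything here is an assumption rather than part of the page above: the function name replaceSmartPunctuation, the choice of characters, and the plain-ASCII replacements. It only deals with the literal Unicode characters, not with numeric entities or Windows-1252 bytes that were never decoded.

```javascript
// Hypothetical mapping from "smart" punctuation to plain ASCII.
var PUNCTUATION = {
    "\u2018": "'",   // left single quote
    "\u2019": "'",   // right single quote / apostrophe
    "\u201C": '"',   // left double quote
    "\u201D": '"',   // right double quote
    "\u2013": "-",   // en dash
    "\u2014": "--",  // em dash
    "\u2026": "..."  // ellipsis
};

// Hypothetical helper: swaps each listed character for its ASCII stand-in.
function replaceSmartPunctuation(text) {
    return text.replace(/[\u2018\u2019\u201C\u201D\u2013\u2014\u2026]/g, function (ch) {
        return PUNCTUATION[ch];
    });
}

console.log(replaceSmartPunctuation("\u201CIt\u2019s done\u201D\u2026"));
// prints "It's done"...
```

Run it on the text after the tags have been cleaned up; a straight double quote dropped into an attribute value could otherwise end the attribute early.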

The best way to de-bastardize Microsoft HTML (MS-HTML) or any crappy HTML is to use the wonderful open-source, W3C-approved program HTML Tidy.

http://www.w3.org/People/Raggett/tidy/
http://tidy.sourceforge.net/

Tidy can now perform wonders on HTML saved from Microsoft Word 2000! Word bulks out HTML files with stuff for round-tripping presentation between HTML and Word. If you are more concerned about using HTML on the Web, check out Tidy's "Word-2000" config option! Of course Tidy does a good job on Word 97 files as well!

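To use it from the command line, just add --word-2000 yes to the tidy command, for example (with a placeholder file name) tidy --word-2000 yes input.html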
