The given substitution only does a halfhearted job on arbitrary
HTML documents, for several reasons:
- HTML tag names are case-insensitive, so tags written in upper case will be missed by the substitution
  - this can be addressed by adding an i flag, but if you want to substitute tags and preserve their case, it becomes trickier
- HTML allows whitespace between tokens
  - most of this can be remedied by adding [ ]* in the appropriate places
- HTML elements can span multiple lines
  - some nontrivial vi (let alone ex) wizardry is required to handle this
- HTML elements can be nested, so the expression will go wrong on things like <a href="foo"><img src="bar.png"></a>
  - in theory, regular expressions cannot match arbitrarily nested structures, because that takes a context-free grammar; in practice you can hack around this by using temporary replacement strings and working from the inside out in a loop
- syntax errors in the input may lead to the strangest results
  - a missing < can ruin a lot of content without you even noticing
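The first three pitfalls can be seen in a small sketch. It does not use the substitution from above; as a stand-in it assumes a hypothetical task (renaming b tags to strong) and uses Python's re module rather than vi, purely so the behaviour is easy to check.

```python
import re

# Hypothetical stand-in for the substitution discussed above:
# rename <b> tags to <strong>. Not the original vi command.
NAIVE = re.compile(r"<(/?)b>")

# Tolerant variant: ignore case and allow whitespace (newlines included)
# between the tag name and the closing ">".
TOLERANT = re.compile(r"<(/?)b\s*>", re.IGNORECASE)

def rename(pattern, text):
    """Replace every tag matched by pattern with the corresponding strong tag."""
    return pattern.sub(r"<\1strong>", text)

# Upper-case tags, a space inside a tag, and a tag split across two lines.
sample = "<B>one</B> <b >two</b\n> <b>three</b>"

naive_out = rename(NAIVE, sample)        # only the last pair is rewritten
tolerant_out = rename(TOLERANT, sample)  # all three pairs are rewritten
```

Note that the tolerant variant always writes lower-case tags, so the original case is lost, and it still does nothing about nesting or broken input.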
How to get around this:
- use your very own MyHTML language
  - a subset of HTML on which the vi substitution above does work properly; of course you'll now have to guarantee that all your input is in this form
- use a proper HTML parser that actually builds a parse tree and even has some nice heuristics for dealing with the real-world HTML out there
  - HTML::TreeBuilder, libxml2 and tidy come to mind
The two approaches can also be combined: arbitrary input can be 'normalized' to a standard form by a parser, and then manipulated by simple tools that rely on this normalized form, as long as their output remains in that form.
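A minimal sketch of such a normalizing step follows. It uses Python's standard html.parser module instead of the tools named above, and the class name Normalizer and the function normalize are made up for this illustration. The "standard format" it produces (lower-case names, each tag on one line, double-quoted attribute values) is an assumption about what a simple substitution would want, not a fixed convention.

```python
from html import escape
from html.parser import HTMLParser

class Normalizer(HTMLParser):
    """Re-emit HTML with lower-case names and single-line, double-quoted attributes."""

    def __init__(self):
        # Keep character references as written instead of decoding them in text.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _open(self, tag, attrs, closer):
        # The parser has already lower-cased tag and attribute names
        # and decoded attribute values, so values are re-escaped here.
        parts = [tag]
        for name, value in attrs:
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def normalize(html_text):
    """Return html_text rewritten in the normalized form described above."""
    parser = Normalizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

normalized = normalize('<A HREF="foo"><IMG\n  SRC="bar.png"></A>')
```

This only irons out case, whitespace and line breaks inside tags. It does not build a tree or repair badly nested input, so it is no replacement for the tree-building parsers mentioned above.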