Strip anchor tags from the current line (in vi, or sed, or just a generic regular expression)

The breakdown:
s/foo/bar/ = search for foo and replace it with bar.
//g = replace with nothing; the g means do it every time, even more than once per line (globally).
<a = the first part of the anchor tag.
[^>] = a character class that matches anything but the '>' character. Anything but the right angle bracket, since we're going to want to stop at that point.
* = zero or more of these.
> = the right angle bracket that ends an HTML tag.

I wrote this node so I never have to resort to using sticky notes for regexp commands (in my atrocious handwriting). And I'm glad I did, 'cause my Palm III blew up. I had to give it a lobotomy.

So put it all together in regexp logic, and you get:
Find an <a, erase everything you find until you get to the >, erase the > too, then quit erasing and go look for another one on the same line.

Of course, this will fail if there are any PHP or ASP tags inside the anchor tag: those end in their own > (as in ?> or %>), so the erasing stops too early.
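The same logic can be tried outside vi. Below is a minimal Python sketch, assuming the pattern is <a[^>]*> (rebuilt from the breakdown above, not copied from the original command); the name strip_opening_anchors is made up for illustration.

```python
import re

# Assumed pattern, rebuilt from the breakdown: "<a", then any run of
# characters that are not ">", then the ">" that ends the tag.
OPENING_ANCHOR = re.compile(r"<a[^>]*>")

def strip_opening_anchors(line):
    """Delete every match on the line, like the trailing g in s/.../.../g."""
    return OPENING_ANCHOR.sub("", line)

# Both opening tags on the line are removed in one pass.
print(strip_opening_anchors('see <a href="x.html">this</a> and <a name="y">that</a>'))

# An embedded PHP tag ends the match at its own ">", leaving debris behind.
print(strip_opening_anchors('<a href="<?php echo $u ?>">x</a>'))
```

Note that a pattern of this shape only removes the opening tag; the closing </a> stays on the line unless a second substitution takes care of it.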

The given substitution only does a halfhearted job on arbitrary HTML documents, for several reasons:
HTML tags are case insensitive, so tags in upper case will be missed by the substitution.
  This can be addressed by adding an i flag, but if you want to make tag substitutions and preserve case, it becomes trickier.
HTML allows whitespace between tokens.
  Most of this can be remedied by adding [ ]* in the appropriate places.
HTML elements can span multiple lines.
  Some nontrivial vi (let alone ex) wizardry is required to repair this.
HTML tags are nestable, so the expression will go wrong on things like <a href="foo"><img src="bar.png"></a>.
  In theory, regular expressions cannot match arbitrarily nested structures, which need at least a context-free grammar, but it's possible to hack around this by using temporary replacement strings and working from the inside out in a loop.
Syntax errors in the input may lead to the strangest results.
  A missing > can ruin a lot of content without you even noticing, because the erasing runs on until the next > it finds.
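A few of the points above (case, stray whitespace, tags spanning lines) can be patched within a single regular expression. The following Python sketch is one possible way to do it, not the original vi command: the pattern, the \b word boundary (which keeps tags like <abbr> from matching) and the name strip_anchors are all assumptions made for illustration. It still shares the syntax-error and embedded-tag weaknesses.

```python
import re

# </?      optional slash, so closing tags are caught too
# a\b      the letter a followed by a word boundary, so <abbr> is left alone
# [^>]*    anything up to the closing bracket, newlines included
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

def strip_anchors(html):
    """Remove opening and closing anchor tags from a whole document string."""
    return ANCHOR_TAG.sub("", html)

sample = '<A HREF="foo"\n   target="x">link</a >, <abbr>e.g.</abbr>'
print(strip_anchors(sample))
```

Because [^>] also matches newlines, this handles tags that span lines only when it is applied to the whole document rather than line by line.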

How to get around this:

Use your very own MyHTML language.
  A subset of HTML on which the vi substitution above does work properly; of course you'll now have to guarantee that all your input is in this form.
Use a proper HTML parser that actually builds a parse tree and even has some nice heuristics for dealing with the real-world HTML out there.
  HTML::TreeBuilder, libxml2 and tidy come to mind.
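As a rough illustration of the parser route, here is a sketch using Python's standard html.parser module instead of the tools named above. It is an event-based tokenizer rather than a tree builder, so take it as a stand-in for the idea; the class AnchorStripper and the function strip_anchors_parsed are hypothetical names.

```python
from html.parser import HTMLParser

class AnchorStripper(HTMLParser):
    """Re-emit the input, leaving out <a ...> and </a> tags."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":  # the parser lowercases tag names for us
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "a":  # self-closing forms such as <br/>
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def strip_anchors_parsed(html):
    parser = AnchorStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(strip_anchors_parsed('<A HREF="foo"><img src="bar.png"></A> &amp; more'))
```

Processing instructions and a few other rare constructs are not re-emitted by this sketch, so it should be checked against your own input before being trusted.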

For instance, arbitrary input can be 'normalized' to a standard format by a parser, and then manipulated by tools that take advantage of this normalized form, as long as the output remains in that form.
