Tables and HTML validation

HTML in E2 writeups is filtered; only a subset of the available HTML tags are actually allowed. There are approximately two reasons for this:

  1. Malice: to avoid allowing untrustworthy users to spoil the E2 experience for the rest of us by using scripts, images, and external links (to, eg. goatse.cx)
  2. Functionality: Scripts or tables can quite easily break an E2 page, either by acccident or on purpose. For example, since layout is performed mostly with tables, a spare </tr> tag could wreak havoc with the layout of the rest of the page, particularly nodelets etc.

Tables would be fine if they could be validated to prove that they're well-formed and hence will not break tables outside the writeup. Tables in particular aren't allowed because:

  • Table validation code was believed to be non-trivial to implement.
  • It was believed that it'd be too expensive to implement in terms of CPU time on the web server

Also, another issue which was not discussed is that allowing tables in writeups might be "exploited" for forcing layout within writeups; eg. to split a writeup into columns, etc.

Having written some fairly simple code to validate table HTML, running some performance tests on it seems to show that the 'worst case' HTML (a nightmare of tables, taken from an entire pageload on Community2) indicates that the table validation takes about a fifth of the time the normal HTML validation takes.

Approach

The design of the table validation code emerges from a slightly different approach than is used for general HTML validation. The existing HTML validation code in ecore strips out disallowed tags to create valid HTML. Transforming the markup in such a way can be fairly expensive, and for tables could be quite complex to perform. Hence, a much simpler approach was adopted based on observations of writeup content.

  1. Most writeups (currently all of them) do not contain any tables whatsoever.
  2. Writeups which contain tables will mostly contain valid tables, and thus require no transformation to ensure valid and innocuous HTML is the result.

Thus the two cases for which performance should be optimised are the corresponding cases of no table tags at all, and tables which are valid and well-formed. Hence the main code merely scans the structure of the tables in the HTML and determines whether tables have valid structure or not. When a malformed table is discovered, the code adopts an entirely different approach (and one which may in the long run be more useful...) of rendering the table tags visible in the displayed HTML, and adding <div> tags with dashed outlines around the elements to aid debugging.

With this approach, the majority of the web server's activity will be in simply checking the validity of any table tags.

Here's the source to the htmlcode:

# Okay, in brief:
# Fast 'cause it's optimised to the 'common' cases:
# Most writeups have no tables.
# Writeups that have tables will mostly have valid tables:
#   => Only a quick parse to validate.
# We 'enforce' the validity of tables by outputting debug info
#   for badly formed tables. This is UGLY so writeup authors will
#   fix 'em quick.
# In an HTMLcode, so compilation of this code is amortised.
# [screenHTML] should still be used, and can be used to control
#   attributes in the tags. Ideally this works on the output of
#   screenHTML, but only because the 'debug' output uses <div>s
#   with dashed outlines to help HTML writers find their oopsies.


# Should be reasonably fast: scans through the HTML using a m''g, which
# is about as fast as anything in perl can be. Stacks the tags (only
# looks at table tags) and checks the structural validity by 
# matching a two-level context descriptor (stack . tag) against
# an RE describing valid contexts. (again, perl and RE => faster than
# a bunch of ifs or whatever)
sub tableWellFormed ($) {
    my (@stack);
    # Note that htmlScreen ensures that the HTML input to screenTable will
    # only ever terminate the tag name with a space or a closing >, so this
    # is all we have to match.
    for ($_[0] =~ m{<(/?table|/?tr|/?th|/?td)[\s>]}ig) {
        my $tag = lc $_;
        my $top = $stack[$#stack];

        if (substr($tag, 0, 1) eq '/') {
            # Closing tag. Pop from stack and check that they match.
            return (0, "$top closed with $tag")
              if pop @stack ne substr($tag, 1);
        } else {
            # Opening tag. Push, and check context is valid.
            push @stack, $tag;
            return (0, "$tag inside $top") 
                if (($top.$tag) !~ /^(table(tr)?|tr(td|th)|(td|th)(table))$/);
        }
    }
    return (0, "Unclosed table elements: " . join ", ", @stack)
        if ($#stack != -1);
    return 1;
}

sub debugTag ($) {
    my ($tag) = @_;
    my $htmltag = $tag;
    $htmltag =~ s/</amp;lt;/g; # should be encodeHTML, but of course
                            # I don't have that in my standalone testbench.
    $htmltag = "<strong><small>amp;lt;" . $htmltag . "amp;gt;</small></strong>";

    if (substr($tag, 0, 1) ne '/') {
        return $htmltag . "<div style=\"margin-left: 16px; border: dashed 1px grey\">";
    } else {
        return "</div>". $htmltag;
    }
}

sub debugTable ($$) {
    my ($error, $html) = @_;
    $html =~ s{<((/?)(table|tr|td|th)((\s[^>]*)|))>}{debugTag $1}ige;
    return "<p><strong>Table formatting error: $error</strong></p>".$html;
}

my ($text) = @_;
my ($valid, $error) = tableWellFormed($text);
$text = debugTable ($error, $text) if ! $valid;

$text;

A similar approach might yield speed improvements to htmlScreen too. It should be possible, for instance, to construct a single regular expression to determine if some HTML consists entirely of valid tags and attributes (and this regular expression can be constructed automatically, of course). However, since there's a large base of existing writeups that may have invalid tags and attributes that nobody has noticed (since the existing htmlScreen provides no real feedback), it's unlikely that this will be a particularly palatable idea.

As a last word, caching of computed results in ecore is something that could be done better, and more consistently; but we all know this by now, and it's late in the evening so I'm not going to tell you things you already know.

Y'all can try it out over at

http://community2.org?node=test+screenTable

or

http://kahani.org?node=test+screenTable

In particular, I want y'all to try and break it, defeat the validation to get unpleasant HTML through the validator. And remember that in the final version it will be used in conjunction with htmlScreen... ;)