What's wrong with HTML?

HTML is technically a markup language, and a lot of the problem with it is that it shows no evidence of prior design, and only recently have any attempts been made at standardisation (via the W3C compliance marks). Vendors producing software that uses HTML have been free to add their own enhancements or additions in order to make their own products more desirable. Usually these new tags gain only marginal uptake and only add to incompatibility. The proprietary tags added as a result of these browser wars are often of dubious value and, all things considered, HTML is at best a bad headache; more realistically, a severe mess.

So, objectively speaking, is HTML merely a language to entertain the w4r3z d00dz and not a serious choice for the creation of online resources?

1. Lack of metadata
Without a doubt the single biggest flaw is that a markup language designed primarily for rendering online content has no capability for storing metadata. Perhaps acceptable at a time when the web numbered only a few pages, this now constitutes a severe mistake that can clearly be observed when using any search engine.

Simply described, metadata is a description of the type of content that a piece of data represents. Good markup languages not only store the content, but also describe in a transparent way precisely what this content is.

Metadata is without a doubt a good thing:

  • Searching is easier as content is defined by type; in practice, 'find me quotations by...'
  • Rendering is easier for clients and the markup is portable. Data types a client does not understand can be omitted and the user informed. Version migration is therefore greatly eased
  • Details such as author and creation date are an enforced and mandatory part of the syntax - giving accountability to the document and helping to judge its authenticity
And before someone points it out: no, <META> tags (with their NAME and HTTP-EQUIV attributes) do not absolve HTML of its failings with regard to metadata. They marginally improve keyword searching and are victim to horrific keyword whoring to try to glean more search hits off popular search terms.
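
To make the 'find me quotations by...' point concrete, here is a minimal Python sketch of what searching typed content could look like. The Fragment record, its field names and the sample data are all invented for illustration; nothing like them exists in HTML, which is rather the point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    kind: str     # hypothetical content type, e.g. "quotation" or "paragraph"
    author: str   # who the content is attributed to
    text: str     # the content itself

def find(fragments, kind, author):
    """Return the text of every fragment of the given kind by the given author."""
    return [f.text for f in fragments if f.kind == kind and f.author == author]

# Made-up sample document. Note the paragraph that merely mentions a quotation.
DOCUMENT = [
    Fragment("quotation", "Ada", "First sample quotation."),
    Fragment("paragraph", "Ada", "Body text that mentions a quotation."),
    Fragment("quotation", "Bob", "Second sample quotation."),
]

quotes_by_ada = find(DOCUMENT, "quotation", "Ada")
print(quotes_by_ada)
```

A keyword search for "quotation" would also hit the paragraph; the typed search does not, because it asks what the content is rather than which words it contains.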

2. Presentational details are part of the content
Another serious mistake with any markup language is to mix content with information about how the content is to be displayed. The usual end result is either:
  • Content tied to a particular platform or medium
  • Multiple versions of the same document for said various platforms
  • Content conforming to the lowest common denominator, making little use of advanced features of any platform
HTML not only encourages but forces you to include your presentational details with your content. End-user doesn't like flashing pink text on a yellow background - tough. It's very difficult for browsers to correct for design aberrations by offering any way of customising the layout: how do you make titles and text normal colours when you don't know what's a title and what's body text? See above regarding metadata, or the lack thereof.

Marking in presentation with content also makes the document useless on many platforms. This is somewhat stupid for a markup syntax designed specifically to work on a large network of widely varying platforms. Enclose your text in that novel <blink></blink> construction and watch all of the non-Netscape-wielding world go blindly unaware of its existence.

Fortunately this issue has been somewhat resolved by the coupling of HTML with CSS, which has helped cross-platform rendering of text a lot. This doesn't however excuse HTML of fault - requiring an extension language to perform a primary function is not a sign of sane design.
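
As a rough illustration of why keeping the two apart matters, here is a small Python sketch. The role names and both "style sheets" are hypothetical and far cruder than CSS; the only point is that the content is written once and each client decides how to show it.

```python
# Content only: each item says what it is, not how it should look.
CONTENT = [
    ("title", "What's wrong with HTML?"),
    ("body", "Presentation should live somewhere else."),
]

# Two hypothetical style sheets for two very different clients.
TERMINAL_STYLE = {"title": str.upper, "body": lambda s: "  " + s}
SPEECH_STYLE = {"title": lambda s: "Heading: " + s, "body": lambda s: s}

def render(content, style):
    """Apply a style to content; roles the style does not know pass through unchanged."""
    return [style.get(role, lambda s: s)(text) for role, text in content]

terminal_lines = render(CONTENT, TERMINAL_STYLE)
speech_lines = render(CONTENT, SPEECH_STYLE)
print("\n".join(terminal_lines))
print("\n".join(speech_lines))
```

A user who hates the author's taste could supply a third style without touching the content at all, which is exactly what presentational markup makes so hard.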

3. Lack of standardisation
W3C vs Microsoft vs Netscape vs The User.
Over the space of only a few years about four-gazillion standards evolved and none of them ever became 100% compatible, even now. 'nuff said.

4. Awful syntax enforcement
HTML at its inception didn't actually start out with a bad syntax. Everything would be coded within angle brackets, <TAG>, and then be closed by a corresponding </TAG>. Nesting would be perfectly permissible and indeed be part of the structure of a document, e.g. the BODY tag would be part of the HTML tag.

And of course it would be a plain text, 8-bit clean, user editable source. Nice, egalitarian and simple. Seemingly a good and robust standard with clear and simple rules about syntax. Of course this is not the case.

Creeping featurism added at first only a few new tags, some not requiring end tags. Ampersand-escaped (&) line-feed characters and special symbols quickly followed, sometimes requiring the trailing semicolon, and sometimes not.

Next you could get away without closing tags that did need a termination - but only on some platforms. Hey, why even bother with <P> at all?

End result - insanity when a client choked on a missing tag, choosing from that point onwards to render the document in an increasingly erratic manner - remember pages that suddenly start to warp towards the right due to a missed tag, requiring you to continually scroll horizontally to keep up with them?
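
For contrast, here is a sketch of what strict enforcement of the original open-and-close rule might look like, written in Python using the standard library's html.parser. The StrictChecker class, the check function and the short list of tags allowed to go unclosed are my own assumptions for the example, not part of any standard or of how any real browser behaves.

```python
from html.parser import HTMLParser

# Tags that never take an end tag (a partial list, chosen for this sketch).
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class StrictChecker(HTMLParser):
    """Stop at the first nesting error instead of guessing what the author meant."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, innermost last
        self.error = None   # first problem found, if any

    def handle_starttag(self, tag, attrs):
        if self.error is None and tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.error is not None or tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            state = f"<{self.stack[-1]}> is still open" if self.stack else "nothing is open"
            self.error = f"found </{tag}> but {state}"

def check(source):
    """Return None if every tag is closed in order, else a description of the first fault."""
    checker = StrictChecker()
    checker.feed(source)
    checker.close()
    if checker.error is None and checker.stack:
        return f"<{checker.stack[-1]}> is never closed"
    return checker.error

good = check("<html><body><p>Hello<br>world</p></body></html>")
bad = check("<html><body><p>Hello<b>world</p></body></html>")
print(good)
print(bad)
```

A client built this way would refuse the second document outright and say why, rather than rendering it and drifting off to the right.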

5. Low proficiency for usage (of a complex standard)
Don't get me wrong, I'm not trying to be elitist - there's nothing wrong with having a simple language usable by anyone to publish content (including your gran). However HTML is not simple - it's a behemoth of various tags, many of which are completely useless.

Such a sheer volume encourages a 'more is good' philosophy of layout, where a novice tries to include as many tags as possible in order to prettify (sic) a page. End result is a garish nightmare of flashing, singing and dancing pages. Now with DHTML not only can you annoy a user with your bad taste in colours and fonts, you can also jump the page around, shrink it and make the screen shake - but that would be another writeup.

HTML was almost complete in its first design - it didn't really need any more tags. Some should even have been removed from the standard IMHO, especially the notorious <FRAME> tag, which really should be neutered and all traces of it removed.

6. Ambiguity of creator
Finally, HTML promotes no concept of clear accountability as to who the author of the content is, where to send follow-ups and when the document itself was last updated. Almost all other online forms of data distribution enforce this. When you unpack a tarball to your machine it has an AUTHORS file, an e-mail has a Reply-To in the header (most of the time), newsposts carry a Followup-To address. HTML doesn't enforce any of this as part of either a written or informal standard.

End result is millions of pages left to rot as nobody can be found responsible for their maintenance. The biggest argument against the internet, probably the greatest information resource, is the lack of authenticity of the material - knowing who wrote something is often a good way of deciding on its accuracy or potential bias. Everything is good in this respect in that it places clear emphasis on who wrote the content and when.

Perhaps I'm being unreasonable to level criticism at HTML in this respect? Whilst many of these problems are human rather than technical, I do feel a language designed for the web should perhaps have considered them beforehand - what would be wrong with a mandatory <AUTHOR> tag?
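
To show how little it would take, here is a sketch in Python of a publishing step that refuses unsigned pages. There is no mandatory <AUTHOR> tag, so as an assumption the example stands in the optional <META NAME="author"> convention for it; the AuthorFinder class, the require_author function and the sample pages are all made up for illustration.

```python
from html.parser import HTMLParser

class AuthorFinder(HTMLParser):
    """Record the content of the first <meta name="author"> tag seen."""

    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attribute names arrive lowercased
        is_author = (attributes.get("name") or "").lower() == "author"
        if tag == "meta" and self.author is None and is_author:
            self.author = attributes.get("content") or None

def require_author(source):
    """Return the declared author, or raise ValueError if there is none."""
    finder = AuthorFinder()
    finder.feed(source)
    finder.close()
    if not finder.author:
        raise ValueError("document does not say who wrote it")
    return finder.author

signed = '<html><head><meta name="author" content="A. N. Author"></head></html>'
anonymous = "<html><head><title>Untitled</title></head></html>"

print(require_author(signed))
try:
    require_author(anonymous)
except ValueError as problem:
    print("rejected:", problem)
```

Of course a check like this only proves that somebody typed a name, not that the name is honest, which is the human half of the problem.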

7. But...
Despite all of this, HTML has one redeeming feature that makes it palatable and extremely popular: <A HREF="www.go...">

It's not really surprising BT tried to patent the hyperlink. It's a damned fine idea and makes the web what it is.

Linking pages together and providing a clear and easy method of navigation is an enlightened idea. Without a doubt the reason the web was such a success is the ease with which a user can move from site to site. Only with the increasing volume of data have search engines begun to supersede the hyperlink as the primary source of traffic for your website.

In the light of this you can't help but like HTML, even though it is flawed in every way. From design to use it's an appalling mess, and even though superior solutions such as XML exist, it's likely to remain with us for some considerable time.