On producing meaningful html

Html started off as a way of writing documents which could contain links to other documents. In fact, the original Enquire written by Tim Berners-Lee bears great similarity to this here website-thingy. In Weaving The Web, Tim explains that html was intended as a structural language, not a presentational one. Unfortunately, a large number of people wanted to do stuff with html that it just didn't provide for.

As html is a structural language, it is not supposed to provide visual layout. At all. The way it's supposed to work is that we indicate a heading with <h1>, meaning that it is the title of a chapter or section. The visual browser then puts it in bold font and adds a line break and some white space; the oral browser pauses before it and reads it out in a heading-type fashion; the search engine crawler says "Hah, he put these words in the heading; they must be very important"; and you, dear reader, can write a css rule which only shows the headings on a page in order to provide a quick and nifty table of contents.
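As a rough sketch of that last idea, the style rule below hides everything except the headings. It assumes the headings are direct children of <body>, and the page content is made up for illustration.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Headings only</title>
  <style>
    /* Hide every direct child of body that is not a heading,
       leaving a crude table of contents. Assumes the headings
       are not nested inside other elements. */
    body > :not(h1):not(h2):not(h3) { display: none; }
  </style>
</head>
<body>
  <h1>On producing meaningful html</h1>
  <p>Introductory text, hidden by the rule above.</p>
  <h2>How do I do it?</h2>
  <p>More text, also hidden.</p>
  <h3>Italics</h3>
  <p>And so on.</p>
</body>
</html>
```

This only works because the headings were marked up as headings in the first place; bold text inside <font> tags would give the rule nothing to hold on to.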

But because visual layout is what people wanted and needed, the '90s saw html markup being used in a purely presentational way:

  • <font> tags all over the place.
  • Lots of <i> and <b> tags for a variety of uses.
  • Diverting the <table> tag from its original purpose and using it for layout.

This behaviour is still frequently found today. Using semantic markup is all about trying to make your html structural instead. We shall look at the hows and the whys and then see how all this could apply to e2.

How do I do it?

First, separate form from content. Look at your html and find any stuff which you put there in order to affect presentation. Then, either use different tags, or move anything to do with presentation into some css. This is more difficult than it seems and sometimes requires compromises: some things are technologically possible, but won't work in certain browsers. Other times, there just is no correct solution. Let's look at some cases of presentationally desired elements and the way to go about them.
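Here is a minimal before-and-after sketch of that separation. The content and the class name warning are invented for the example; your own markup will differ.

```html
<!-- Before: presentation mixed into the content -->
<font color="red" size="5"><b>Warning</b></font>
<br>
Do not feed the trolls.

<!-- After: the markup says what things are... -->
<h2 class="warning">Warning</h2>
<p>Do not feed the trolls.</p>

<!-- ...and a style-sheet says how they look. -->
<style>
  h2.warning { color: red; }
</style>
```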

Italics

There are a number of reasons things sometimes appear in italics and a number of ways of achieving it.

<em>
Use this if you want to make a word stand out. Oral browsers will stress this word in honour of your semantic correctness. Some people will tell you that you should always use <em> for italics rather than <i>. Think about this for a minute. What possible advantage could be gained if we replaced all the <i> tags in the world with <em>? In fact, this would be even worse: we would be taking a meaningful tag and using it solely for italics.
<cite>
You know how, in a bibliography, a book title is printed in italics? This isn't because you emphasised the word. It is because you are citing it. It is important not to use <em> for this because otherwise screen readers will say it emphatically – which is not the idea.
<var>
This one should be used for variables, especially in mathematics and in technical documentation about programming.
<span lang="fr">
Another of the uses of italics is for foreign words. These should also not be read emphatically. If you describe the language in this way, screen readers can change voice and say the words in French. You should associate this with a style-sheet which tells the graphical browser that spans with a language attribute should be in italics. Most other uses of italics, such as prettification of quotes and headings, can and should be done with css.
<i>
This is almost always wrong. Telling us that a piece of text should be in italics is semantically meaningless. It's purely presentational. A text browser without italics, such as lynx, attempts to do something with these, like put them in a different colour. But it might just as well have chosen to ignore them. I can think of one specific occasion where this is the best option. I'll come back to it later when discussing semantic markup and e2.
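To make the distinctions concrete, here is a small sketch in which all four meaningful tags would typically come out in italics in a graphical browser, each for a different reason. The sentences are invented; the attribute selector is one possible way of writing the style-sheet rule for foreign words.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Four kinds of italics</title>
  <style>
    /* Graphical browsers: any span that declares a language is italic. */
    span[lang] { font-style: italic; }
  </style>
</head>
<body>
  <p>You <em>must</em> read <cite>Weaving The Web</cite>.</p>
  <p>Let <var>n</var> be the number of links on the page.</p>
  <p>It has a certain <span lang="fr">je ne sais quoi</span>.</p>
</body>
</html>
```

On screen the four look much the same. To a screen reader, a search tool or a style-sheet they are four different things.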

Bold

The same things we said of italics can mostly be said for bold, except for the fact that there are fewer tags which natively display in bold. In general, <strong> is an even more emphatic version of <em>. The other use of bold is in headings. These are very useful semantic tags, because they outline the structure of a document: <h1> is the title, <h2> is a subheading, etc. Some people don't use these <hN>, because they show up differently in different browsers. This is the whole point! You cannot predict how people will see your page. So you don't mark it up visually, but structurally. For all other boldings, use css.

Two-dimensional layout

One of the main things which html did not contain at the beginning – and still does not have – is a satisfactory way of laying out a document on a page. A lot of ink has been spilled on this subject so I will try to be brief and to the point. Frames are only a good solution when dealing with multiple documents which need coordinating, such as javadoc. Tables are meant to store tabular data. This is important: just because you lay something out in two dimensions, it does not make it tabular data. Text browsers and oral browsers could do a good job of representing real tabular data. Unfortunately, so much of the web uses <table> to do layout that little useful work can be done in this field. Sucks to be blind I suppose.

You may notice I'm getting my knickers in a twist over this one. There are a bunch of good reasons that css layout is better than tables. The reason so many tables are used is that coding for layout with tables is tricky and intricate and difficult. And once coders have done their <table> stuff, they are reluctant to hear that their way – a skill which shows how talented they are – is outdated and wrong; they even claim that they can do stuff with tables which they otherwise couldn't do. This is like people who perfected techniques of working with a square wheel rejecting the round wheel, saying that their way is superior because at least a square wheel doesn't roll away when parked on a slope.
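For what it's worth, here is a minimal sketch of a two-column page with no <table> in sight. It uses flexbox, which is just one option in current browsers and not a technique this writeup originally had in mind; the class names are invented.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Layout without tables</title>
  <style>
    /* Two columns side by side: a fixed-width menu and flexible content. */
    .page { display: flex; }
    .menu { width: 12em; }
    .content { flex: 1; }
  </style>
</head>
<body>
  <div class="page">
    <div class="menu">Navigation goes here.</div>
    <div class="content">The document itself goes here.</div>
  </div>
</body>
</html>
```

Take the style-sheet away and a text or oral browser still gets the menu followed by the content, in a sensible order and with no phantom rows and cells.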

Lists

Just think of all the web widgets that are really lists. A navigation menu is a list of places you can go. So is any other list of links. So, in fact, is the list of New Writeups. Again, this sort of thing can be important. An oral browser will announce the number of elements a list contains before beginning to read them out. In a short list, the user might listen to all the options. In a longer list, it is more likely that they will follow the first option that interests them. Remember that there are ordered lists, unordered lists and definition lists. All of these may convey more information than just words separated by <br />.
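A navigation menu marked up as the list it really is might look like the sketch below. The link targets and the class name nav are placeholders, and the css is only there to show that a list need not look like a bulleted list.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>A menu is a list</title>
  <style>
    /* Presentation only: show the list as a horizontal bar without bullets. */
    ul.nav { list-style: none; margin: 0; padding: 0; }
    ul.nav li { display: inline; margin-right: 1em; }
  </style>
</head>
<body>
  <ul class="nav">
    <li><a href="/">Home</a></li>
    <li><a href="/new">New Writeups</a></li>
    <li><a href="/help">Help</a></li>
  </ul>
</body>
</html>
```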

Other stuff

You could also consider all the other specific html tags, like those for writing <code>, <kbd> (keyboard input) and <samp> (sample output). There is also a bunch of other things which people do which they shouldn't or don't do which they could. In short, each time you write a web document, ask yourself whether you have used the available html to its full potential. Ask yourself whether there is any stuff which is only presentation which would be better in a style-sheet.
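A short illustration of those three tags, using an invented bit of documentation:

```html
<p>To list your files, type <kbd>ls</kbd> and press <kbd>Enter</kbd>.
The shell replies with something like <samp>notes.txt todo.txt</samp>.
In a Python script, the same job might be done by calling
<code>os.listdir()</code>.</p>
```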

Is there any point in being pedantic about this?

The short answer is no! What is important is knowing why you don't care about semantic markup. Each of the following reasons for using semantic markup can be partially debunked; I take them in order below.

  • It's the right thing to do.
  • It helps accessibility.
  • It helps search engines.
  • It is more versatile.

Just like spelling and grammar are not necessary, neither is writing good html. But there are better ways of authoring on the web; correct grammar and coding are among them. Tim Berners-Lee has a vision of how the web should be. He did create it you know. So do it to please him. Do it to please me.

The accessibility claim may be true, but not entirely. There are a number of things which are not in the domain of semantic markup, and which are even more useful for blind and other disabled people. And, quite frankly, it's a nice sentiment, but in the real world, you can't go around pandering to the 0.0001% of your market share who use oral browsers. On the other hand, if people wrote good websites, you could use oral browsers in cars and other situations where you don't have your hands free.

Unfortunately, googlebot lives in the real world. And in the real world, being number one on google means big bucks. We can therefore safely assume that most search engines do not rely on headings, emphasised text, or any other semantic markup: if they did, people would put lots of spam words in these tags in an effort to rate high on google.[1] More on searching later when talking about e2.

The last claim is the most important one. Separating content from form is always a good thing. It means you can refactor the form without touching the content. For instance, you could decide that your emphasised text should be in colour on a screen, but just ordinary italic for printing. You could change the whole layout of your site in a matter of minutes. Having semantic markup, you could make a list of all the acronyms which you use, along with their meaning; or keep tabs on each time you used a French or Latin expression. You could implement your own search engine, based on the fact that you know that your markup is always meaningful. The possibilities are endless.
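The screen-versus-print example can be sketched with two media rules. The colour value is arbitrary; the point is that the sentence in the body never changes.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>One content, two presentations</title>
  <style>
    /* On screen, emphasis is shown in colour rather than italics. */
    @media screen {
      em { font-style: normal; color: #c00; }
    }
    /* On paper, emphasis is plain italic. */
    @media print {
      em { font-style: italic; color: black; }
    }
  </style>
</head>
<body>
  <p>You <em>really</em> should separate form from content.</p>
</body>
</html>
```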

The bottom line is: if you care about such things, by all means do something about it. If you don't, you are not alone. But if you want to spew bile at the evangelists, at least have a better, more considered explanation than "if it ain't broke, don't fix it". According to the ideals of the web, the current situation is broken and we should do all in our power to fix it.

What about web standards?

web standards, noun. A large stick or cudgel, used by the slightly more anal-retentive to beat the slightly less anal-retentive.

The Devil's Dictionary

Although web standards are very nice and all, they do not contribute directly to semantic markup, just as good spelling does not contribute directly to good grammar. For instance, one of the web standards type things is to note that <i> is a presentational tag and shouldn't be used. This is not enough. For the foreign language example I gave, <span class="italic"> with a css rule would also have worked. But it would have been as semantically meaningless as the basic <i> tag.

In the same vein, it is all very well for the xhtml people to tout <hr /> over <hr>. It does nothing for semantic markup though. A horizontal rule is purely presentational.

Just remember that it is good to strive for both, but that one does not imply (and does not have to imply) the other.

Semantic markup and e2

The main problem we have on e2 with doing semantic markup is the lack of stylesheets. I'm not saying this is a bad thing, as stylesheets are rather powerful and it would be difficult to restrain the usage. Let's have a look at some of the things you can do, things you ought to know are wrong and why semantic markup can help e2.

Headings

One of the things e2 is very well suited for is headings. Use them! As I have said, people don't like them because they can't predict how they will look from browser to browser. Tough Beans! The <h1> tag is for the node title. I like to keep <h2> level for the writeup title if it has one (this one does). Each individual heading in your wu is then <h3>. The disadvantage is that, because of lack of css, a heading automatically implies a new line. If you want the first word of a paragraph in bold, in the real world you would do some css with a heading. Here, <strong> will have to suffice.
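Under that convention, the skeleton of a writeup might look something like this. It is only a sketch of my own habit as described above, not an official e2 rule, and the text is filler.

```html
<!-- <h1> is already taken by the node title, so the writeup starts at <h2>. -->
<h2>On producing meaningful html</h2>
<p>Opening paragraph of the writeup.</p>

<!-- Each section inside the writeup gets an <h3>. -->
<h3>How do I do it?</h3>
<!-- No css here, so a bold first word has to be <strong>. -->
<p><strong>First,</strong> separate form from content.</p>
```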

Italics and the lack of css

We have seen the tags which should be used for different cases. This is the exception I was telling you about. <i> is basically a <span> which is permanently styled in italics. If the piece you would like to see in italics is neither an emphasized word, nor a variable, nor a citation, use <i>. Consider it bonus formatting for visual browsers which does not impair any of the other browsers. Please don't use <em> for this purpose. This applies to other presentational tags such as underlining, making smaller, making bigger and other such. Just remember to avoid using them on the rest of the interweb.

Blockquotes

Some people like to have a margin around what they're reading. First off, remember that you could add a user style-sheet to see the whole of e2 with a gutter around writeups. Blockquotes should be used for what they are intended: blocks of quotes. Unfortunately, the readability of indented text far outweighs the fact that it is bad markup. Again, please don't do this anywhere else!

Applications

The nice thing about e2 is that we have a large database where an editorial policy is applied. If this policy were to extend to semantic markup, a large number of interesting things could happen. A search engine with very good quality results would be relatively easy to implement. Suppose the <cite> tag were used consistently whenever you are citing your sources. We could then assume that all these sources were reliable; and that they were the best references on the subject. A quick search through all of e2 and what do you get? A sponsored bibliography. All the best books and websites, according to e2. <ins> and <del> tags could also be used to very good effect when editing writeups. It would enable you to look up all recent changes by a certain noder, for instance.
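As a sketch of what such markup could look like: the datetime attribute is a standard part of <ins> and <del>, but putting the noder's name in a title attribute is purely my assumption about how e2 might record who made the change. The dates and the name are made up.

```html
<!-- A citation that a bibliography search could pick up. -->
<p>The idea is set out in <cite>Weaving The Web</cite>.</p>

<!-- An edit: datetime records when, title (an assumed convention) records who. -->
<p>The meeting is on
  <del datetime="2005-01-15" title="edited by somenoder">Tuesday</del>
  <ins datetime="2005-01-15" title="edited by somenoder">Wednesday</ins>.</p>
```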


To sum up, semantic markup is a good thing if you have the time and inclination. What is more, here on e2, although we are slightly restricted, the payoffs could be enormous.


  1. "Search engines seem to use markup the same way humans do: headings and elements that cause increased presentational weight, such as <strong> and <i>, will raise slightly the weight of the content within said elements." -- http://www.meyerweb.com/eric/thoughts/2004/12/18/ses-chicago-report/. So markup is used, but semantic markup is not.
