As a person who worked on the HTML engine inside of Microsoft Office, I can tell you there is a good reason why there is so much bloat inside of a file generated by MSO. The point of saving a document to HTML is document data fidelity and document display fidelity. When you are talking about Office in this case, you mean Microsoft Word, as each product has its own conventions for saving as HTML.
The HTML output by Microsoft Office is about 90% compatible with Netscape Navigator, with a few exceptions:
- Embedded movies and media tend to end up as <img dynsrc=... which is an IE-only specification. They should end up as <embed src=... but that's a personal gripe I had.
- Some meta tags mean nothing to Netscape.
- Some of the CSS and other positioning things only matter in IE 5.0 and above.
- OLE objects tend to be a little weird.
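If you want to see whether a saved file hits the first exception, here is a minimal sketch in Python that lists every dynsrc source it finds. The class name, function name and the file names in the sample are made up for illustration; nothing here comes from Office itself.

```python
from html.parser import HTMLParser


class DynsrcFinder(HTMLParser):
    """Collect the value of every dynsrc attribute found on an <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "dynsrc" and value:
                self.sources.append(value)


def find_dynsrc(html):
    """Return the media sources that only IE would play."""
    finder = DynsrcFinder()
    finder.feed(html)
    return finder.sources


# Hypothetical markup, not taken from a real Word export.
sample = '<p><img dynsrc="movie.avi" src="poster.gif"></p><img src="logo.gif">'
print(find_dynsrc(sample))
```

An empty list means that particular exception does not apply to the file.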
I took a Microsoft Word 2000 document and saved it as HTML. The entire contents of the document were "Hello World", in the true style of CS.
Here's what I got:
You'll see this XML all over the place. This is for future compliance and for use in the Microsoft XML parsing engines. You could also script against this yourself to see what HTML you'll be seeing. It's useful for anyone wanting to parse the information.
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
These are your standard meta tags, except ProgId. This is a COM name identifier that tells what kind of document you're looking at. Therefore, if you open a Word-generated HTML file in Excel, it will know not to parse it, but to launch Word to handle it. The charset in the http-equiv tag tells which codepage to use for the character set. Word is incredibly picky about fonts and characters.
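For anyone scripting against these files, the meta tags are easy to read back. The following is only a sketch using Python's standard html.parser; the names MetaCollector and sniff_office_html are my own, and this is not how Office itself decides which application to launch.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags into dicts keyed by name and by http-equiv."""

    def __init__(self):
        super().__init__()
        self.named = {}       # name -> content
        self.http_equiv = {}  # http-equiv -> content

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        content = attr.get("content") or ""
        if attr.get("name"):
            self.named[attr["name"].lower()] = content
        if attr.get("http-equiv"):
            self.http_equiv[attr["http-equiv"].lower()] = content


def sniff_office_html(html):
    """Return (progid, charset) from the meta tags; None where absent."""
    collector = MetaCollector()
    collector.feed(html)
    progid = collector.named.get("progid")
    charset = None
    content_type = collector.http_equiv.get("content-type", "")
    for part in content_type.split(";"):
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            charset = value
    return progid, charset


sample = '''<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">'''

print(sniff_office_html(sample))  # ('Word.Document', 'windows-1252')
```

The unquoted attribute values Word writes are not a problem for html.parser, which is why it is used here instead of a strict XML parser.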
<link rel=File-List href="./Hello%20world_files/filelist.xml">
Ahh, our first bit of strangeness here... You will notice that this is a relative path to something that simply doesn't exist (it is ignored in this case). It would, however, exist if you had embedded files or any images in the document. This retrieves a list of everything that the Word doc contained (whether it be an OLE object, a file that just isn't visible, etc). When you save out to HTML with images, you will notice that a folder named "documentname"_files is often created alongside the document. This contains all of those items. Here, because the doc is so simple, this isn't an issue.
Your title is right there. Nothing strange at all.
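The href follows directly from the document name. As a rough sketch (file_list_href is a made-up helper, and the only rule it assumes is the "documentname"_files convention described above, with the space URL-encoded as in the link tag):

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def file_list_href(html_filename):
    """Build the relative File-List href for a saved document.

    Assumes the "<document name>_files" folder convention and nothing more.
    """
    stem = PurePosixPath(html_filename).stem  # "Hello world.htm" -> "Hello world"
    folder = stem + "_files"
    return "./" + quote(folder) + "/filelist.xml"


print(file_list_href("Hello world.htm"))  # ./Hello%20world_files/filelist.xml
```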
<!--[if gte mso 9]><xml>
<o:Company>Manifest Research Visions, Inc.</o:Company>
Word keeps a section of document properties, including who wrote the document, the number of lines, when it was created, etc. Whenever you save as HTML, each of these items is kept in this XML section inside of the document. They are used to describe the document and preserve the information across the file format change.
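If you want those properties back out, a couple of regular expressions will do for simple cases. This is a minimal sketch: it assumes the block is closed with the usual <![endif]--> conditional comment terminator (not shown in the excerpt above), it only picks up flat o: elements like the Company one, and document_properties is a name I made up.

```python
import re

# A conditional comment block such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL = re.compile(r"<!--\[if [^\]]*\]>(.*?)<!\[endif\]-->", re.DOTALL)
# A flat element with the o: prefix, e.g. <o:Company>...</o:Company>
LEAF = re.compile(r"<o:(\w+)>([^<]*)</o:\1>")


def document_properties(html):
    """Return {property name: text} for o: elements inside conditional comments."""
    props = {}
    for block in CONDITIONAL.findall(html):
        for name, value in LEAF.findall(block):
            props[name] = value.strip()
    return props


sample = """<!--[if gte mso 9]><xml>
 <o:Company>Manifest Research Visions, Inc.</o:Company>
</xml><![endif]-->"""

print(document_properties(sample))  # {'Company': 'Manifest Research Visions, Inc.'}
```

Because the XML sits inside an HTML comment, a browser that does not understand the [if gte mso 9] test simply skips it, which is why it never shows up on the page.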
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";}
margin:1.0in 1.25in 1.0in 1.25in;
This is non-standard CSS to preserve some of the items that CSS doesn't describe but that Word obviously needs. For instance, mso-header-margin is obviously the header margin number from Page Setup, off of the File menu.
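To see how much of a style rule is Word-specific, you can split the declarations on the mso- prefix. A small sketch follows; split_declarations is a hypothetical helper, the sample string stitches together declarations from different rules above purely for illustration, and the naive split on ";" would break on a semicolon inside a quoted value.

```python
def split_declarations(style):
    """Split a CSS declaration string into (standard, mso_specific) lists."""
    standard, mso_specific = [], []
    for declaration in style.split(";"):
        declaration = declaration.strip()
        if not declaration:
            continue
        prop = declaration.split(":", 1)[0].strip().lower()
        if prop.startswith("mso-"):
            mso_specific.append(declaration)
        else:
            standard.append(declaration)
    return standard, mso_specific


sample = ('font-family:"Times New Roman";'
          'mso-fareast-font-family:"Times New Roman";'
          'margin:1.0in 1.25in 1.0in 1.25in')

standard, mso_specific = split_declarations(sample)
print(standard)
print(mso_specific)
```

Note that a prefix check is not the whole story: tab-interval on the body tag is also not standard CSS, and it carries no mso- prefix.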
<body lang=EN-US style='tab-interval:.5in'>
<p class=MsoNormal>Hello world</p>
There, that seems to be all of it. Word takes many, many steps to make sure that the document is preserved as well as possible and that HTML is a lossless format. If you pick apart more complicated documents, you'll notice that your data is held intact across this format very well, and that was the design goal. This has all been explained before in a public forum, and I'm not giving away any proprietary secrets of this four-year-old feature.
Actual hand HTML editing is done very rarely by a person who would save as HTML in Word. Basically, you are looking at a business person who wants to save to the Web, most likely a corporate intranet. In the Macintosh Office 2001 version of the product, there is a save option off of the Save dialog that allows you to save as "clean" HTML without the Microsoft-specific tags (but you lose information in the conversion back to the web).
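If you don't have that option, you can approximate part of it by hand. The sketch below is my own rough guess at what "clean" could mean, not what the Office feature actually does: it drops conditional comment blocks (assuming the usual <![endif]--> terminator) and the ProgId, Generator and Originator meta tags. strip_office_markup is a made-up name.

```python
import re

# Conditional comment blocks such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL = re.compile(r"<!--\[if [^\]]*\]>.*?<!\[endif\]-->", re.DOTALL)
# The Office-specific meta tags, quoted or unquoted, plus trailing whitespace
OFFICE_META = re.compile(
    r"<meta\s+name=\"?(?:ProgId|Generator|Originator)\"?[^>]*>\s*",
    re.IGNORECASE,
)


def strip_office_markup(html):
    """Remove conditional comment blocks and Office meta tags from html."""
    html = CONDITIONAL.sub("", html)
    html = OFFICE_META.sub("", html)
    return html


sample = (
    "<meta name=ProgId content=Word.Document>\n"
    "<!--[if gte mso 9]><xml>\n"
    " <o:Company>Manifest Research Visions, Inc.</o:Company>\n"
    "</xml><![endif]-->\n"
    "<p class=MsoNormal>Hello world</p>\n"
)

print(strip_office_markup(sample))
```

Everything removed this way is exactly the information Word relies on to rebuild the document, so keep the original file around.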