A senior Microsoft executive once admitted that Word files are ‘mostly air’. He was explaining that along with your text, there is a vast amount of irrelevant, redundant, or bogus information in the .doc file you save. In itself that's not so bad: it wastes a lot of disk space, but Word ignores it and you never see it.
Unfortunately, exactly what was in Word files was a tightly guarded corporate secret, available only at a substantial licence cost to companies needing to write programs that understand .doc files. Even then, you didn't get all the details: Microsoft was constantly inventing new ways of making it harder for competitors to handle its files reliably.
This was the position until recently. Now, however, the .doc file format is dead. From 2009, the default format for saving all Word files is XML, which is based on the international standard for text applications [1]. XML is a public file format which can be opened and read by any program, so nothing is secret or hidden any more. Regrettably, however, Microsoft chose to obfuscate their new formats (WordML and OOXML) to make them harder to reuse.
This means that Word XML (even WordML or OOXML) files can be used more reliably for publishing, whereas .doc files were virtually unreadable except in another wordprocessor or with special converters, and contained unknown secrets and hidden traps.
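To give a rough idea of what ‘readable by any program’ means in practice, here is a minimal sketch using only Python's standard library. It assumes the file is an OOXML .docx package, which is a ZIP archive whose main text lives in a part called word/document.xml. The function names and the tiny in-memory sample are invented for the illustration; a real file saved by Word contains many more parts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx package.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def paragraph_texts(docx_bytes):
    """Return the text of each paragraph in word/document.xml (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    texts = []
    for para in root.iter(f"{{{W}}}p"):
        # A paragraph's text is split across w:t elements inside its runs.
        texts.append("".join(t.text or "" for t in para.iter(f"{{{W}}}t")))
    return texts

def make_sample_docx():
    """Build a minimal stand-in package in memory; not a complete, Word-valid file."""
    document = (
        f'<w:document xmlns:w="{W}"><w:body>'
        "<w:p><w:r><w:t>Minutes of the meeting</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Budget </w:t></w:r><w:r><w:t>review</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document)
    return buffer.getvalue()

print(paragraph_texts(make_sample_docx()))
```

No wordprocessor and no special converter is involved: an unzip step and an ordinary XML parser are enough to get at the words.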
But Word is still just a wordprocessor, designed for simple office (business) documents. It was never designed for complex academic or research work, and it still lacks many of the features needed. It was also not designed for publication-quality work; and it certainly wasn't designed for long-term information storage [2]. The use of XML is not some kind of silver bullet to solve all problems of document storage and re-use. WordML and OOXML are just more accessible ways of representing a .doc file, in a slightly more rational format. They still only represent the appearance of a document, like a fax or a scan: they make no meaningful use of the features found in ‘normal’ XML applications elsewhere for structuring documents, making writing easier, automating production processes, or for making the files more reliable or robust. In Word, all this still has to be done manually by the author or editor (you!).
[1] XML is actually a metalanguage (a language for describing languages), based on SGML (ISO 8879). XHTML is one language which uses XML (the original HTML used SGML); WordML is another; and there are thousands more, like DocBook (for computing documentation), TEI (for literary and historical transcriptions), and ODF (the international standard for office documents, unfortunately not used by Microsoft).
[2] In fact, just the opposite: Word can't even open its own files from years ago any more. You are supposed to keep them current by manually re-opening them and re-saving them each time you upgrade your copy of Word to a new version.
In the normal way of working, Word has absolutely no idea what you're typing or editing. As far as it's concerned, the document is just like a large bag of Scrabble™ tiles, and about as useful. Most authors use Word like an Etch-A-Sketch™ to ‘draw’ their documents, in the belief that if it looks pretty, it must be right.
For example, Word has no idea where the address is in a letter; no idea where the chapters or sections of your report stop and start; no idea what the author or abstract are in an article; and no idea that the list of agenda items in your committee Minutes is a different thing from the list of people who attended [3].
All these things can be made to look different, using formatting, but that's just for the benefit of human readers. A computer doesn't have eyes and a brain, so it needs to have things labelled by name, not by visual appearance. Currently, authors tend to spend more time making their documents look attractive—or even just trying to make them look right; or struggling with the mouse and menus—than they do actually writing.
[3] Which is why you can't get Word to go through all the Minutes and list those committee members who have missed more than five meetings in the past year on occasions when the budget was on the agenda.
But Word does in fact have a way to label the different parts of your document; it's just not taught in most courses or mentioned in most documentation. There is a set of built-in Named Styles (and you can invent your own) which do two things: a) they can format the named parts of your document automatically; and b) they can identify them for the computer to use later. When set up, this makes writing and editing much faster and more accurate, as well as ensuring that formatting is not lost or garbled when the document moves to another user's system, and ensuring that publishing systems can handle the document reliably.
With Named Styles in a template, an address can be internally (and invisibly) labelled Address; each chapter heading can be labelled Chapter; the title and author can be labelled Title and Author; and the agenda items could be labelled differently from the list of members.
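To see how such labels can be picked up by a program later, here is a small sketch in Python (standard library only). It assumes the document is stored as WordprocessingML, where a paragraph's Named Style is recorded as a w:pStyle element inside its w:pPr properties; note that the value stored there is the style's internal ID, which may differ from the name shown in Word's menus. The sample XML, the style IDs Attendee and AgendaItem, and the function name are all invented for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML namespace for paragraphs, properties, and style references.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Invented sample: "Attendee" and "AgendaItem" are hypothetical custom style IDs.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr><w:r><w:t>Committee Minutes</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Attendee"/></w:pPr><w:r><w:t>A. Murphy</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Attendee"/></w:pPr><w:r><w:t>B. Kelly</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="AgendaItem"/></w:pPr><w:r><w:t>Budget</w:t></w:r></w:p>
<w:p><w:r><w:t>Unlabelled paragraph</w:t></w:r></w:p>
</w:body></w:document>"""

def text_by_style(document_xml):
    """Group paragraph text by the style ID each paragraph is labelled with."""
    root = ET.fromstring(document_xml)
    groups = defaultdict(list)
    for para in root.iter(f"{{{W}}}p"):
        style = para.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        # Paragraphs with no explicit label are collected under a placeholder key.
        label = style.get(f"{{{W}}}val") if style is not None else "(unlabelled)"
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        groups[label].append(text)
    return dict(groups)

groups = text_by_style(SAMPLE)
print(groups["Attendee"])    # the people who attended
print(groups["AgendaItem"])  # the agenda items, kept separate
```

Because the two lists carry different labels, the program can tell attendees from agenda items even if both are formatted identically on the page.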
Using Named Styles is a cultural shift, away from concentrating on what your document looks like, and towards thinking about what it means. The logic is that the formatting should depend on the meaning, not the other way round.
In general, the use of Named Styles only applies to documents that you consider important, and thus worth preserving. For temporary, transient, or trivial documents, or for documents where the visual impact is more important than the message, you can continue to use visual formatting alone.
Word is actually one of the last systems to implement Named Styles. WordPerfect has had what they called ‘Reveal Tags’ for 20 years or so. LaTeX (1985) is built on the principle of named environments, as was SGML (1986), the forerunner of today's XML. And RUNOFF, the grand-daddy of all text processing systems, used a similar labelling concept as far back as the late 1960s.
Electronic Publishing Unit • 2011-07-03