The XML revolution

Toby Howard

This article first appeared in Personal Computer World magazine, May 1998.

IT'S BEEN CALLED the Second Coming of the Web. After a February 1998 announcement by the World Wide Web Consortium, the closest the Web has to a governing body, many Web watchers have been seized with an almost religious fervour. The bad old days of an information-saturated but essentially unintelligent Web are over, they say, and its saviour is "meta-data" -- information about information.

Although the Web is a global repository of information on a scale the world has never seen before, the information is stored in unstructured blobs, so finding what you want is hard, and getting harder. Automated search engines, such as those consulted by the excellent MetaCrawler, scan huge databases compiled by indexing programs, which ferret around as much of the Web as they can, noting which words occur in which pages. If a page happens to contain the text "Absolutely no information whatsoever about penguins on this page", as far as a search engine is concerned, it's as valued a penguin resource as a specialised site such as The Penguin Page. Run a Web search for "penguin" and you'll probably find both pages, but you'll also find the Pittsburgh Penguins ice hockey team, Penguin Books, an on-line club for running enthusiasts, and an interactive dating agency in Utah.

The data deluge is getting steadily worse. In the West at least, data is becoming the most important commodity we have. And with the inevitable convergence of television and the Web in the pipeline, many analysts predict that information management will be the leading technology of the next century.

If anything has been responsible for the enormous growth of the Web, it's been the simplicity of its lingua franca, HTML. It's a small, simple and inflexible language -- precisely the attributes an IT language needs for fast mass acceptance. But, as the Web has developed, HTML has started to creak under the strain. The problem is that it codes the visual presentation of Web documents, not their information content. But now there's a new language for the Web: the "Extensible Markup Language", or XML.

Here's a slightly more useful version of my penguins page. In HTML, I might write:

There are at least <I>seventeen</I> species of
<B>penguin</B>, and not all live in
<a href="http://ice.wizard.net/">Antarctica</a>.

Apart from the hyperlink tag, the other tags control the visual appearance of the words "seventeen" (italics) and "penguin" (bold). XML takes quite a different approach: it lets you invent tags that describe the data in the document. To illustrate, my penguins page in XML might look like this:

<penguin-bird-facts>
   There are at least
   <penguin-species>
      seventeen
   </penguin-species> 
   species of penguin, and not all live in
   <places-penguins-live>
      <a href="http://ice.wizard.net/">Antarctica</a>
   </places-penguins-live>
</penguin-bird-facts>

In this (only slightly artificial!) example, the tags provide information about the information in the page. They're "meta-data". The <penguin-bird-facts> tag, for example, says that all the text between it and the matching </penguin-bird-facts> is "useful information about penguins". It's unlikely that the same tags would be used in Web pages published by Penguin Books. XML does not, of course, provide a set of tags for penguin enthusiasts. Instead, it provides a powerful mechanism for you to define any tag you like, to suit your own purposes. You create a "document type definition", which specifies what your tags mean, and refer to this within the XML file. Your custom tags only ever describe the structure of the data in your document; to specify visual formatting, XML uses a "style sheet" mechanism similar to the Cascading Style Sheets that accompany HTML.
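To give a flavour (this is only a sketch, using the made-up element names from the example above), a document type definition for my penguins page might look something like this:

<!DOCTYPE penguin-bird-facts [
   <!ELEMENT penguin-bird-facts (#PCDATA | penguin-species | places-penguins-live)*>
   <!ELEMENT penguin-species (#PCDATA)>
   <!ELEMENT places-penguins-live (#PCDATA | a)*>
   <!ELEMENT a (#PCDATA)>
   <!ATTLIST a href CDATA #REQUIRED>
]>

Each <!ELEMENT> declaration names a tag and says what may appear inside it, and the <!ATTLIST> declaration says that every <a> element must carry an href attribute. Given such a definition, a program can check a penguin page for correctness before doing anything with it.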

As well as custom tags, XML provides much more sophisticated hyperlinks. In HTML, clicking on a hyperlink takes you directly to the appropriate Web resource. In XML, links can be bi-directional; clicking on a link might bring up a menu of related links; and links can be "transcluded", so that the referred-to page is seamlessly inserted into the page you are reading.
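The linking side of XML is still a working draft at the W3C, so the details may yet change, but to give a flavour, a transcluded link might be written something like this (the attribute names follow the current draft, and the element name and image address are invented):

<penguin-photo xml:link="simple"
               href="http://ice.wizard.net/rockhopper.gif"
               show="embed">
   A rockhopper penguin
</penguin-photo>

Here show="embed" asks the browser to display the target in place, rather than jumping to it.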

However, the power of XML comes at a price -- discipline. Much of the HTML on the Web is actually incorrect, but browsers are extremely tolerant. When someone starts to write a <A HREF=> hyperlink and forgets the matching </A> tag, the rest of the text in the document becomes a giant highlighted hyperlink. With XML, this can never happen. Conforming browsers are simply not permitted to ignore faulty tags and carry on as best they can. The rule is simple: documents which contain incorrect XML code are rejected. Sceptics might think nobody will buy this, but in fact both Netscape and Microsoft argued vigorously for it. If you can browse an XML page, you can be sure that it is correct. And well-formed, syntactically correct pages are essential if the Web is to scale without falling apart.
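For example, HTML browsers will happily display improperly nested tags, but to an XML browser the first line here is a fatal error:

<B><I>Emperor penguins</B></I>     (malformed: the tags overlap)
<B><I>Emperor penguins</I></B>     (well-formed: properly nested)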

As for meta-data being the saviour of the Web, XML is providing the basis for the development of a new proposed standard called the Resource Description Framework, or RDF. RDF is intended to provide an industry-wide standard for describing and organising Web data, and promises to revolutionise Web searching and navigation. Although XML is streamlined for the Web, the vision is that ultimately RDF will unify all the information that comes our way: email, newsgroups, Web searching, databases, and even the files on our hard disks.
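RDF is itself written in XML. Its syntax is still a working draft, but a description of my penguins page might look roughly like this (the namespace addresses, the page address, and the property names, which borrow the Dublin Core vocabulary, are all illustrative):

<RDF:RDF xmlns:RDF="http://www.w3.org/..."
         xmlns:DC="http://purl.org/dc/...">
   <RDF:Description about="http://ice.wizard.net/penguins.html">
      <DC:Title>Penguin facts</DC:Title>
      <DC:Creator>Toby Howard</DC:Creator>
      <DC:Subject>penguins; Antarctica</DC:Subject>
   </RDF:Description>
</RDF:RDF>

A search engine reading this no longer has to guess what the page is about, or who wrote it.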

When the problem of storing and delivering the world's information is eventually solved, we'll enjoy easy access to masses of it. The next question will be: is it any good?

Toby Howard teaches at the University of Manchester.