Getting things done with the Extensible Markup Language
In March, I wrote an article (see Resources later in this article) about the Extensible Markup Language and its affinity to Linux and the Linux way of doing things. Due to overwhelming reader feedback, we have decided to schedule a series of follow-up articles. In this article and others to follow, I’ll take a closer look at some of the practical things you can do with XML.
Luckily for this purpose, the Linux community has taken to XML as well as I could have hoped. Many Linux development projects and languages use XML processors and libraries. The Cocoon project is building an XML-processing system around Apache that is, in at least one area, ahead of most commercial equivalents. KDE, the K Desktop Environment, uses XML as the native file format for its impressive KOffice. GNOME has an entire menagerie of XML tools, libraries, and applications, some of them part of the general releases and some available strictly in CVS, the Concurrent Versions System. It is also nice to see that a trickle of apps, mostly GNOME Toolkit apps using libxml, are moving to XML-based config files, as I advocated in April.
In the commercial space, both IBM’s DB2 Universal Database 6.1 and Oracle 8i have come to Linux with an aggressive adoption of XML and many tools for XML document management. IBM’s WebSphere for Linux provides XML server tools, including a parser and data transformation tools. The other Universal DBMS engines for Linux are not far behind.
But there’s no need for me to trot out a long list of the XML projects for Linux. A quick search of Freshmeat with keywords such as XML, DOM, and XSL will yield riches for XML newbies and gurus alike.
What’s new in tag land?
The standards bodies have been just as busy as the Linux hackers. It’s pretty safe to say that right now the key standards for XML are either at 1.0 or getting there quickly. These are the standards that will help truly establish XML’s interoperability:
- XML: the core standard
- Namespaces: a way to resolve tag-name clashes within XML documents
- DOM: the Document Object Model, a set of standard interfaces for accessing XML and HTML document components
- XSLT: Extensible Stylesheet Language Transformations, a general language that is the best approach for exporting XML to an HTML-browser world
- XLink: the extended linking capabilities discussed in the April article
- XML Schema: perhaps not as far along as the others, this important spec addresses several shortcomings of DTD, the document type definition, for expressing constraints on XML documents
Other standards are complete or in process; they include XML Fragments, XML Query Language, and XHTML, an XML-compliant and thus thankfully strict dialect of HTML. But those are not as central as the standards in the list.
This article briefly describes these standards. Later, I’ll discuss how all these technologies can come together under Linux to lend a great deal of power to Linux applications.
First a few words on the font of all this activity: The XML 1.0 specification is still an amazing piece of work. It tackles the core language and the normal behavior of XML processors, dealing with such complex issues as character encodings along the way. If you want an esoteric discussion of what the standard should or should not have inherited from SGML, and other such minutiae, you could always watch such expert mailing lists as xml-dev, but for most purposes the core XML spec is a rock — and a particularly good example for other standards in every area except readability.
Namespaces in XML
One key area that the core XML didn’t address is name clashes. Suppose we are using tags from two specifications; perhaps one is for general document formats (for emphasis, titles, paragraphs, etc.) and another is for marketing terminology. In the former, the element-type name code refers to text that should be formatted as computer source code; in the latter, code is part of a product specification.
Although it might be clear to a person that code here refers to a product code and not to computer source code, it might not be clear to search engines and style processors, and the results of their processing might reflect the confusion. Notice that the element-type name description suffers from the same problem. XML needs the capability, common in programming languages and such, of specifying universally unique names. Enter the Namespaces in XML recommendation from W3C, the World Wide Web Consortium. Under this spec, the code could be rewritten as follows.
Now all the potentially ambiguous names are qualified in a standard manner. The <document> tag defines two namespaces. The first is marked by the prefix mkt, an alias for the URI (Uniform Resource Identifier) http://our.industry.org/schema/product-info. A special attribute name starting with xmlns: indicates that a namespace is being defined, and the rest of the attribute name specifies the prefix to be used in the names of elements in the new namespace. To an XML processor that handles namespaces, the tag <product> is qualified by that URI. Note that URIs are a superset of URLs. Also note that a namespace name is a URI by format, but the spec disavows any particular meaning for the URI. If it is a URL, there is no guarantee you will find anything at that URL: it is merely a unique string. There has been some debate about this lack of namespace meaning, and about the choice of URIs for naming, but the system does work well.
The second attribute of <document> defines the default namespace, marked by xmlns with no prefix indicated. The XML processor assumes that all nonprefixed element-type names are in that default namespace. Namespaces, including default namespaces, can be overridden. For instance, the code above is basically equivalent to the following:
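One way to check this kind of equivalence is to parse both spellings and compare the fully qualified element names. The following is only a sketch in Python using the standard library's ElementTree. The memo content and the default-namespace URI (http://example.org/schema/document) are assumptions made up for illustration; only the mkt URI comes from the text above.

```python
import xml.etree.ElementTree as ET

MKT = "http://our.industry.org/schema/product-info"  # URI given in the article
DOC = "http://example.org/schema/document"           # hypothetical default namespace

# Form 1: a default namespace for the memo, plus the mkt prefix for product data.
with_prefix = f"""
<document xmlns:mkt="{MKT}" xmlns="{DOC}">
  <title>Re: Widget 404 Request</title>
  <paragraph>We need <mkt:product><mkt:code>404</mkt:code></mkt:product> soon.</paragraph>
</document>
"""

# Form 2: no prefixes at all; the default namespace is overridden on <product>.
with_override = f"""
<document xmlns="{DOC}">
  <title>Re: Widget 404 Request</title>
  <paragraph>We need <product xmlns="{MKT}"><code>404</code></product> soon.</paragraph>
</document>
"""


def qualified_names(source):
    """Return every element name, in document order, as {namespace-URI}local-name."""
    return [elem.tag for elem in ET.fromstring(source).iter()]


names_prefix = qualified_names(with_prefix)
names_override = qualified_names(with_override)

for name in names_prefix:
    print(name)
print("same qualified names:", names_prefix == names_override)
```

A namespace-aware processor sees the same qualified names in both documents; the prefixes themselves are only a local shorthand.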
Attribute names can also be in a particular namespace.
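As a rough illustration, the Python sketch below parses one element carrying two namespaced attributes and one plain attribute. The element name item, the attribute plain, the attribute values, and the default-namespace URI are all hypothetical; the two prefixed namespace URIs are the ones discussed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: only the ns1/ns2 URIs and the spam/eggs names come from the article.
source = """
<item xmlns="http://example.org/default"
      xmlns:a="http://www.a.cd/ns1"
      xmlns:b="http://www.b.de/ns2"
      a:spam="1" b:eggs="2" plain="3"/>
"""

item = ET.fromstring(source)

# The element itself picks up the default namespace.
print(item.tag)

# Prefixed attributes are reported with their namespace URI;
# the unprefixed attribute is NOT placed in the default namespace.
for name, value in item.attrib.items():
    print(name, "=", value)
```

The unprefixed attribute stays outside any namespace even though a default namespace is in scope, which is one of the ways the rules differ between elements and attributes.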
The spam attribute name is in the namespace http://www.a.cd/ns1, and the eggs name in the namespace http://www.b.de/ns2. The rules for default namespaces differ a bit between elements and attributes, however. For more information, read James Clark’s tutorial on namespaces (see Resources).
Versions 1.4 and later of libxml support XML namespaces, although some bug fixes for attribute namespaces have been earmarked for the 1.6 release. This is pretty good progress considering that not many XML processors support namespaces yet. However, the popular Simple API for XML (SAX), which many parsers use as a frontend, does not yet support namespaces. Nor does the DOM support namespaces. And speaking of the DOM …
The Document Object Model (DOM)
The DOM reflects the natural tree structure of XML documents: most XML components are instances of the abstract class Node, which has attributes such as firstChild, and so on.
The DOM defines more specialized interfaces for documents, elements, text, attributes, entities, and other abstractions. It also provides standard collection classes for nodes.
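To make the tree navigation concrete, here is a small sketch using minidom, the lightweight DOM implementation in Python's standard library (not one of the libraries surveyed in this article). The memo content is a made-up stand-in.

```python
from xml.dom import minidom

# Hypothetical memo, written without whitespace between elements so that
# firstChild and nextSibling land on elements rather than blank text nodes.
doc = minidom.parseString(
    "<document>"
    "<title>Re: Widget 404 Request</title>"
    "<paragraph>Please advise.</paragraph>"
    "</document>"
)

root = doc.documentElement        # the <document> element
title = root.firstChild           # first child node: <title>
title_text = title.firstChild     # the text node inside <title>
paragraph = title.nextSibling     # the node after <title>: <paragraph>

# Elements and text are both kinds of Node, distinguished by nodeType.
print(isinstance(title, minidom.Node), isinstance(title_text, minidom.Node))
print(title.nodeType == minidom.Node.ELEMENT_NODE)
print(title_text.nodeType == minidom.Node.TEXT_NODE)
print(title_text.data)
print(paragraph.tagName, len(root.childNodes))
```

The same attribute names (firstChild, nextSibling, parentNode, childNodes) are defined by the DOM itself, so code written this way should carry over to other DOM implementations with little change.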
Several XML processing tools for Linux provide DOM interfaces. The most versatile libraries and parsers provide SAX for a straightforward, sequential processing of XML source, and DOM for cases where random access to document components is desired.
libxml-perl, a collection of Perl XML tools that is very different from libxml, supports both SAX and the DOM, as does the Python XML package. IBM’s XML4J and XML4C, powerful XML libraries for Java and C, support both interfaces as well. libxml has a standard SAX interface and bases its internal data structures on the DOM. A few important features haven’t been standardized for the DOM, particularly support for namespaces, for information provided in DTDs, and for interfaces for reading and writing XML source. Those gaps are being addressed in DOM updates (called Levels by the W3C).
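The difference between the two styles can be sketched with Python's standard xml.sax and xml.dom.minidom modules. This is an illustration under assumptions, not the API of any library named above: the ElementLogger class and the memo content are hypothetical.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical memo used for both styles of processing.
SOURCE = (
    b"<document>"
    b"<title>Re: Widget 404 Request</title>"
    b"<paragraph>First</paragraph>"
    b"<paragraph>Second</paragraph>"
    b"</document>"
)


class ElementLogger(xml.sax.ContentHandler):
    """SAX style: the parser calls us back once per start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


# Sequential processing: nothing is kept except what the handler records.
handler = ElementLogger()
xml.sax.parseString(SOURCE, handler)
print(handler.seen)

# Random access: the whole tree is in memory, so we can jump to any component.
doc = minidom.parseString(SOURCE)
second_paragraph = doc.getElementsByTagName("paragraph")[1]
second_text = second_paragraph.firstChild.data
print(second_text)
```

SAX suits large documents and one-pass filters because it holds nothing in memory on its own; the DOM costs more memory but lets the program revisit any part of the document.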
Extensible Stylesheet Language Transformations (XSLT)
The most exciting thing about XML is the way it allows people to define their own sets of tags with their own meanings. The natural complication of this has been an explosion of languages, some standard and some proprietary, based on XML. The XML community recognized the need for a standard approach to transforming XML, one that could convert documents from one form to another — and even allow the general processing of XML data. XSLT provides that facility.
But the development of XSLT wasn’t in quite so straight a line as one might have thought. As its name implies, it is part of the Extensible Stylesheet Language (XSL) effort for rendering XML documents into various media. The W3C first developed a general XML vocabulary for expressing presentation elements (similar to the role of HTML) and specified XSLT as a way to process rich XML data into the pure presentation format, known as formatting objects (FOs).
Of course, the main problem people were trying to solve all the while was how to render XML documents to HTML-based Web browsers. They used the XSL transforms mostly to produce HTML rather than FOs. The W3C realized that the transformation language was really an entity all its own and began to develop it separately.
XSLT is a powerful language, but its syntax might be perplexing to C/C++ programmers. Really more of a functional language, XSLT will at first probably come more naturally to users of Lisp, Scheme, and the like. But XSLT doesn’t take too long to figure out. As a basic example, let us look at a stylesheet we might use to render a memo from the Namespaces example as HTML.
Don’t worry if your head is swimming. I’ll go over how the code works. But it might be useful to first view the output. If you run a stylesheet processor against the first version of the marketing memo, the one that does not use namespaces, and the stylesheet above, you should get the following result (or one very similar):
First, several stylesheet processors are available for Linux. I used 4XSLT, a Python XSLT processor developed by my company, but there is also XT by James Clark, who deserves as much recognition as any other open source pioneer for the tremendous work he has put into freeware SGML and XML tools. Countless SGML and XML users were raised on his Expat, Jade, XT, and other tools. There is also IBM’s LotusXSL. See Resources for links to these tools, all of which are freeware. There are other processors that will work in Linux, but those should be a good start. In most cases you can just specify the files containing the XML source and the stylesheet, but there is also a standard processing instruction for specifying stylesheets — which can be the older cascading stylesheets (CSS) or XSLT — in XML documents.
To try the example using 4XSLT, install the software, copy the XML source to xslt-demo.xml and the stylesheet to xslt-demo.xslt, and then run the processor on those two files.
The result will be printed to standard output.
And now to explain the stylesheet. As you can see, it is regular XML. You might also notice that, as mandated by the standard, it declares a namespace, http://www.w3.org/XSL/Transform/1.0. The elements in the namespace are known as instructions; they direct the flow and output of processing. You’ll notice the several template instructions, known as template rules. These indicate to the processor that we have rules to apply whenever we come across parts of the source XML that match the template. Examine the first template. Its match attribute has a value of "/", which indicates that it should match the entire document. The notation in the match clauses is a special and somewhat complex pattern language in which slashes separate levels of tags in a manner analogous to slashes in the Unix directory hierarchy. The entire document contains one element with the name document, so to specify this element, we would say /document. The document contains a paragraph element, which we could indicate with /document/paragraph, and so on. There is much more to the patterns, but that will suffice for our example.
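The directory analogy can be made concrete by listing the slash-separated path of every element in a document, much as find lists files. The Python sketch below is not a pattern matcher; it only shows how such paths line up with the tree. The memo content and the helper name element_paths are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical memo; element names follow the ones used in this article.
SOURCE = (
    "<document>"
    "<description>Memo</description>"
    "<title>Re: Widget 404 Request</title>"
    "<paragraph>We need <product><code>404</code></product> soon.</paragraph>"
    "</document>"
)


def element_paths(elem, prefix=""):
    """Yield a path such as /document/paragraph for elem and each descendant."""
    here = f"{prefix}/{elem.tag}"
    yield here
    for child in elem:
        yield from element_paths(child, here)


paths = list(element_paths(ET.fromstring(SOURCE)))
for path in paths:
    print(path)
```

Each printed line is a path that would select the corresponding element in this particular document, read from the root downward just like a Unix file path.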
Once the stylesheet processor finds the match, it processes the instructions within the template. Following special, implicit instructions, it simply echoes the text and any elements that are not instructions within a template. Those elements are known as literals, and accordingly, the processor begins by putting out the literals that precede the first instruction.
Then it runs into the value-of instruction. This instruction evaluates the contents of its select attribute using an expression language that is a superset of the pattern language we mentioned. The expression language is a separate standard called XPath. You can see that this value would refer to the title element inside the document element. In XPath expressions, the value of an element is the concatenated value of all its contents. The content of the element in question is the text "Re: Widget 404 Request." The XSLT processor then writes that to the output and continues echoing the literals after the value-of instruction, so the output now contains the title text followed by those literals.
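The rule that an element's value is the concatenation of all its text content can be imitated in a few lines of Python. This is a sketch of the idea only, not an XPath implementation; the nested product markup inside the title is a hypothetical variation added to make the concatenation visible.

```python
import xml.etree.ElementTree as ET

# Hypothetical variation on the memo: the title contains nested markup.
SOURCE = (
    "<document>"
    "<title>Re: <product>Widget 404</product> Request</title>"
    "</document>"
)

root = ET.fromstring(SOURCE)
title = root.find("title")


def string_value(elem):
    """Concatenate every piece of text inside elem, in document order."""
    return "".join(elem.itertext())


title_value = string_value(title)
print(title_value)
```

The markup inside the title disappears and only the text survives, which is what an instruction such as value-of would write to the output for that element.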
It’s beginning to take shape, no? Now we come to the apply-templates instruction. This is the recursive heart of the processing language. The matching of a template establishes a context that has several aspects. Most important, the context marks where in the XML source we are. Since the first template matched the entire document, that is our context. An analogy to the Linux file system works here: The context is similar to the current working directory. Within the first template the current working directory is analogous to the root directory. Within a context, just as in Linux file systems, you don’t need to always specify the entire path to a template match. In our current context, document is the same as /document. What apply-templates does is look through all the templates and process all those that match in the current context. Let’s look at the second template, the one that matches title.
The processor will look for an element named title at the top level of the XML source. Of course the search will fail, as will all the other searches. Where does the processor go from here? XSLT defines a few built-in or default template rules that match XML source that explicit rules don’t address. One of these matches any element at the top level of the current context and simply executes apply-templates on its contents. There is only one element at this level, document. So the processor now calls apply-templates within this built-in rule, and the context is shifted to the document element.
Now several templates match. The first element in the current context is description. No template matches this, so the processor uses the built-in rule and calls apply-templates for the /document/description context. All there is at this level is the text "Memo." Another built-in rule simply echoes the text into the output. Next is the title element, matched by the second template we defined. This template puts out the literal <H1> tag, and calls apply-templates within it. Again the only XML source at this /document/title context is the text "Re: Widget 404 Request," which is echoed to output according to the built-in rule for text. So our output at this point ends with the text "Memo" followed by the title wrapped in an H1 element.
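The recursion just described can be imitated with a toy processor in Python. To be clear, this is not XSLT and not how 4XSLT or any real processor is implemented; it is a minimal sketch that hard-codes one explicit rule (for title) and the two built-in rules mentioned above, and it leaves out the root template and its literals. The function names and the two-element memo are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down memo containing only the two elements walked through above.
SOURCE = (
    "<document>"
    "<description>Memo</description>"
    "<title>Re: Widget 404 Request</title>"
    "</document>"
)


def title_rule(elem, out):
    """Explicit rule: wrap whatever the title contains in a literal H1 element."""
    out.append("<H1>")
    apply_templates(elem, out)
    out.append("</H1>")


# Explicit template rules, keyed by the element name they match.
RULES = {"title": title_rule}


def process(elem, out):
    """Use the explicit rule for elem if one exists, else the built-in element rule."""
    rule = RULES.get(elem.tag)
    if rule is not None:
        rule(elem, out)
    else:
        apply_templates(elem, out)  # built-in rule: just recurse into the contents


def apply_templates(elem, out):
    """Process the contents of elem: echo text (built-in text rule), recurse on children."""
    if elem.text:
        out.append(elem.text)
    for child in elem:
        process(child, out)
        if child.tail:
            out.append(child.tail)


out = []
process(ET.fromstring(SOURCE), out)
result = "".join(out)
print(result)
```

Each call to apply_templates plays the role of shifting the context one level down, and the fallback branches stand in for the built-in rules that handle anything the stylesheet does not mention.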
Hopefully, this is enough explanation to get you started. Our example stylesheet uses features we haven’t discussed, such as the attribute instructions, but I’m out of space to explain them. The concepts I have gone over (templates, patterns, expressions, and context) are central to XSLT, however, and if you follow them, you’ll probably find the rest of XSLT pretty straightforward. James Tauber’s XSLT tutorial, listed in the Resources section, is a good place to start, although as I write it is a bit out of date. If you are lucky enough to have someone who can pay for instructional material, or have a few dollars to spare, you’ll also find a link to inexpensive commercial training materials in Resources.
We have covered a lot of ground in this article. I have tried to give enough of an introduction to the core technologies that we can get down to the grit of using XML practically in Linux systems. The further tutorials and Linux software I have pointed out may get you experimenting in the meantime.
You may notice that I have not discussed two technologies that I claim are key: XLink and Schemas. The main reason is that little software implements those technologies at the moment, so there isn’t anywhere to go with them in a practical view of XML under Linux. I don’t think it will be long before suitable implementations emerge. My company is itself working on an open-source Python XLink processor and is also examining Schemas. As implementations emerge, I’ll be happy to add them to the survey.
And of course, I would love to hear how you are already using XML in Linux, what tools you find useful, and what your impressions are of the general technology so far.
libxml, a general Linux/Unix library for XML
libxml-perl, a collection of Perl tools for XML processing