Fundamentals of Web publishing with XML
As more developers learn and experiment with XML, many have become interested in using stylesheets to publish and manage Web sites. But getting started is not always that easy. Although none of the concepts, taken in isolation, is difficult, putting them together coherently is not a trivial undertaking. In this article, Benoit Marchal provides step-by-step instructions to get you started. He shows how to organize your project in source, rules (stylesheets), and publishing directories. You’ll also get some practical tips on XML editors.
Through this column’s discussion forum, various mailing lists, and my consulting activity, I have noticed a growing interest in managing and publishing Web sites with XML and XSL. Even though many developers are familiar with XML and XSL, building a coherent system is no small task. In this article, I walk you through a practical, step-by-step example of how to create Web sites in XML.
I will illustrate the technique with a tool developed through this column, the XM plug-in for Eclipse (see Resources). The information provided here is useful even if you use another publishing environment, such as Apache Cocoon, but I find XM to be more user-friendly.
Why XML and XSL?
First, I’ll take a look at the benefits and costs of publishing with XML. You’ll find more than one reason to turn to XML — so many reasons, in fact, that I could not cover them all in this article. I only highlight the most frequently heard motivations:
It’s simpler. You might not think so when you get started, since you need to learn so many new tools, but once you have an XML solution in place your site management chores will be dramatically reduced.
It’s obsolescence-proof. XML separates the content (text and images) from the styling and the publishing, so you can change one independently from the other. For example, when you write new documents you concentrate on the writing and not on the colors, background, or navigation. Conversely, when you change the colors, background, or navigation, XML and XSL automatically update all your pages.
It’s an open standard. XML is supported by many commercial and open-source tools. Even if a vendor disappears, discontinues a product, or doesn’t support the features you need, you can be confident that there’s a replacement.
It’s easily adaptable. XML documents are like tiny databases, and the stylesheets are like scripts that query and manipulate the data from those document databases. These stylesheets are incredibly flexible — from straightforward publishing to computing new content such as tables of contents, indices, and more.
But what about the cost? You have to balance the costs of writing the stylesheets against their benefits. It pays to automate repetitive tasks, but don’t overdo it. If your site contains only a handful of pages, it is faster and cheaper to forgo XML. When the site reaches 10 to 20 pages, XML starts to pay for itself.
Personally, I like XML because it simplifies site management. Several years ago, I had to maintain a site that contained over 100 pages with a regular HTML editor. Believe me, that was no fun. Any change to the site, such as adding or removing sections, would take hours of copying and pasting links. Mistakes and broken links were frequent.
Not so with XML and XSL. Instead, stylesheets automate the boring, repetitive tasks, saving time and minimizing errors. Of course, XML is not the only solution. Some editors offer a template-based approach that is like a combination of XML and XSL. Still, I prefer XSL because it’s a scripting language (limited only by my imagination) and it is not tied to proprietary solutions.
You can apply stylesheets on the server, on the client, or on the webmaster’s desktop. The XM plug-in for Eclipse implements the desktop approach and, in batch mode, also works on servers. The plug-in automatically creates a static Web site (essentially a set of HTML pages) that is ready to upload to any server. Applying stylesheets on the webmaster’s desktop further increases XML’s flexibility because, unlike the alternatives, the generated site is compatible with every Web server and browser.
When to use dynamic content
XML complements JSP; it does not compete with it. A typical Web site is 95 percent static content (such as FAQs, images, and descriptions); the rest is more dynamic (such as forums, search forms, or shopping carts). XML excels at the static content, whereas JSP is ideal for the dynamic content.
To get the best of both worlds, I often generate JSP pages through XML and XSL. Again, the goal is to isolate the content from the publishing. To generate a JSP page instead of an HTML page with the XM plug-in, add the xr:extension="jsp" attribute to the xsl:output element.
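As a rough sketch of where that attribute goes, the fragment below shows an xsl:output element carrying xr:extension. The namespace URI bound to the xr prefix here is a placeholder and an assumption, not the real one; take the actual URI from the XM documentation or from the stylesheet that the project wizard generates.

```xml
<?xml version="1.0"?>
<!-- Sketch only: urn:example:xm-placeholder is NOT the real xr namespace.
     Replace it with the URI documented for the XM plug-in. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xr="urn:example:xm-placeholder">

  <!-- Ask XM to write the result as a .jsp file instead of an .html file -->
  <xsl:output method="html" xr:extension="jsp"/>

  <!-- Your templates go here -->

</xsl:stylesheet>
```

XSLT permits attributes from other namespaces on its own elements, which is why the xr prefix must be declared on the stylesheet before it can be used on xsl:output.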
What about servlets, JSP, PHP, or ASP? In other words, what about dynamically generated Web sites? Many shops have turned to dynamic hosting to gain the same benefits and simplify site maintenance. The code in the servlet or JSP page takes care of the presentation. How does this compare to XML and XSL?
In a nutshell, XML is more efficient. Dynamic sites tend to be slower because the server computes the page for every request. Those sites are also more difficult to set up and maintain, which unfortunately often translates into less stable sites. I know there are ways around all these problems, but you will find that XML delivers better results at a fraction of the cost.
To get started, download Eclipse and the XM plug-in for Eclipse (see Resources for links). XM is a project of the Working XML column that enhances Eclipse to support Web publishing with XML and XSL. XM is also available as standalone software that is ideal for batch processing. To prepare this column, I used Eclipse 2.1 and XM 0.9. Follow the instructions on the Eclipse and XM Web sites to install the software.
Launch Eclipse, then click Project from the File > New menu. In the dialog box that opens (see Figure 1), select ananas.org and XM Project, then click Next. Enter a project name, such as mysite, then click Finish.
Figure 1. Creating a new project
The new project appears in the navigator. When you open the project, you see that it contains three directories: publish, rules, and src, as shown in Figure 2. If you don’t see the navigator, click Navigator from the Window > Show View menu.
Figure 2. The new project in the navigator
The src (source) directory holds your XML documents as well as your images and other support files. The plug-in creates a sample file to get you started. You should edit it to insert your own content and add as many other XML files as needed. Every XML file in the src directory becomes an HTML page on the Web site.
The XML editors section introduces the tools to write XML documents. For the time being, just open the XML document in a text editor, such as Eclipse. The sample document uses a simplified version of DocBook with the following tags:
• article: The root of the document
• articleinfo: Contains bibliographical information
• sect1: A document section
• sect1info: Contains the section title
• title: May appear under articleinfo or sect1info as a title
• copyright: Holds the copyright information as one or more year tags and one holder tag
• simpara: A paragraph
• ulink: A hyperlink
You can use other tags, but you need to edit the stylesheet accordingly.
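To make the tag list concrete, here is a minimal sketch of a document that uses each of those tags once. The namespace URI, the text, and the file names are hypothetical; the sample file that the wizard creates is the authoritative reference. The url attribute on ulink follows standard DocBook usage and is assumed to carry over to the simplified vocabulary.

```xml
<?xml version="1.0"?>
<!-- Sketch only: urn:example:simplified-docbook is a made-up namespace.
     Copy the real one from the sample file in your src directory. -->
<article xmlns="urn:example:simplified-docbook">
  <articleinfo>
    <title>My first XM page</title>
    <copyright>
      <year>2003</year>
      <holder>Example Author</holder>
    </copyright>
  </articleinfo>
  <sect1>
    <sect1info>
      <title>Welcome</title>
    </sect1info>
    <simpara>This paragraph links to the
      <ulink url="about.xml">about page</ulink>.</simpara>
  </sect1>
</article>
```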
As I mentioned, the sample document is derived from DocBook. However, it uses a different namespace to indicate it’s not the real thing. DocBook is a standard vocabulary for technical documentation. It was originally developed by HaL Computer Systems and O’Reilly, and it is now maintained by OASIS, an international association of XML users.
You might find DocBook is a good choice to get started because it’s available, it works, it’s a standard, and it’s popular (mostly because it’s popular). Hundreds of existing XML tools work with DocBook — obviously, more tools on the market means less work for you.
Other popular XML vocabularies for Web sites include NewsML from the International Press Telecommunications Council (IPTC), the Web page DTD from Norman Walsh (who also maintains the DocBook vocabulary), and the Apache Cocoon DTD.
DTD or schema?
Should you use DTDs or schemas? In practice, it does not really matter. Both validate your document against a given vocabulary. Schemas offer more control than DTDs, but the new features have been introduced primarily for e-business and are less important in a publishing application. Since modern editors work equally well with DTDs and schemas, you can use whichever you like best.
The rules directory contains the stylesheets. Most Web sites need only one stylesheet. The XM plug-in for Eclipse applies the default.xsl stylesheet to every document, unless it is told otherwise. Consequently, if your site has only one stylesheet, save it as rules/default.xsl. If your site needs more stylesheets, save them under rules and add the following processing instructions to those documents that do not use the default:
<?xml-stylesheet href="listing.xsl" type="text/xsl"?>
Beware! The processing instruction needs both parameters: href points to the stylesheet (you can just enter the file name — the XM plug-in automatically looks under the rules directory), and type must have the text/xsl value. Also remember that the processing instruction applies to documents (in the src directory), not to stylesheets (in the rules directory).
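For illustration, the sketch below shows where the processing instruction sits inside a source document: in the prolog, after the XML declaration and before the root element. The file name price-list.xml, the namespace URI, and the title are hypothetical.

```xml
<?xml version="1.0"?>
<!-- Hypothetical src/price-list.xml: this document opts out of default.xsl -->
<?xml-stylesheet href="listing.xsl" type="text/xsl"?>
<article xmlns="urn:example:simplified-docbook">
  <articleinfo>
    <title>Price list</title>
  </articleinfo>
</article>
```

Note the straight quotation marks around both values. Curly quotes pasted from a word processor make the instruction unreadable to an XML parser.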
Last but not least is the publish directory, where the plug-in generates your Web site. Your next step is to upload the content of this directory to your Web server.
Another warning: You should never try to edit or modify the files in the publish directory. If you’re not happy with a Web page, change the XML document (in the src directory) or the stylesheet (in the rules directory), but never try to edit anything in the publish directory. Your goal is to automate publishing chores — editing the site directly defeats that goal. Furthermore, the plug-in may overwrite your changes the next time it regenerates the site.
Enhancing the site
As you have seen in the previous section, the project wizard creates a sample site. The next step is to populate the src directory and enhance the stylesheets. If you adopt a popular vocabulary, such as DocBook, you can find pre-existing stylesheets that should speed up the process.
If you are migrating from another publishing tool, it might not be possible to convert your content to XML overnight. Don’t worry — the plug-in publishes any HTML file that appears in the src directory, so you can convert to XML gradually.
Since you have to edit many XML documents, it pays to invest in a good XML editor. Your options are:
• A text editor, such as Eclipse. Text editors are appropriate for small corrections, but they are too cumbersome for serious editing.
• A pseudo-WYSIWYG XML editor such as XMetaL or XMLMind. These editors emulate a word processor and are ideal for serious editing.
• An RTF converter. These work with your word processor to generate XML and are perfect when you collect documents from many different authors who may not be familiar with XML.
Which option is best depends on the job at hand. I find it nearly impossible to write long documents with a text editor. Having to remember to balance open and close tags is a huge drag on my productivity. Most authors are uncomfortable with text editors for anything but the most basic corrections.
Pseudo-WYSIWYG editors offer the most comfortable environment and the author doesn’t need to worry about the XML syntax (see Figure 3). They are called “pseudo-WYSIWYG” because they use color, boldness, and other typographic attributes to emulate a word processor with XML content. If you have never tried a pseudo-WYSIWYG editor, do yourself a favour and download an evaluation version right away. Be warned that the editors don’t work right out of the box — they must be customized for a given vocabulary. Fortunately, most editors ship with native support for DocBook — another reason to adopt this popular vocabulary.
The last solution is to stick with your word processor and use an RTF converter to generate the XML document. In practice, you might find that the conversions are seldom trouble-free, but it’s a good solution if you collect documents from authors who are not familiar with XML. At Pineapplesoft, we maintain community Web sites where many authors contribute to the sites, and we use converters extensively.
Figure 3. A pseudo-WYSIWYG editor
Hyperlinks and URLs
As a bonus, the XM plug-in for Eclipse manages hyperlinks to prevent broken links. The plug-in works with so-called relative URLs (URLs that give the path relative to the current file). Listing 1 shows a relative URL example.
Listing 1. Relative URL
Absolute URLs, on the other hand, either include a host name or give the path from the root of the Web site. Listing 2 shows an absolute URL example.
Listing 2. Absolute URL
You should use relative hyperlinks as much as possible because the XM plug-in:
• Updates the file extension if needed, changing from .xml to .html wherever necessary
• Tests the link and issues a warning if it’s broken
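As a hypothetical illustration of the difference (these are not the article’s original listings), the fragment below shows one relative and two absolute links in the simplified DocBook vocabulary. The namespace URI, host name, and paths are invented for the example.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the namespace, host name, and paths are made up -->
<sect1 xmlns="urn:example:simplified-docbook">
  <sect1info>
    <title>Link styles</title>
  </sect1info>

  <!-- Relative: resolved from the current file, so the plug-in can check
       the target and change .xml to .html when it publishes -->
  <simpara><ulink url="products/index.xml">Products</ulink></simpara>

  <!-- Absolute, with a host name -->
  <simpara><ulink url="http://www.example.com/products/index.html">Products</ulink></simpara>

  <!-- Absolute, as a path from the root of the site -->
  <simpara><ulink url="/products/index.html">Products</ulink></simpara>
</sect1>
```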
Error messages and troubleshooting
The plug-in reports problems with your XML documents or your stylesheets in the XM console. If you don’t see the console, select XM Console from the Window > Show View menu. Read the error message carefully because it includes a description of the problem. The plug-in also lists the file and line where the error occurred (though it might be off by a line or two, so make sure to review the lines before and after the problem as well).
If the plug-in generates blank Web pages:
• Read error messages in the XM console carefully
• Make sure that the corresponding XML document is not empty
• Check that your stylesheet is appropriate for the document vocabulary, paying special attention to namespace
When something looks really weird, double-check the namespaces and the element names. Namespace mismatches account for 25 percent of all my students’ problems.
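A minimal sketch of what a namespace match looks like in practice appears below. It assumes a source document whose root is an article element in the hypothetical namespace urn:example:simplified-docbook; substitute the namespace that your documents actually declare.

```xml
<?xml version="1.0"?>
<!-- Sketch only: urn:example:simplified-docbook is a hypothetical URI.
     The URI bound to the db prefix must be identical, character for
     character, to the xmlns value in the source document. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:db="urn:example:simplified-docbook"
                exclude-result-prefixes="db">

  <xsl:output method="html"/>

  <!-- Matches article only when it is in the db namespace -->
  <xsl:template match="db:article">
    <html>
      <head>
        <title><xsl:value-of select="db:articleinfo/db:title"/></title>
      </head>
      <body>
        <xsl:apply-templates select="db:sect1"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="db:sect1">
    <h1><xsl:value-of select="db:sect1info/db:title"/></h1>
    <xsl:apply-templates select="db:simpara"/>
  </xsl:template>

  <xsl:template match="db:simpara">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Turns a ulink into an HTML hyperlink -->
  <xsl:template match="db:ulink">
    <a href="{@url}"><xsl:apply-templates/></a>
  </xsl:template>

</xsl:stylesheet>
```

In XSLT 1.0 an unprefixed name in a match pattern refers to an element in no namespace, so match="article" would not select the namespaced element and the template would never fire. A single mistyped character in either URI has the same effect.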
I conclude with a few tips on the XM plug-in.
Select Preferences from the Window menu. Under the Workbench category, choose the File Associations entry and associate an editor with *.xml and *.xsl files. You can associate one of the many Eclipse text editors or use an external editor (such as XMLMind). Double-clicking a file in the navigator then opens it in the associated editor.
To choose the editor when you open the file, right-click it in the navigator and choose the Open with menu.
Eclipse automatically generates the Web site when you save a document from within Eclipse. If you are using an external editor, such as XMLMind, select Rebuild Project from the Project menu. If the menu is grayed, click on the project in the navigator first.
The XM plug-in does not recognize changes to the stylesheets when it rebuilds the Web site. If it looks like the plug-in is ignoring your changes, follow these steps:
1. Right-click on the project name in the navigator (mysite in the above example), then choose Properties.
2. Select XM Properties and make sure that the option labeled “Run XM” performs a build is checked (see Figure 4).
3. Click OK.
4. Right-click on the project again, then choose “Run XM”.
Figure 4. Edit the properties
I hope this article has convinced you that publishing a Web site with XML and XSL is fun and offers many benefits. XSL is a powerful tool — and the XM plug-in further extends this power, so I could only scratch the surface in this article. To learn more about all the features available, I suggest you read the earlier articles in the Working XML column on developerWorks.
When you download the plug-in, you will find a copy of the ananas.org project. That’s the project I use to maintain the site and it demonstrates many advanced features. You might want to study this code as well. Finally, make sure you join the Working XML discussion forum.
• Participate in the discussion forum on this article.
• Check out some of the other installments in the "Working XML" column. Upcoming articles will cover more advanced options on Web publishing with XML and XSL.
• If you edit XML documents regularly, invest in a pseudo-WYSIWYG editor. If you have always edited XML documents with a text editor, do yourself a favour and download an evaluation version right away. Some of the most popular editors include XMetaL from Corel, the XMLMind Editor (available on many platforms), and x4o from i4i (a hybrid product that works with Word).
• For more on XSL stylesheets, try these two developerWorks tutorials: "Transforming XML documents" (May 2000) and "Developing XSL transformations with WebSphere Studio" (April 2002).
• Find more XML resources on the developerWorks XML zone.
• Get IBM WebSphere Studio, a suite of tools that automate XML development, both in Java and in other languages. It is closely integrated with the WebSphere Application Server, but can also be used with other J2EE servers.
• Find out how you can become an IBM Certified Developer in XML and related technologies.