Tip: Convert from HTML to XML with HTML Tidy
By Benoit Marchal, 2003-12-16
Tidying Up
Obviously, the first step is to download and install HTML Tidy (which you'll find in Resources). HTML Tidy is available on most platforms, including Windows, Linux, and MacOS. The default executable is a command-line tool, but GUI versions are available for Windows and MacOS.
To run HTML Tidy, open a terminal and issue the following command:
    tidy -asxhtml -numeric < index.html > index.xml
That's it! HTML Tidy immediately converts index.html into index.xml. HTML Tidy will print messages that highlight issues with the original HTML document during the conversion. In most cases, you can safely ignore these messages.
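When the conversion has to run over many files, the same invocation can be scripted. The following is a minimal sketch, assuming the executable is installed under the name tidy and is on the PATH. The helper names tidy_command and convert are hypothetical, and the two options are the ones described below. HTML Tidy's exit status is 0 for a clean run, 1 when it printed warnings, and 2 when it found errors, so the sketch treats only errors as a failure.

```python
import shutil
import subprocess


def tidy_command():
    """Return the argument list for an XHTML conversion (hypothetical helper)."""
    return ["tidy", "-asxhtml", "-numeric"]


def convert(src_path, dst_path):
    """Feed src_path to tidy on standard input and write its standard output to dst_path."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Equivalent to: tidy -asxhtml -numeric < src_path > dst_path
        result = subprocess.run(
            tidy_command(), stdin=src, stdout=dst, stderr=subprocess.PIPE
        )
    # Exit status 1 only means warnings were printed; 2 means errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.returncode


# Usage (requires tidy to be installed):
#     convert("index.html", "index.xml")
```

The diagnostic messages arrive on standard error, which is why they never end up in the converted file.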
HTML Tidy runs as a filter: it reads the document from standard input and prints the result to standard output. The redirection operators (< and >) allow you to work with files. By default, HTML Tidy produces a clean HTML page, but you can set two options to output XML instead:
-asxhtml outputs XHTML documents instead of HTML.
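-numeric outputs numeric character references instead of named HTML entities. For example, &icirc; is replaced with &#238;. This matters because XML predefines only five named entities (&amp;, &lt;, &gt;, &quot;, and &apos;).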
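To make the effect of -numeric concrete, here is a small sketch of the same substitution. It illustrates the idea only and is not HTML Tidy's implementation. The function name to_numeric_references is hypothetical, and leaving the XML predefined entities untouched is an assumption made for this example.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; they are left as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric_references(text):
    """Replace named HTML entities with decimal numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown entities unchanged
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


print(to_numeric_references("ma&icirc;tre &amp; &eacute;l&egrave;ve"))
# prints: ma&#238;tre &amp; &#233;l&#232;ve
```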
XHTML keeps the HTML vocabulary (the familiar <p>, <b>, and <a> tags, for example), but the syntax is XML, so it merges nicely into an XML workflow.
The main differences between HTML and XHTML are:
Elements must have a closing tag, such as </p> for <p>, unless they are empty elements.
Empty elements follow the XML syntax: <br /> instead of <br>.
Attribute values are always quoted (<a href="http://www.marchal.com"> instead of <a href=http://www.marchal.com>).
Listing 2 is the file that HTML Tidy produces when Listing 1 is provided as input. As you can see, it is a valid XML document, and it takes surprisingly little work to produce it.
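These syntax differences are what decide whether an XML parser accepts a document at all. The sketch below runs two hand-written fragments (illustrative samples, not taken from the listings) through Python's standard XML parser. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML habits: unclosed <br> and an unquoted attribute value.
html_style = "<p>First line<br>second line <a href=http://www.marchal.com>link</a></p>"

# The same content written with XHTML syntax.
xhtml_style = '<p>First line<br />second line <a href="http://www.marchal.com">link</a></p>'

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```

Only the second fragment can be handed to XML tools such as an XSLT processor, which is why the conversion comes first.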
First published by IBM developerWorks
