Thinking XML: State of the Art in XML Modeling
By Uche Ogbuji, 2005-05-16
A prominent red herring
XML experts usually recognize that informal descriptions like those discussed above are a weak foundation for semantic transparency. Attempts to strengthen such facilities have been part of the "what's next" discussion ever since the success of XML 1.0 -- alongside linking, processing conventions, and other concerns. Early on, people tackling such problems split into several camps. One prominent camp consists of veterans of mainstream programming languages and database management systems who think the best way to formalize the underpinnings of XML documents is to apply the common data typing techniques with which they are most familiar. They are accustomed to thinking of all semantics in terms of the primitive axioms that make up the static data typing of mainstream languages and database systems. They feel that if they can just bind XML tightly into familiar metaphors, then they can get a grip on modeling problems.
A data typing proponent might want to touch up the schema in Listing 2 to look like the version in Listing 3.
Listing 3. Sample RELAX NG schema using WXS data types
This time a WXS data type is assigned to the attribute, reflecting the schema designer's assumption that the purchase order number should be constrained to an integer. That is the meaning of the line xsd:int. Clearly this addition barely scratches the surface of the problem of proper interpretation of the schema. To be fair, even data typing advocates do not claim it does, but they do claim that this added bit of precision gives processing tools the power to do other sorts of reasoning and analysis on the XML instances. I happen to think this claim is somewhat dubious, and I believe that it has siphoned much energy from the XML community towards a fruitless obsession with data types. This energy might be more usefully directed towards the problem of semantic transparency.
A more direct problem is that when people reflexively use data types, they often end up reducing flexibility in unanticipated ways. As an example, if Zenith Organization, using the schema in Listing 3, wants to trade with Acme Organization, using the schema in Listing 1, there is now the additional complication that one schema sees PO numbers as integers, and the other sees them as plain old text. This mismatch is reflected in all the data-type-aware tools. Such mismatches are inevitable in any integration project, but in this case the gain from strict data typing does not measure up to the flexibility that is lost.
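The mismatch is easy to reproduce in a few lines of Python. The following is only a sketch under stated assumptions: the element name `purchase-order`, the attribute name `id`, and the sample value are hypothetical rather than taken from the listings, and Python's built-in `int()` merely stands in for whatever a data-type-aware tool would do with `xsd:int`.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; the element and attribute names are
# assumptions made for illustration, not taken from the article's listings.
DOC = '<purchase-order id="00123"/>'


def read_as_text(xml_source):
    """Acme-style reading: the PO number is plain text."""
    return ET.fromstring(xml_source).get("id")


def read_as_int(xml_source):
    """Zenith-style reading: the PO number is forced into an integer."""
    return int(ET.fromstring(xml_source).get("id"))


text_value = read_as_text(DOC)   # the text exactly as it was written
int_value = read_as_int(DOC)     # the integer interpretation of that text
round_tripped = str(int_value)   # what Zenith would write back out

# The two partners no longer agree on the identifier's spelling.
identifiers_match = (round_tripped == text_value)
```

Once the value has passed through the integer interpretation, the leading zeros are gone, so a lookup keyed on the original text fails on the other side. A PO number such as `A-123` would be rejected outright by the integer reading, even though the text reading handles it without complaint.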
What does this mean to a developer? I don't mean to argue that you should not use schema data types -- just don't use them as a reflex. Use them to mark very carefully considered constraints that you expect to make sense throughout the life of the system. And don't get so preoccupied with data typing that you forget to consider how to clarify the more general semantics related to your XML vocabulary.
I myself have added the ability to infer data types from text patterns in XML nodes to the Amara XML toolkit, one of the XML processing libraries I develop for the Python programming language. I am careful to make this type inference optional, and I think it's probably dangerous to use it as a cornerstone of any processing tool chain. I've also given users the capability to set up custom data types in a declarative way using Jeni Tennison's Data Type Library Language (DTLL -- see Resources). DTLL helps make more explicit the fact that data typing in XML is nothing more than a specialized interpretation of text. That is the crux of the matter: XML is text, and only text. Other layers such as data typing are mere interpretations of that text (and should be optional interpretations). The moment you lose sight of that, you're in for all sorts of unforeseen complications.
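To make the idea of optional inference concrete, here is a minimal sketch of inferring a type from a text pattern. It is not Amara's API and not DTLL: the function name `interpret`, its `infer` flag, and the three patterns are hypothetical choices made for illustration. The point is only that the text is the default, and any typed reading is an interpretation the caller must ask for.

```python
import re
from datetime import date

# Hypothetical patterns for illustration; a real library would need a
# richer and more carefully specified set.
_INT = re.compile(r"[+-]?[0-9]+\Z")
_FLOAT = re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)\Z")
_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})\Z")


def interpret(text, infer=False):
    """Return the node text, optionally interpreted as a typed value.

    With infer=False (the default) the text is returned untouched,
    because XML content is text first. With infer=True the text is
    matched against a few patterns; anything unrecognized stays text.
    """
    if not infer:
        return text
    stripped = text.strip()
    if _INT.match(stripped):
        return int(stripped)
    if _FLOAT.match(stripped):
        return float(stripped)
    match = _DATE.match(stripped)
    if match:
        try:
            return date(*(int(part) for part in match.groups()))
        except ValueError:
            # Looks like a date but is not a valid one; keep the text.
            return text
    return text
```

Called as `interpret("42")`, the function returns the string unchanged; `interpret("42", infer=True)` returns an integer. A value that merely resembles a date, such as `2005-13-40`, falls back to the original text rather than raising an error, which keeps the typed layer from getting in the way of the text beneath it.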
First published by IBM DeveloperWorks
