What do developers need to know about the various approaches to semantic transparency?
The running theme of this column has been semantic transparency: the ability to correctly interpret the contents of XML documents. Semantic transparency might be the most important aspect of XML modeling. This article is the first in a series that reviews the many different approaches to semantic transparency and discusses what they mean to developers using XML.
This is the 30th installment of the Thinking XML column. It is almost exactly four years since the first article, and in retrospect I’m amazed by the flight of time and the march of events since then. The activity associated with XML has been tremendous, and I hope some of it has been apparent in the range of topics covered in this column. This activity has been especially interesting in the use of XML for knowledge management technologies, which is the focus of this column. In the first installment — in February 2001 — I discussed the goal of XML semantic transparency, which I think is the most important aspect of XML data modeling. Throughout this column, I’ve considered different approaches to semantic transparency. For this installment, I’ll kick off a short series of articles that provide an overview of some interesting technologies and techniques for semantic transparency, offering my opinion on the state of the art. I’ll break this series into three parts:
- Using informal descriptions in formal schemata (this article)
- Using schema standardization for top-down semantic transparency
- Using semantic anchors within schemata for bottom-up semantic transparency
Formal schemata, informal transparency
One common misconception about XML is that if you just define a schema, others will know how to process the XML instances and interoperate with your system. This may be true, depending on how the schema is authored, but generally not as a result of features of the schema language itself. Listing 1 is a sample RELAX NG schema (compact syntax) snippet:
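A sketch along those lines, using RELAX NG compact syntax with Dublin Core annotations (the annotation text and content model here are illustrative assumptions, not the article's exact code):

```
namespace dc = "http://purl.org/dc/elements/1.1/"

[ dc:description [ "A purchase order submitted by a customer" ] ]
element purchase-order {
  [ dc:description [ "The identifier assigned to this purchase order" ] ]
  attribute id { text }
  # TODO: flesh out the rest of the purchase-order content model
}
```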
For those not familiar with RELAX NG, the first line is a namespace declaration for Dublin Core, which is a popular vocabulary for metadata elements such as titles, descriptions, attributions, and other library-like properties. The schema defines an element named purchase-order, which carries a dc:description annotation; the annotation uses the namespace prefix declared earlier to indicate that its intent is to provide information that conforms to the Dublin Core description element. The schema also defines an attribute named id, with a plain text value, and this attribute definition has an annotation of its own, giving the intended meaning of the attribute. Finally, there is a comment. Notice that in this example I use annotations to provide information that's important to understanding the semantics of the schema, whereas I use the comment to convey incidental information. An example of a document that conforms to this schema is:
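A sketch of such a document (the id value is an illustrative assumption):

```xml
<purchase-order id="po-20050214-007"/>
```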
A second organization might define a similar schema for the same kind of record, with similar annotations but different element and attribute names.
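A hypothetical sketch of such a second schema (the element name, attribute name, and annotation text are assumptions for illustration):

```
namespace dc = "http://purl.org/dc/elements/1.1/"

[ dc:description [ "A purchase order submitted by a customer" ] ]
element po {
  [ dc:description [ "The number that identifies this purchase order" ] ]
  attribute number { text }
}
```

And a corresponding example document:

```xml
<po number="1234567"/>
```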
A person can look at the two schemata above and recognize from the annotations the equivalence of the purchase-order element in one to the po element in the other, and of the id attribute in one to the number attribute in the other. In this way, semantic transparency is achieved through informal means: a person has to use imprecise natural-language skills to make sense of the annotations, rather than some strict and unambiguous definition.
The problem is scalability of this process. The above example has simple, one-to-one mappings between data elements in the two vocabularies, and annotations that you can readily compare in a casual reading. More realistic situations involve more complex schemata with less predictable mappings and subtler differences in annotations and other such informal descriptions. In such cases, it might be very difficult to achieve semantic transparency through natural language schema annotations.
DTDs do not provide directly for annotation, but other popular schema languages do: RELAX NG, W3C XML Schema (WXS), and Schematron. In these languages, you can structure annotations themselves for machine consumption, providing more reliable routes to semantic transparency; I’ll cover some such techniques in future articles. Unfortunately, such techniques are not very well taught, discussed, or even analyzed, partly because many people involved with XML mistakenly believe that semantic transparency is not a pressing concern, or that it is something that XML in itself already provides for. In my own biased view, one particular distraction has interfered with the focus on semantic transparency.
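As an illustration of structured annotation, WXS provides a dedicated annotation construct: xs:documentation for human-readable prose and xs:appinfo for content aimed at machine consumption. A minimal sketch, carrying a Dublin Core description (names and annotation text assumed for illustration):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xs:element name="purchase-order">
    <xs:annotation>
      <!-- Prose for human readers -->
      <xs:documentation>A purchase order submitted by a customer</xs:documentation>
      <!-- Structured content that tools can pick out reliably -->
      <xs:appinfo>
        <dc:description>A purchase order submitted by a customer</dc:description>
      </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
      <xs:attribute name="id" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```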
A prominent red herring
XML experts usually recognize the weakness of informal descriptions like those described above for providing semantic transparency. The attempt to boost such facilities has always been part of the "what's next" discussion following the success of XML 1.0 — alongside linking, processing conventions, and other concerns. Early on, people tackling such problems split into several camps. In one prominent camp are veterans of mainstream programming languages and database management systems who think the best ways to formalize the underpinnings of XML documents are the common data typing techniques with which they are most familiar. They are accustomed to thinking of all semantics in terms of the primitive axioms that make up the static data typing of mainstream languages and database systems. They feel that if they can just bind XML tightly into familiar metaphors, then they can get a grip on modeling problems.
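A sketch along the lines of the article's Listing 3, revisiting the second organization's schema (again, the names and annotation text are assumptions):

```
namespace dc = "http://purl.org/dc/elements/1.1/"

[ dc:description [ "A purchase order submitted by a customer" ] ]
element po {
  [ dc:description [ "The number that identifies this purchase order" ] ]
  attribute number { xsd:int }
}
```

In the compact syntax, the xsd prefix is predefined to refer to the W3C XML Schema datatype library, so no extra declaration is needed.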
This time a WXS data type is assigned to the attribute, reflecting the schema designer's assumption that the purchase order number should be constrained to an integer. That is the meaning of xsd:int in the attribute definition. Clearly this addition barely scratches the surface of the problem of proper interpretation of the schema. To be fair, even data typing advocates do not claim it does, but they do claim that this added bit of precision gives processing tools the power to do other sorts of reasoning and analysis on the XML instances. I happen to think this claim is somewhat dubious, and I believe that it has siphoned much energy from the XML community towards a fruitless obsession with data types. This energy might be more usefully directed towards the problem of semantic transparency.
A more direct problem is that when people reflexively use data types, they often end up reducing flexibility in unanticipated ways. As an example, if Zenith Organization, using the schema in Listing 3, wants to trade with Acme Organization, using the schema in Listing 1, there is now the additional complication that one schema sees PO numbers as integers, and the other sees them as plain old text. This mismatch is reflected in all the data-type-aware tools. Such mismatches are inevitable in any integration project, but in this case the gain from strict data typing does not measure up to the flexibility that is lost.
What does this mean to a developer? I don’t mean to argue that you should not use schema data types — just don’t use them as a reflex. Use them to mark very carefully considered constraints that you expect to make sense throughout the life of the system. And don’t get so preoccupied with data typing that you forget to consider how to clarify the more general semantics related to your XML vocabulary.
I myself have added the ability to infer data types from text patterns in XML nodes to the Amara XML toolkit, one of the XML processing libraries I develop for the Python programming language. I am careful to make this type inference optional, and I think it’s probably dangerous to use it as a cornerstone of any processing tool chain. I’ve also given users the capability to set up custom data types in a declarative way using Jeni Tennison’s Data Type Library Language (DTLL — see Resources). DTLL helps make more explicit the fact that data typing in XML is nothing more than a specialized interpretation of text. That is the crux of the matter: XML is text, and only text. Other layers such as data typing are mere interpretations of that text (and should be optional interpretations). The moment you lose sight of that, you’re in for all sorts of unforeseen complications.
Good annotations of schemata are very important, regardless of whether or not they lead to semantic transparency. It might even be enough to maintain a separate data dictionary document, of the sort familiar to database developers. For each term used in the schema, a data dictionary provides a description that informally fills in the semantics for that term. As you recognize the supremacy of text in XML, the importance of semantic transparency becomes clear. Since all XML processing is ultimately a matter of interpreting language, it is essential to find ways to reduce the ambiguity of that interpretation. If you have any thoughts on schemata, schema annotations, data typing, or related topics, please share them by posting on the Thinking XML discussion forum.