
Thinking XML: State of the Art in XML Modeling

What do developers need to know about the various approaches to semantic transparency?

The running theme of this column has been semantic transparency: the ability to correctly interpret the contents of XML documents. Semantic transparency might be the most important aspect of XML modeling. This is the first in a series of articles that review the many different approaches to semantic transparency and discuss what they mean to developers using XML.

This is the 30th installment of the Thinking XML column. It is almost exactly four years since the first article, and in retrospect I’m amazed by the flight of time and the march of events since then. The activity associated with XML has been tremendous, and I hope some of it has been apparent in the range of topics covered in this column. This activity has been especially interesting in the use of XML for knowledge management technologies, which is the focus of this column. In the first installment — in February 2001 — I discussed the goal of XML semantic transparency, which I think is the most important aspect of XML data modeling. Throughout this column, I’ve considered different approaches to semantic transparency. For this installment, I’ll kick off a short series of articles that provide an overview of some interesting technologies and techniques for semantic transparency, offering my opinion on the state of the art. I’ll break this series into three parts:

  1. Using informal descriptions in formal schemata (this article)
  2. Using schema standardization for top-down semantic transparency
  3. Using semantic anchors within schemata for bottom-up semantic transparency

Formal schemata, informal transparency

One common misconception about XML is that if you just define a schema, others will know how to process the XML instances and interoperate with your system. This may be true, depending on how the schema is authored, but generally not as a result of features of the schema language itself. Listing 1 is a sample RELAX NG schema (compact syntax) snippet:

Listing 1. Sample RELAX NG schema using annotations to provide semantic clues



namespace dc = "http://purl.org/dc/elements/1.1/"

[ dc:description [ "General purpose purchase order for merchandise" ] ]
element purchase-order
{
    [ dc:description [ "Unique identifier for the purchase order" ] ]
    attribute id {
        text
    }
    # The rest of the schema here
}

For those not familiar with RELAX NG, the first line is a namespace declaration for Dublin Core, which is a popular vocabulary for metadata elements such as titles, descriptions, attributions, and other library-like properties. The bracketed dc:description annotation uses the namespace prefix declared earlier to indicate that the intent of the annotation is to provide information that conforms to the Dublin Core description element; it applies to the purchase-order element declared on the following line. The attribute definition declares an attribute named id with a plain text value, and it carries a bracketed annotation of its own, giving the intended meaning of the attribute. The line beginning with # is a comment. Notice that in this example I use annotations to provide information that's important to understanding the semantics of the schema, whereas I use the comment to convey incidental information.
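A minimal document that conforms to this schema might look like the following (the id value is purely illustrative):

<purchase-order id="A1234">
  <!-- The rest of the purchase order content here -->
</purchase-order>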

If Listing 1 is the purchase order schema that Acme Organization comes up with, then Zenith Organization, acting separately, might come up with the schema in Listing 2.

Listing 2. Sample RELAX NG schema similar to Listing 1



namespace dc = "http://purl.org/dc/elements/1.1/"

[ dc:description [ "Simple purchase order" ] ]
element po
{
    [ dc:description [ "Number for identifying the purchase order" ] ]
    attribute number {
        text
    }
    # The rest of the schema here
}

Notice that the annotations are similar, but the actual element and attribute names are different. A person can look at the two schemata above and recognize from the annotations the equivalence of the purchase-order element in one to the po element in the other, and of the id attribute in one to the number attribute in the other. In this way, semantic transparency is achieved through informal means: a person has to use imprecise natural language skills to make sense of the annotations, rather than some strict and unambiguous definition.
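A corresponding minimal document for Listing 2 might look like this (again, the attribute value is purely illustrative):

<po number="A1234">
  <!-- The rest of the purchase order content here -->
</po>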

The problem is the scalability of this process. The above example has simple, one-to-one mappings between data elements in the two vocabularies, and annotations that you can readily compare in a casual reading. More realistic situations involve more complex schemata with less predictable mappings and subtler differences in annotations and other such informal descriptions. In such cases, it might be very difficult to achieve semantic transparency through natural-language schema annotations.

DTDs do not provide directly for annotation, but other popular schema languages do: RELAX NG, W3C XML Schema (WXS), and Schematron. In these languages, you can structure annotations themselves for machine consumption, providing more reliable routes to semantic transparency; I'll cover some such techniques in future articles.
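As a sketch of what structured annotation can look like in WXS, the purchase-order declaration might carry the same Dublin Core description inside xs:appinfo, the annotation slot intended for machine consumption (the snippet below is illustrative, not a fragment of either organization's actual schema):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xs:element name="purchase-order">
    <xs:annotation>
      <xs:appinfo>
        <dc:description>General purpose purchase order for merchandise</dc:description>
      </xs:appinfo>
    </xs:annotation>
    <!-- The rest of the element declaration here -->
  </xs:element>
</xs:schema>

Unfortunately, such techniques are not very well taught, discussed, or even analyzed, partly because many people involved with XML mistakenly believe that semantic transparency is not a pressing concern, or that it is something that XML in itself already provides for. In my own biased view, one particular distraction has interfered with the focus on semantic transparency.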

A prominent red herring

XML experts usually recognize the weakness of informal descriptions like those described above for providing semantic transparency. The attempt to boost such facilities has always been part of the "what's next" discussion following the success of XML 1.0 — alongside linking, processing conventions, and other concerns. Early on, people tackling such problems split into several camps. In one prominent camp are veterans of mainstream programming languages and database management systems who think the best way to formalize the underpinnings of XML documents is through the common data typing techniques with which they are most familiar. They are accustomed to thinking of all semantics in terms of the primitive axioms that make up the static data typing of mainstream languages and database systems. They feel that if they can just bind XML tightly into familiar metaphors, then they can get a grip on modeling problems.

A data typing proponent might want to touch up the schema in Listing 2 to look like the version in Listing 3.

Listing 3. Sample RELAX NG schema using WXS data types



namespace dc = "http://purl.org/dc/elements/1.1/"

[ dc:description [ "Simple purchase order" ] ]
element po
{
    [ dc:description [ "Number for identifying the purchase order" ] ]
    attribute number {
        xsd:int
    }
    # The rest of the schema here
}

This time a WXS data type is assigned to the attribute, reflecting the schema designer’s assumption that the purchase order number should be constrained to an integer. That is the meaning of the line xsd:int. Clearly this addition barely scratches the surface of the problem of proper interpretation of the schema. To be fair, even data typing advocates do not claim it does, but they do claim that this added bit of precision gives processing tools the power to do other sorts of reasoning and analysis on the XML instances. I happen to think this claim is somewhat dubious, and I believe that it has siphoned much energy from the XML community towards a fruitless obsession with data types. This energy might be more usefully directed towards the problem of semantic transparency.

A more direct problem is that when people reflexively use data types, they often end up reducing flexibility in unanticipated ways. As an example, if Zenith Organization, using the schema in Listing 3, wants to trade with Acme Organization, using the schema in Listing 1, there is now the additional complication that one schema sees PO numbers as integers, and the other sees them as plain old text. This mismatch is reflected in all the data-type-aware tools. Such mismatches are inevitable in any integration project, but in this case the gain from strict data typing does not measure up to the flexibility that is lost.
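To make the mismatch concrete, suppose a purchase order identifier contains letters (a hypothetical value):

<purchase-order id="PO-2005-0042"/>
<po number="PO-2005-0042"/>

As far as the identifier is concerned, the first document satisfies Listing 1, where id is plain text; the second fails validation against Listing 3, because PO-2005-0042 is not a lexical form of xsd:int.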

What does this mean to a developer? I don’t mean to argue that you should not use schema data types — just don’t use them as a reflex. Use them to mark very carefully considered constraints that you expect to make sense throughout the life of the system. And don’t get so preoccupied with data typing that you forget to consider how to clarify the more general semantics related to your XML vocabulary.

I myself have added the ability to infer data types from text patterns in XML nodes to the Amara XML toolkit, one of the XML processing libraries I develop for the Python programming language. I am careful to make this type inference optional, and I think it’s probably dangerous to use it as a cornerstone of any processing tool chain. I’ve also given users the capability to set up custom data types in a declarative way using Jeni Tennison’s Data Type Library Language (DTLL — see Resources). DTLL helps make more explicit the fact that data typing in XML is nothing more than a specialized interpretation of text. That is the crux of the matter: XML is text, and only text. Other layers such as data typing are mere interpretations of that text (and should be optional interpretations). The moment you lose sight of that, you’re in for all sorts of unforeseen complications.

Wrap-up

Good annotations of schemata are very important, regardless of whether or not they lead to semantic transparency. As you recognize the supremacy of text in XML, the importance of semantic transparency becomes clear: since all XML processing is ultimately a matter of interpreting language, it is essential to find ways to reduce the ambiguity of that interpretation. It might even be enough to maintain a separate data dictionary document, of the sort familiar to database developers. For each term used in the schema, a data dictionary provides a description that informally fills in the semantics for that term.
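As a sketch, a data dictionary for the Acme vocabulary in Listing 1 might look like this (the format is illustrative):

Term             Kind        Description
purchase-order   element     General purpose purchase order for merchandise
id               attribute   Unique identifier for the purchase order

If you have any thoughts on schemata, schema annotations, data typing, or related topics, please share them by posting on the Thinking XML discussion forum.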

Resources

  • Participate in the discussion forum on this article.
  • Visit the Dublin Core Metadata Initiative, which maintains a metadata vocabulary influenced by library science. It is often used for stating resource properties such as titles, descriptions, authorship, copyright, and so on.
  • The author has complained about the obsession with data typing in XML technologies many times in the main community forums for XML experts. In "More on XML class warfare" (O’Reilly Developer Weblogs, January 2003), he summarizes the most important such thread that he’s been involved with, which stemmed from his article "XML class warfare" (Application Development Trends magazine, December 2002). He also touches on the tension between the different factions with regard to data types in schemata in "Battle of the Bulging Standards" (XML Journal, September 2002).
  • Check out the Amara XML toolkit, which is Python software for XML processing. Among many other features, it supports data type inferencing and Jeni Tennison’s Data Type Library Language (DTLL), which allows you to specify custom data types for XML.
  • Find more XML resources on the developerWorks XML zone, including previous installments of the Thinking XML column. If you have comments on this installment or any others in this column, please post them on the Thinking XML forum.
  • Browse a wide range of XML-related titles at the developerWorks Developer Bookstore.
  • Learn how you can become an IBM Certified Developer in XML and related technologies.

    About the Author:

    Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche.ogbuji@fourthought.com.
