Archive for the ‘XML’ Category

XML Schema Type Tables and Substitution Groups

Sunday, February 10th, 2008

XML Schema 1.1 was already running behind when I left the Working Group in 2004, and it’s still a work in progress. Though I no longer write XML tools, I try to keep up with the group’s activities and offer what I hope are useful comments on public working drafts. However, knowing the WG is so far behind schedule, I’m hesitant to make too many official comments, since each one must be addressed by the group, adding to the delay.

Many of my comments have been resolved recently in a relative flurry of activity. (The comment archive shows more activity last month than in any previous three-month period.) When a comment is resolved, the original poster can (silently) accept the resolution or appeal to the W3C director. I disagreed with the resolution of my comment on type tables and substitution groups, but I just registered my dissent and closed it rather than appeal; I trust the working group’s expertise over my casual interest.

Substitution groups have always been questionable in my mind. I’d prefer typed wildcards or, at least, an opt-in mechanism rather than opt-out for substitution groups to limit their unintended use.

Type tables are a big feature added late in the game, and they don’t seem to interact well with substitution groups. Type tables allow alternative types to apply to an element based on its context, such as an attribute value. I thought such context-based constraints should be in a separate layer, as is done with Schematron, but it seems like half the schema-dev questions are about how to impose such constraints within XML Schema, so I can understand why the Working Group would want to add them.

The problem, as I see it, is that type alternatives live as element declaration properties rather than within the type hierarchy. Substitution group members must have types in proper derivation relationships, but that only applies to the declared types, not the alternative types. So combining type tables with substitution groups can break the spirit of the derivation hierarchy, if not the letter of it.
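
As a minimal sketch of the situation described above (not taken from the original comment), the script below embeds a hypothetical XSD 1.1 fragment and simply parses it as XML to list each element's declared type next to its type alternatives. All element and type names are invented, the type definitions are omitted, and the script does no schema validation. Assume BaseA and Derived both derive from Base, and DerivedB derives from Derived but not from BaseA.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical XSD 1.1 fragment; names are invented and the
# complex type definitions are left out for brevity.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="head" type="Base">
    <xs:alternative test="@kind = 'a'" type="BaseA"/>
  </xs:element>
  <xs:element name="member" type="Derived" substitutionGroup="head">
    <xs:alternative test="@kind = 'a'" type="DerivedB"/>
  </xs:element>
</xs:schema>
"""


def summarize(schema_text):
    """Map each top-level element name to its declared type,
    substitution group head, and (test, type) alternatives."""
    root = ET.fromstring(schema_text)
    summary = {}
    for el in root.findall(f"{{{XS}}}element"):
        alternatives = [
            (alt.get("test"), alt.get("type"))
            for alt in el.findall(f"{{{XS}}}alternative")
        ]
        summary[el.get("name")] = {
            "declared": el.get("type"),
            "substitutionGroup": el.get("substitutionGroup"),
            "alternatives": alternatives,
        }
    return summary


info = summarize(SCHEMA)
for name, details in info.items():
    print(name, details)
```

Under the stated assumptions, the declared types line up (Derived derives from Base), but when kind is 'a' the head selects BaseA while the member selects DerivedB, and nothing relates those two types to each other.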

Reconstituted XML Schema Graphs

Tuesday, March 14th, 2006

The paper “Analysis of XML schema usage” provides a glimpse at some interesting data for a group of schemas that the authors analyzed. Unfortunately, it’s only a glimpse as the data is not provided and the summarizing graphs are generally lacking. And my correspondence with one of the authors indicates that none of the data is available.


[Figure: Original graph of XML Schema Sizes]

The first graph, shown here in reduced form, is especially inappropriate. The authors use a scatterplot to show the distribution for schema sizes. To read it, you have to count dots within each horizontal division, as described in the notes for the plot.

[Figure: Schema Size Histogram]

Not to be deterred, I reverse-engineered the data from the graph and regraphed it as a histogram, a boxplot and a smoothed density curve, which are all better than a scatterplot for analyzing a distribution of one variable. Unfortunately, JMP doesn’t handle log axes for histograms, so I had to graph the log of the size instead of the size. The graphs in the paper were obviously made with Excel, and maybe it has the same deficiency. The paper uses the original graph to conclude that the bulk of the schemas have sizes in the range of 10 KB to 10 MB, or 10^1 KB to 10^4 KB, though the histogram helps tighten that to the range 10^1.5 KB to 10^3.5 KB, for what it’s worth.
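
The workaround of graphing the log of the size can be sketched in a few lines of Python. This is only an illustration: the sizes are made-up values, not the reconstituted data, and the half-decade bin width is an assumption chosen to match ranges like 10^1.5 KB to 10^3.5 KB.

```python
import math
from collections import Counter


def log10_histogram(sizes_kb, bin_width=0.5):
    """Count values per bin of log10(size); each key is the
    lower edge of a bin, expressed as a power of ten."""
    counts = Counter()
    for size in sizes_kb:
        lower_edge = math.floor(math.log10(size) / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical schema sizes in KB (not the data from the paper).
sample_sizes_kb = [12, 40, 75, 150, 300, 900, 2500, 8000]
histogram = log10_histogram(sample_sizes_kb)

for lower_edge, count in histogram.items():
    print(f"10^{lower_edge:.1f} KB and up: {'#' * count}")
```

Binning on the log of the value gives the same picture a log-axis histogram would, with the bin edges read as exponents.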

[Figure: Schema Size by LOC]

The paper next shows a similar scatterplot (not shown here) of LOC and argues that the similarity of the plots verifies the high correlation between KB and LOC. Not that the conclusion is bad, but why not plot them against each other to show a correlation? The graph at right does just that, showing the fitted line on a log-log scale. Once again, it’s from the reconstituted data.
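
For readers who want to reproduce that kind of check, here is a rough sketch of a least-squares line fitted to log10(KB) versus log10(LOC), along with the correlation of the logged values. The numbers are synthetic (an assumed 25 lines per KB), not the reconstituted data, and this is plain ordinary least squares rather than whatever fit JMP actually performs.

```python
import math


def loglog_fit(xs, ys):
    """Fit log10(y) = slope * log10(x) + intercept by ordinary
    least squares; also return Pearson's r of the logged values."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic example: sizes in KB and an assumed 25 lines per KB.
kb = [10, 100, 1000, 5000]
loc = [25 * k for k in kb]
slope, intercept, r = loglog_fit(kb, loc)
print(slope, intercept, r)
```

A slope near 1 on the log-log scale means LOC grows roughly in proportion to KB, and the intercept gives the log of the lines-per-KB ratio.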


Oh yeah, I guess I better provide the data to back up my plots; it’s in xsd_reconstituted.csv.



This is not the first time I could have used a graph scraper — is there such a beast? That is, a program that scans a graph and outputs a table of data that could have produced the graph.

XML Schema Processing Diagrams

Monday, November 21st, 2005

During my brief tenure as editor of the XML Schema 1.1 Primer, I added a general description of schema processing and a couple of explanatory diagrams. Though I made them in the vector-based drawing program, OmniGraffle, I could only provide bitmap versions since the program’s vector format was proprietary. The latest version, however, adds an SVG export feature, and I’ve finally gotten around to posting the diagrams to the Schema comments list, in case the Working Group has any interest in using them.

The first diagram shows the common view of schema processing: one XML instance document validated against a schema to produce a true/false result, valid or invalid.

SVG Image



The second diagram shows the schema processing model more fully, where the schema is really a composition of zero or more schema documents and schema information from other sources, such as a repository, and the result is a “Post-Schema-Validation-Infoset” (PSVI). This PSVI is just an augmented version of the original XML instance document (its Infoset, actually), so that nearly every element now carries metadata indicating its type and validity, among other things. Taking the whole PSVI into consideration, a true/false flag is not really enough to describe the outcome of validation; it is even possible for a document to be partially valid.

SVG Image
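
To illustrate why a single flag falls short, here is a much simplified sketch of a PSVI-like tree. The class, field, element and type names are hypothetical; only the three validity values (valid, invalid, notKnown) are meant to mirror the PSVI's validity property, and a real PSVI carries far more information than this.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnnotatedElement:
    """A toy stand-in for an element information item in a PSVI."""
    name: str
    type_name: Optional[str]  # None when no type was assigned
    validity: str             # "valid", "invalid", or "notKnown"
    children: List["AnnotatedElement"] = field(default_factory=list)


def validity_summary(node):
    """Count how many elements in the tree have each validity value."""
    counts = Counter([node.validity])
    for child in node.children:
        counts.update(validity_summary(child))
    return counts


# Hypothetical outcome: one child failed and one was never assessed.
doc = AnnotatedElement("order", "OrderType", "invalid", [
    AnnotatedElement("customer", "CustomerType", "valid"),
    AnnotatedElement("item", "ItemType", "invalid"),
    AnnotatedElement("note", None, "notKnown"),
])

summary = validity_summary(doc)
print(dict(summary))
```

A plain true/false result would report this document as invalid and nothing more, while the per-element annotations show that part of it validated and part of it was never assessed.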

I haven’t been following the XML Schema WG lately, and I’m not even sure if the next version of the specification will have a Primer.

XML Schema Wiki

Monday, September 19th, 2005

When I was involved with the XML Schema Working Group, I sometimes suggested a wiki as a way to hash out some of the more difficult issues, but nothing ever came of it. I don’t know the circumstances of its creation, but now there is a public wiki space for XML Schema.

Sadly, no one has made any edits after the initial contributions. I guess the once-confusing topics have largely been handled by implementors so that most users don’t need to worry about them.

I finally got around to “putting my money where my mouth is” and contributing a page on Unique Particle Attribution. It took much more time than I expected and still needs some work, like how to disambiguate content models, but it’s something.

… I wonder what ever happened to that promising start-up whose mission was to resolve UPA violations … Was it nondeterminism.com? …