Archive for March, 2006

Good Forsythia — Bad Forsythia

Wednesday, March 22nd, 2006

Natural forsythia

It’s great to see the forsythia plants blooming, but some are better than others. The photo above is a good forsythia, and the photo below is a bad forsythia. Forsythia grow and bloom on long shoots and aren’t meant to be shaped like boxwoods. Please prune with care.

Badly pruned forsythia

Reconstituted XML Schema Graphs

Tuesday, March 14th, 2006

The paper “Analysis of XML schema usage” provides a glimpse at some interesting data for a group of schemas that the authors analyzed. Unfortunately, it’s only a glimpse, as the data is not provided and the summarizing graphs are generally lacking. And my correspondence with one of the authors indicates that none of the data is available.


Original graph of XML Schema Sizes

The first graph, shown here in reduced form, is especially inappropriate. The authors use a scatterplot to show the distribution for schema sizes. To read it, you have to count dots within each horizontal division, as described in the notes for the plot.

Schema Size Histogram

Not to be deterred, I reverse-engineered the data from the graph and regraphed it as a histogram, a boxplot and a smoothed density curve, which are all better than a scatterplot for analyzing a distribution of one variable. Unfortunately, JMP doesn’t handle log axes for histograms, so I had to graph the log of the size instead of the size. The graphs in the paper were obviously made with Excel, and maybe it has the same deficiency. The paper uses the original graph to conclude that the bulk of the schemas have sizes in the range of 10 KB to 10 MB, or 10^1 KB to 10^4 KB, though the histogram helps tighten that to the range 10^1.5 KB to 10^3.5 KB, for what it’s worth.
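For anyone without JMP, the log-of-size workaround is easy to sketch in plain Python: take log10 of each size and count values in equal-width bins. The sizes below are made-up illustrative values, not the reconstituted data, and the half-decade bin width and the name `log_histogram` are my own assumptions.

```python
import math
from collections import Counter

# Hypothetical schema sizes in KB (illustrative only, not the reconstituted data).
sizes_kb = [12, 35, 48, 90, 150, 220, 400, 650, 900, 1500, 2800, 7000]

BIN_WIDTH = 0.5  # assumed bin width, in log10 units (half a decade)


def log_histogram(values, bin_width=BIN_WIDTH):
    """Count values in equal-width bins of log10(value).

    Returns a dict mapping each bin's lower edge (in log10 units) to its count.
    """
    counts = Counter()
    for v in values:
        lower_edge = math.floor(math.log10(v) / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


hist = log_histogram(sizes_kb)

# Crude text histogram: one star per schema in each bin.
for edge, n in hist.items():
    print(f"10^{edge:.1f} to 10^{edge + BIN_WIDTH:.1f} KB: {'*' * n}")
```

Binning the logs gives the same picture as a histogram with a log axis, only with the axis labeled in exponents.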

Schema Size by LOC

The paper next shows a similar scatterplot (not shown here) of LOC and argues that the similarity of the plots verifies the high correlation between KB and LOC. Not that the conclusion is bad, but why not plot them against each other to show a correlation? The graph at right does just that, showing the fitted line on a log-log scale. Once again, it’s from the reconstituted data.
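A fitted line on a log-log scale amounts to ordinary least squares on the logs of both variables. Here is a minimal sketch of that idea, with the correlation coefficient thrown in. The (KB, LOC) pairs are invented for illustration, and `fit_loglog` is a hypothetical helper, not what JMP does internally.

```python
import math

# Hypothetical (size in KB, lines of code) pairs; illustrative only.
schemas = [(10, 260), (40, 950), (160, 4200), (640, 15500), (2560, 66000)]


def fit_loglog(pairs):
    """Least-squares fit of log10(loc) = intercept + slope * log10(kb).

    Returns (slope, intercept, r), where r is the Pearson correlation
    of the two log-transformed variables.
    """
    xs = [math.log10(kb) for kb, _ in pairs]
    ys = [math.log10(loc) for _, loc in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_loglog(schemas)
print(f"log10(LOC) = {intercept:.3f} + {slope:.3f} * log10(KB), r = {r:.4f}")
```

A slope near 1 would mean LOC is roughly proportional to KB, that is, a fairly constant number of bytes per line.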


Oh yeah, I guess I better provide the data to back up my plots; it’s in xsd_reconstituted.csv.



This is not the first time I could have used a graph scraper — is there such a beast? That is, a program that scans a graph and outputs a table of data that could have produced the graph.

Math Challenges Done

Sunday, March 5th, 2006

Maths Challenges Progress Graph

I finished the last of the mathschallenge.net math programming problems. Actually, it’s a temporary milestone since new problems are added every few weeks. At right is a graph of problems started per day with a LOESS smoother applied. The data are from the creation date of the program files, and the few problems that I solved without coding are not represented.
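For the curious, a LOESS-style smoother can be sketched in a few lines: at each point, fit a weighted straight line to the nearest neighbours using tricube weights, then take the line's value at that point. This is a simplified version (no robustness iterations) and is not necessarily what JMP computes; the daily counts and the `frac` value are made up for illustration.

```python
def loess(xs, ys, frac=0.5):
    """Smooth ys over xs with locally weighted linear regression.

    frac is the fraction of points used in each local fit. Weights are
    tricube: (1 - (d/h)^3)^3, where h is the distance to the k-th
    nearest point.
    """
    n = len(xs)
    k = max(3, int(round(frac * n)))  # neighbours per local fit
    k = min(k, n)
    smoothed = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0  # avoid a zero bandwidth
        ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(ws)
        mean_x = sum(w * x for w, x in zip(ws, xs)) / sw
        mean_y = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mean_x) * (y - mean_y)
                  for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(mean_y + slope * (x0 - mean_x))
    return smoothed


# Hypothetical "problems started per day" counts; illustrative only.
days = list(range(14))
started = [0, 2, 1, 3, 5, 4, 6, 2, 1, 0, 3, 4, 2, 1]

trend = loess(days, started, frac=0.5)
for d, s, t in zip(days, started, trend):
    print(f"day {d:2d}: started {s}, smoothed {t:.2f}")
```

A larger `frac` gives a smoother curve; a smaller one follows the daily bumps more closely.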

A lot of the problems involved combinatorial counting, so it helped that I had just been reading the excellent lecture notes from MIT’s Mathematics for Computer Science course.