Numerals are Visualizations, too

I like looking at annual reports as a good source of data visualizations. Much of the typical report is just feel-good decoration, and the graphs usually fall into that category with lots of shine but little content. However, what caught my eye in the Public Citizen 2008 annual report [PDF] was a table of numbers (the graphs aren’t too great either).

Misaligned figures

See anything odd about the numbers? The columns are not aligned vertically because of different digit widths; in particular, the “1” digit is very narrow. As a result, the Publications and Subscriptions value seems smaller than the Grants value at first glance, since the latter number is wider.

I thought it was a cardinal rule of font design that all digits were the same width. Unicode even has a “Figure Dash” character, which is a dash with the same width as the digit characters.
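As a quick check on the Unicode point, Python's standard unicodedata module reports the character names. Figure Dash is U+2012, and Unicode also defines a Figure Space (U+2007) intended to match digit width. The pad_figures helper is hypothetical, and it only aligns columns when the font's digits really do share one width, which is exactly what the report's font lacks.

```python
import unicodedata

FIGURE_DASH = "\u2012"   # dash meant to be as wide as a digit
FIGURE_SPACE = "\u2007"  # space meant to be as wide as a digit

def pad_figures(value: int, width: int) -> str:
    """Right-align a number with figure spaces (hypothetical helper)."""
    text = str(value)
    return FIGURE_SPACE * max(0, width - len(text)) + text

# Look up the official Unicode names of both characters.
names = [unicodedata.name(ch) for ch in (FIGURE_DASH, FIGURE_SPACE)]
print(names)
```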

I set out to find the font in question. First I sampled what I had on my Mac. I didn’t find the font, but I did find several fonts with digits of unequal width. Most of them were artful fonts like Comic Sans, but Georgia was also in that category.

Next I tried Identifont, a clever idea for identifying a font by asking a series of questions about the characters, such as what kind of bar the “G” has. It returned a few fonts that matched my answers, but none that looked like the report text. The “1” and the “t” are particularly distinctive.

Finally I realized that with the PDF available I could just examine the file in a text editor. Searching for the word “Font” a few times, I kept seeing the word “Knockout” nearby. Checking the characters of the Knockout font family on the foundry site shows a perfect match for the font called “No. 32 Junior Cruiserweight”.
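The text-editor search can be roughly automated. This is a minimal sketch, assuming the font names sit in uncompressed parts of the file under the usual /BaseFont or /FontName keys; names stored inside compressed streams would be missed. The sample bytes and font names are made up for illustration.

```python
import re

# PDF font dictionaries usually name the font via /BaseFont or /FontName.
FONT_KEY = re.compile(rb"/(?:BaseFont|FontName)\s*/([^\s/\[\]<>()]+)")

def font_names(pdf_bytes: bytes) -> list[str]:
    """Return the distinct font names found by a plain byte scan."""
    found = set()
    for match in FONT_KEY.finditer(pdf_bytes):
        name = match.group(1).decode("latin-1")
        # Embedded subsets carry a six-letter tag such as "ABCDEF+".
        found.add(re.sub(r"^[A-Z]{6}\+", "", name))
    return sorted(found)

# Made-up fragment standing in for a real PDF file.
sample = b"<< /Type /Font /BaseFont /ABCDEF+Example-Regular >> /FontName /Example-Bold"
sample_names = font_names(sample)
print(sample_names)
```

For a real file, something like font_names(open("report.pdf", "rb").read()) would do, where the path is a placeholder.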

So my theory about fonts was wrong, but I still hold that tables of numbers should never contain variable-width digits.

Dual-Scaled Graph Examples

Visualization experts say it’s generally a bad idea to put two different vertical axes on a single graph (see Dual-Scaled Axes in Graphs — Are They Ever the Best Solution? [PDF]) since it invites comparison of data on different scales. However, the treatment is still popular because of the two-in-one information density, and the distortion can be overcome with careful reading.

In the worst case, though, two completely different scales are carefully transformed to almost line up and suggest correlation. An insidious example from a few years ago was the presidential popularity versus price of gas graphs, which one writer believes show that there’s “clearly a correlation.”

It does look like a correlation, but that’s only because the scales have been transformed to follow the same long-term path, which could be done to any two generally linear data series. There are a couple of related spikes for the September 11 attacks and the Iraq War start which our eyes quickly pick up on, but otherwise the local ups and downs don’t match too well. These graphs disappeared shortly afterwards when the two trends obviously diverged (gas prices got better but Bush ratings didn’t).
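A small sketch of that effect, using two made-up series (not the real approval or gas price data): each is a straight-line trend plus its own unrelated wiggle. The overall correlation is strong because of the trends alone, while the month-to-month changes, which carry the local ups and downs, are essentially uncorrelated.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def diffs(xs):
    """Month-to-month changes, which remove the long-term trend."""
    return [b - a for a, b in zip(xs, xs[1:])]

# Hypothetical series: a falling trend and a rising trend,
# each with an independent periodic wiggle.
months = range(48)
approval = [90 - t + 3 * math.sin(2 * math.pi * t / 12) for t in months]
gas_price = [1.0 + 0.05 * t + 0.15 * math.cos(2 * math.pi * t / 8) for t in months]

trend_r = pearson(approval, gas_price)                # dominated by the trends
local_r = pearson(diffs(approval), diffs(gas_price))  # local ups and downs
print(round(trend_r, 2), round(local_r, 2))
```

Comparing changes rather than levels is one simple way to check whether two lines that track each other on a dual-scaled graph actually move together.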

What I really don’t understand, though, is this next example on employment data from My Budget 360.

leisure-vs-manufacturing1

I’ve never seen a dual-scaled graph where both scales were the same units and approximate range. What’s the point? It shifted the intersection point a little, but not enough to affect the thesis of the article. It does exaggerate the climb of the blue line (leisure employment). I can’t tell if this one is intentional distortion or just carelessness.

Burtin Antibiotic Illustrations

CHANCE magazine is running a contest to create the best illustration for a data set of the effectiveness of three antibiotics on sixteen strains of bacteria. Designer Will Burtin used this data set for a 1950s visualization.

With only five variables and sixteen observations, my first question is, “What’s wrong with just using a table?” The table in the contest description is even nicely laid out.
burtin-data

My second question is, “Best for whom?” Which illustration is best depends on the audience, which in this case might be doctors, researchers, statisticians or the general public among others.

The data shows Minimum Inhibitory Concentration (MIC, presumably in µg/ml) for each antibiotic and bacteria combination. Lower is better, indicating less antibiotic is needed to treat the bacteria. The MIC values vary widely from 0.001 to 1000, and I applied a logarithm transform for analysis, either on the data or on the graph. Besides nicely spreading out the data values, the log transformation may have a physical interpretation. If an antibiotic culture grows exponentially, then the log of the concentration is the time to grow it.
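A minimal sketch of the transform. The values below are illustrative stand-ins spanning the stated range, not the contest table, and the doubling-time helper is a hypothetical reading of the exponential growth remark.

```python
import math

# Illustrative MIC values spanning 0.001 to 1000 (not the real table).
mic_values = [0.001, 0.03, 1, 10, 870, 1000]

# On a log10 scale the six-orders-of-magnitude range becomes -3 to 3.
log_mic = [math.log10(v) for v in mic_values]

def time_to_reach(concentration, start=0.001, doubling_time=1.0):
    """Time for exponential growth from `start` to `concentration`.

    Assumes concentration doubles every `doubling_time` units, so the
    elapsed time is proportional to the log of the concentration ratio.
    """
    return doubling_time * math.log2(concentration / start)

print([round(v, 2) for v in log_mic])
```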

Exploring the data a little bit, the simplest visualization is a heat map, where every number is represented by a swatch of color. I don’t see much advantage over the table of numbers, except to quickly find extreme values or certain other patterns that the colors help with.

burtin-heat

Next, we might think from a researcher/statistician perspective and try to cluster the bacteria that react similarly to the antibiotics. Here’s a heat map and dendrogram resulting from a cluster analysis. The rows are colored by gram staining. It’s like the heat map above, but similar bacteria are grouped together (and the color scale is slightly different). The bacteria that are clustered close together might suggest a commonality for future research.
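The post does not say which clustering method produced the dendrogram, so the following is only a sketch of one common choice: agglomerative clustering with single linkage on the log10 MIC values. The strain names and coordinates are hypothetical.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in points]

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical log10 MIC triples: (penicillin, streptomycin, neomycin).
log_mic = {
    "strain A": (-3.0, -1.0, -2.0),
    "strain B": (-2.5, -1.2, -2.2),
    "strain C": (2.9, 0.3, -1.0),
    "strain D": (3.0, 0.5, -0.8),
    "strain E": (2.8, 1.0, 1.0),
}
groups = single_linkage(log_mic, n_clusters=2)
print(groups)
```

Recording the order of the merges, rather than stopping at a fixed count, is what a dendrogram draws.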

burtin-cluster

Since there are only three antibiotics, we can view the data as a 3D scatter plot. Here, the data markers correspond to the clusters.

burtin-3d-1

3D doesn’t work too well in static 2D media like this one since you need to be able to rotate it to see the structure. If you do rotate it, you can see that three of the clusters appear roughly in a straight line, so maybe there are really two different kinds instead of four. Here’s a view looking straight down the line.

burtin-3d-2

A scatter plot matrix shows all the 2D relationships better and is better for static presentation. It can’t show the 3D alignment of the three clusters, but you can get a hint of it in the neomycin versus penicillin panel.

burtin-scm

For my contest entry, I decided to go with the perspective of a 1950s doctor, with the idea that a doctor treating a patient doesn’t usually know what bacteria is causing the infection and may or may not have the results of a gram staining. With that in mind, my visualization shows the MIC for each antibiotic with the best dose for each scenario called out.

burtin-gb

The graph shows that penicillin is best for gram positive bacteria since all purple circles are below 1µg/ml for penicillin. Similarly, neomycin is best for gram negative bacteria and streptomycin is best if gram staining is unknown. A drawback of this graph is that the points are not labeled or connected. I tried a few ways to do that with labels and lines, but the graph just became too messy. If you need that much detail, you probably need the table of numbers.
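One way to formalize "best for each scenario" is to pick the antibiotic whose worst-case MIC within the group is lowest. That criterion is an assumption made here, and the rows are invented to illustrate the rule; they are not the contest data.

```python
def best_antibiotic(rows, gram=None):
    """Pick the drug whose worst-case (largest) MIC is lowest.

    rows: (strain, gram_stain, {drug: mic}) tuples.
    gram: "positive", "negative", or None when staining is unknown.
    """
    chosen = [mics for _, stain, mics in rows if gram in (None, stain)]
    worst = {drug: max(m[drug] for m in chosen) for drug in chosen[0]}
    return min(worst, key=worst.get), worst

# Invented MIC values, for illustration only.
rows = [
    ("strain P1", "positive", {"penicillin": 0.005, "streptomycin": 1, "neomycin": 40}),
    ("strain P2", "positive", {"penicillin": 0.1, "streptomycin": 10, "neomycin": 0.1}),
    ("strain N1", "negative", {"penicillin": 800, "streptomycin": 5, "neomycin": 2}),
    ("strain N2", "negative", {"penicillin": 100, "streptomycin": 2, "neomycin": 0.4}),
]

for scenario in ("positive", "negative", None):
    print(scenario, best_antibiotic(rows, scenario)[0])
```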

After doing all that, I found Burtin’s original visualization via a NY Times article.

I hope this isn’t what CHANCE is looking for. It has little communication value except to say “Look how cool I am!” At least all the data is present, so a meticulous reader can get what information he needs. The audience for this must be a hospital administrator who needs to feel like he’s getting his money’s worth with fancy visualizations. I think it is more a work of art than of communication.

Jon Peltier has a write-up of his contest entry. It has all the data of Burtin’s original in a much better rectangular structure.

Science Blogging Conference in January

The 2008 Science Blogging Conference is coming up January 19. I attended last year and will be co-leading a session this year on “Public Scientific Data”. My selfish interest in public data is wanting to try to improve the visualizations I see in science papers, but I can’t readily do it without the data. The other discussion leader, Jean-Claude Bradley, is a real scientist, though, running a real open science chemistry lab at Drexel.

I started a skeleton outline of some of the issues at the conference wiki, and I’m happy to see others have expanded it.

Chapel Hill 2007 Town Council Election Graphs

Here is a summary graph of the 2007 Chapel Hill Town Council elections [source data]. Each horizontal bar is a precinct, and the heights are proportional to the number of votes cast in each precinct. Each color is a different candidate, and the sub-bar widths are proportional to the percent of votes that each candidate got in the corresponding precinct. Precincts that had too few votes to show up well have been omitted (Country Club, Hillsborough, Mail-in, and Provisional).
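The layout described above can be sketched as a function that turns vote counts into rectangles. The precinct names, candidate names, and counts are placeholders, and the 100 by 100 drawing area is an arbitrary choice.

```python
def mosaic_rects(precincts, total_height=100.0, total_width=100.0):
    """Return (precinct, candidate, x, y, width, height) rectangles.

    Bar height is proportional to the precinct's total votes; each
    sub-bar width is proportional to the candidate's share there.
    """
    all_votes = sum(sum(v.values()) for v in precincts.values())
    rects, y = [], 0.0
    for name, votes in precincts.items():
        precinct_total = sum(votes.values())
        height = total_height * precinct_total / all_votes
        x = 0.0
        for candidate, count in votes.items():
            width = total_width * count / precinct_total
            rects.append((name, candidate, x, y, width, height))
            x += width
        y += height
    return rects

# Placeholder data, not the real election results.
example = {
    "Precinct 1": {"Candidate A": 300, "Candidate B": 100},
    "Precinct 2": {"Candidate A": 50, "Candidate B": 50},
}
rects = mosaic_rects(example)
for r in rects:
    print(r)
```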

Chapel Hill 2007 Town Council Graph

With so much information it’s hard to pick out too many details. However, it’s easy to see how Ward and Raymond did in each precinct since they’re on the edges and have a stable baseline.

The order of the candidates is by overall finish position, and the order of the precincts is by percent of total Czajkowski/Hill votes that went to Czajkowski. That makes it a little easier to see how those two candidates matched up in the battle for the fourth seat.
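The precinct ordering amounts to sorting by a head-to-head share. A minimal sketch with invented precinct names and vote counts; ascending order is an assumption, since the graph could equally run the other way.

```python
def czajkowski_share(votes):
    """Share of the combined Czajkowski/Hill votes that went to Czajkowski."""
    pair_total = votes["Czajkowski"] + votes["Hill"]
    return votes["Czajkowski"] / pair_total

# Invented counts, for illustration only.
precincts = {
    "Precinct A": {"Czajkowski": 120, "Hill": 180},
    "Precinct B": {"Czajkowski": 200, "Hill": 150},
    "Precinct C": {"Czajkowski": 90, "Hill": 90},
}

ordered = sorted(precincts, key=lambda name: czajkowski_share(precincts[name]))
print(ordered)
```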

Czajkowski got the final spot on the council in a close race, though there is quite a bit of variation among precincts. Not sure if the variation is due to political make-up of the precincts or to targeted campaigning.

I notice my precinct, Estes Hills, was one of the best for my friend Will Raymond. Maybe my yard signs helped!

Update: Here’s another view, highlighting the broad range of results for Czajkowski. Each circle is a precinct.

Chapel Hill 2007 Town Council Oneway Graph

And a similar one, this time with circle size being proportional to total precinct votes and color being set based on the Czajkowski/Hill ratio. Red precincts went for Czajkowski and blue for Hill. You can see some correlation among the incumbents and among the challengers.

Chapel Hill 2007 Town Council Bubble Plot

Income Share Graph

After my last graph analysis, a reader asked that I review the following graph from a New York Times columnist’s blog post on income disparity.

Income Share from Krugman

Overall, I think this graph is good. Everything is labeled, making the message clear, but there are several minor problems with the details.

  • The rotated year labels are hard to read, especially since they’re not on even multiples of 5s or 10s.
  • The data points and connecting lines are fighting for attention and saying the same thing. Either de-emphasize/remove the points or replace the connected lines with a smoother.
  • The grid lines are too bold — competing with the data marks.
  • The labels use inconsistent capitalization (and there appears to be a missing space in “classAmerica”). It’d be nice if all the labels were within the graph frame, too.

I don’t know enough about economics to comment on the currency of the graph’s message. One commenter suggested that the exclusion of capital gains diminishes the value of the data. I found the original paper [pdf], and the absence of capital gains seems to stem from the way the data was collected from income tax records though it is justified by calling capital gains “lumpy” and “volatile” and so presumably independent of long-term trends.

Here are a couple of attempts I made at reproducing the graph to fix the minor problems. I used GraphClick (great product) to get the data. (The original paper’s data only goes through 1998.)

Income Share (BW)

My first graph leaves off the labels, uses fainter data points and gridlines and adds a spline smoother to show trends. Labels can be added in a variety of ways to highlight sections of interest.
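The graph uses a spline smoother, which normally comes from a plotting or statistics package. As a dependency-free stand-in that shows the same idea of drawing a trend through noisy yearly points, here is a centered moving average; it is a different and cruder technique than a spline, and the window size is an arbitrary choice.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

print(moving_average([1, 2, 3, 4, 5], window=3))
```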

Income Share (annotated)

In my second graph I experiment with labeling ranges more specifically than with just a single arrow. With an arrow, it’s unclear whether it’s pointing to a single event or a section. The shading fixes that problem but adds other distractions that probably aren’t worth the price.

Petraeus Chart

I can’t decide if this final chart from General Petraeus’s report is the worst ever or the best ever.

Petraeus Recommended Force Reductions Chart

What makes it the worst chart? Mainly the axes.

  • unlabeled Y axis — you have to rely on the text of the report to know the Y axis is brigades.
  • no Y origin — it’s standard for bar charts to start at 0 since their values are encoded as lengths. Here, not only do the bars float, but the axis origin is not even labeled. Is 0 at the base of the bars or at the blue line or elsewhere?
  • nonlinear Y axis — the distance between 15 and 20 is noticeably smaller than the distance from 10 to 15 and from 5 to 10. The distance from the presumed 0 to 5 is also very different.
  • irregular Y axis labels — it’s unusual to have labels at 5, 7, and 10 instead of evenly spaced labels such as 5, 7.5, and 10, but at least the 7 is closer to 5 than 10.
  • star images around brigade ranges add useless clutter.
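The nonlinear-axis complaint can be checked numerically by measuring where each axis label sits and comparing units per pixel between neighboring labels. The pixel positions below are hypothetical measurements, not taken from the actual chart, and the 5% tolerance is an arbitrary choice.

```python
def units_per_pixel(ticks):
    """ticks: (label_value, pixel_height) pairs read off a chart."""
    ordered = sorted(ticks)
    return [
        (v1 - v0) / (p1 - p0)
        for (v0, p0), (v1, p1) in zip(ordered, ordered[1:])
    ]

def is_linear(ticks, tolerance=0.05):
    """True when every interval has nearly the same units per pixel."""
    rates = units_per_pixel(ticks)
    return max(rates) - min(rates) <= tolerance * max(rates)

# Hypothetical measurements: the 15-to-20 interval is squeezed.
squeezed = [(5, 40), (10, 100), (15, 160), (20, 200)]
even = [(0, 0), (5, 50), (10, 100)]
print(is_linear(squeezed), is_linear(even))
```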

What makes it the best chart? The purpose of a visualization is to communicate a message, and this chart somehow communicates the paradoxical message, “we have a detailed plan, but we just don’t know the details.” The definite bars and dated call-outs show certainty while the fuzzy axes and question marks show uncertainty. So in a sense, the chart communicates its message perfectly.

However, there’s something unsettling about the duplicity of the chart. The normal way of showing uncertainty is with confidence intervals, showing the lower and upper limits with 95% confidence. That’s hard to do with stacked bar charts like these, so I would split the roles into two kinds, Overwatch and Combat (Partnering and Leading). These two summary roles could be plotted separately each with confidence bands.
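For reference, a 95% interval of the kind mentioned can be sketched as follows. The report provides no estimates to work from, so the numbers here are entirely hypothetical, and the normal approximation is an assumption; with this few values a t-based interval would be somewhat wider.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(samples, level=0.95):
    """Normal-approximation confidence interval for the mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    center = mean(samples)
    margin = z * stdev(samples) / math.sqrt(len(samples))
    return center - margin, center + margin

# Hypothetical estimates of brigades in one role on one date.
estimates = [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.5]
low, high = confidence_interval(estimates)
print(round(low, 2), round(high, 2))
```

Computing one such pair of limits per date and per role gives the bands that would replace the question marks.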