Archive for the ‘Graphs’ Category

Science Blogging Conference in January

Thursday, November 29th, 2007

The 2008 Science Blogging Conference is coming up January 19. I attended last year and will be co-leading a session this year on “Public Scientific Data”. My selfish interest in public data is that I want to improve the visualizations I see in science papers, which I can’t readily do without the data. The other discussion leader, Jean-Claude Bradley, is a real scientist, though, running a real open science chemistry lab at Drexel.

I started a skeleton outline of some of the issues at the conference wiki, and I’m happy to see others have expanded it.

Chapel Hill 2007 Town Council Election Graphs

Wednesday, November 7th, 2007

Here is a summary graph of the 2007 Chapel Hill Town Council elections [source data]. Each horizontal bar is a precinct, and the heights are proportional to the number of votes cast in each precinct. Each color is a different candidate, and the sub-bar widths are proportional to the percent of votes that each candidate got in the corresponding precinct. Precincts that had too few votes to show up well have been omitted (Country Club, Hillsborough, Mail-in, and Provisional).

Chapel Hill 2007 Town Council Graph

With so much information it’s hard to pick out too many details. However, it’s easy to see how Ward and Raymond did in each precinct since they’re on the edges and have a stable baseline.

The order of the candidates is by overall finish position, and the order of the precinct is by percent of total Czajkowski/Hill votes that went to Czajkowski. That makes it a little easier to see how those two candidates matched up in the battle for the fourth seat.
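The precinct ordering described above can be sketched in a few lines of Python. This is only an illustration: the precinct names and vote counts are made up, and `head_to_head_share` is a hypothetical helper, not something taken from the source data.

```python
def head_to_head_share(votes_a: int, votes_b: int) -> float:
    """Fraction of the combined A/B votes that went to candidate A."""
    total = votes_a + votes_b
    return votes_a / total if total else 0.0


# Hypothetical (Czajkowski, Hill) vote counts per precinct, for illustration only.
precincts = {
    "Precinct A": (120, 80),
    "Precinct B": (50, 150),
    "Precinct C": (90, 90),
}

# Sort precincts by Czajkowski's share of the combined Czajkowski/Hill votes.
ordered = sorted(precincts, key=lambda name: head_to_head_share(*precincts[name]))
```

Sorting on the two-candidate share rather than on raw counts keeps large and small precincts comparable.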

Czajkowski got the final spot on the council in a close race, though there is quite a bit of variation among precincts. Not sure if the variation is due to political make-up of the precincts or to targeted campaigning.

I notice my precinct, Estes Hills, was one of the best for my friend Will Raymond. Maybe my yard signs helped!

Update: Here’s another view, highlighting the broad range of results for Czajkowski. Each circle is a precinct.

Chapel Hill 2007 Town Council Oneway Graph

And a similar one, this time with circle size being proportional to total precinct votes and color being set based on the Czajkowski/Hill ratio. Red precincts went for Czajkowski and blue for Hill. You can see some correlation among the incumbents and among the challengers.

Chapel Hill 2007 Town Council Bubble Plot

Income Share Graph

Monday, September 24th, 2007

After my last graph analysis, a reader asked that I review the following graph from a New York Times columnist’s blog post on income disparity.

Income Share from Krugman

Overall, I think this graph is good. Everything is labeled, making the message clear, but there are several minor problems with the details.

  • The rotated year labels are hard to read, especially since they’re not on even multiples of 5s or 10s.
  • The data points and connecting lines are fighting for attention and saying the same thing. Either de-emphasize/remove the points or replace the connected lines with a smoother.
  • The grid lines are too bold — competing with the data marks.
  • The labels use inconsistent capitalization (and there appears to be a missing space in “classAmerica”). It’d be nice if all the labels were within the graph frame, too.

I don’t know enough about economics to comment on the currency of the graph’s message. One commenter suggested that the exclusion of capital gains diminishes the value of the data. I found the original paper [pdf], and the absence of capital gains seems to stem from the way the data was collected from income tax records, though it is justified by calling capital gains “lumpy” and “volatile” and so presumably independent of long-term trends.

Here are a couple of attempts I made at reproducing the graph to fix the minor problems. I used GraphClick (great product) to get the data. (The original paper’s data only goes through 1998.)

Income Share (BW)

My first graph leaves off the labels, uses fainter data points and gridlines and adds a spline smoother to show trends. Labels can be added in a variety of ways to highlight sections of interest.

Income Share (annotated)

In my second graph I experiment with labeling ranges more specifically than with just a single arrow. With an arrow, it’s unclear whether it’s pointing to a single event or a section. The shading fixes that problem but adds other distractions that probably aren’t worth the price.

Petraeus Chart

Sunday, September 16th, 2007

I can’t decide if this final chart from General Petraeus’s report is the worst ever or the best ever.

Petraeus Recommended Force Reductions Chart

What makes it the worst chart? Mainly the axes.

  • unlabeled Y axis — you have to rely on the text of the report to know the Y axis is brigades.
  • no Y origin — it’s standard for bar charts to start at 0 since their values are encoded as lengths. Here, not only do the bars float, but the axis origin is not even labeled. Is 0 at the base of the bars or at the blue line or elsewhere?
  • nonlinear Y axis — the distance between 15 and 20 is noticeably smaller than the distance from 10 to 15 and from 5 to 10. The distance from the presumed 0 to 5 is also very different.
  • irregular Y axis labels — it’s unusual to have labels at 5, 7, and 10 instead of evenly spaced labels such as 5, 7.5, and 10, but at least the 7 is closer to 5 than 10.
  • star images around brigade ranges add useless clutter.

What makes it the best chart? The purpose of a visualization is to communicate a message, and this chart somehow communicates the paradoxical message, “we have a detailed plan, but we just don’t know the details.” The definite bars and dated call-outs show certainty while the fuzzy axes and question marks show uncertainty. So in a sense, the chart communicates its message perfectly.

However, there’s something unsettling about the duplicity of the chart. The normal way of showing uncertainty is with confidence intervals, showing the lower and upper limits with 95% confidence. That’s hard to do with stacked bar charts like these, so I would split the roles into two kinds, Overwatch and Combat (Partnering and Leading). These two summary roles could be plotted separately, each with confidence bands.

Problem 156 Graphs

Sunday, June 17th, 2007

Problem 156 at Project Euler is one of the few without a specified limit. That is, most problems might ask for something like the sum of all solutions to an equation that are less than a billion. This one just asked for the sum of all solutions. When I first solved it, I just used the sum of solutions up to a trillion, which was good enough. Later I made some plots to help understand why there are no higher solutions.

Mild spoilers below if you’re thinking about trying the problem.

The problem is to find the solutions to f(x) = x where f(x) is the number of times a given digit, say 4, appears in all the numerals corresponding to the numbers from 1 to x. The equivalent problem is to find where f(x) - x = 0. The plots below show f(x) - x at different scales.
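For readers who want to experiment, here is a minimal sketch of f in Python. It assumes the digit is between 1 and 9 and counts position by position; it is not necessarily the implementation behind the plots below, and the names `count_digit` and `excess` are my own.

```python
def count_digit(x: int, d: int) -> int:
    """Count how many times digit d (1-9) appears in the numerals 1..x."""
    count = 0
    power = 1  # place value being examined: 1, 10, 100, ...
    while power <= x:
        higher = x // (power * 10)    # digits to the left of this place
        current = (x // power) % 10   # digit at this place
        lower = x % power             # digits to the right of this place
        if current > d:
            count += (higher + 1) * power
        elif current == d:
            count += higher * power + lower + 1
        else:
            count += higher * power
        power *= 10
    return count


def excess(x: int, d: int) -> int:
    """f(x) - x: zero exactly where x is a solution for digit d."""
    return count_digit(x, d) - x
```

Because each call only loops over the digits of x, `excess` can be evaluated at widely spaced points without counting through every number in between.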

The first scale shows f(x) - x over the whole range I looked at.

math156-1.png

The second scale shows where the function really takes off and why. For each 1e10 range there seem to be about as many 4s as numbers (the function keeps dipping to around 0) until it gets to 4e10, when each new number in the 4e10 to 5e10 range has at least one 4 and sometimes more, so the function really takes off, especially at 4.4e10.

math156-2.png

Zooming in more shows a fractal-like nature to the function.

math156-3.png

The varying density of the points is because my implementation makes fewer evaluations when further away from the origin.

math156-4.png

Casualties Graph

Wednesday, April 25th, 2007

Today’s Raleigh News & Observer ran an article about increased casualties within the 82nd Airborne Division, which is based here in North Carolina. There was a graph showing casualties by year, with the 2007 total being incomplete. A reader objected to the incomplete year being shown next to the complete years, and editor Ted Vaden thoughtfully opened a Reader’s Corner blog item to discuss ways the paper might have done a better job with the visualization.

I made a suggestion and later found some data at icasualties.org (which doesn’t quite agree with the N&O data from the Associated Press) to use to try out a few ideas.

First, this was my suggested idea, with one set of bars for the partial-year values and another set of bars for the full-year values. Here I overlaid the two since one is a subset of the other. You can at least see that the partial-year total is not indicative of the full-year total.

casualties1.png

Next is the same thing with the bars side-by-side.

casualties2.png

Another commenter suggested a view by month, which is revealing in showing clumps of dangerous periods for this division. The graph still has the problem that the bar for the final partial month carries the same weight as the other months. And the graph is showing a lot of data just to support a newspaper story.

casualties3.png

Given the sparseness of the data, I tried clumping the data into trimesters of 4 months each. Not bad, but a trimester is not a common division of the year, which adds a little work to decipher the date axis.

casualties4.png

Another idea is to look at average casualties per month for each year, but this has a more subtle partial data issue since 2007 has fewer values to average and so is less representative of the year. This is really equivalent to projecting the partial 2007 values for the full year, which the editor intentionally avoided.
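The equivalence between a per-month average and a full-year projection is just a constant factor of 12, as this small Python sketch shows. The function names and the numbers in the example are hypothetical and not taken from the casualty data.

```python
def monthly_average(total: float, months_observed: int) -> float:
    """Average count per month over the months observed so far."""
    return total / months_observed


def full_year_projection(total: float, months_observed: int) -> float:
    """Naive projection: assume the remaining months match the average."""
    return monthly_average(total, months_observed) * 12


# Hypothetical partial year: 10 casualties in the first 4 months.
partial_average = monthly_average(10, 4)
partial_projection = full_year_projection(10, 4)
```

Since every year's average is scaled by the same factor, a bar chart of monthly averages has the same shape as a chart of projected full-year totals.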

casualties5.png

Finally, here’s a more scientific view, showing the monthly values with a trend line (spline smoother) overlaid.

casualties6.png

Thanks, Steve, for pointing me to this issue.

Heat Map Valentine

Wednesday, February 14th, 2007

Heat Map Valentine Heart

This is a heat map graph of a mathematically generated data set with a little noise added for artistic effect. Next year I’ll work on anti-aliasing the edges.