Archive for the ‘Graphs’ Category

Income Share Graph

Monday, September 24th, 2007

After my last graph analysis, a reader asked that I review the following graph from a New York Times columnist’s blog post on income disparity.

Income Share from Krugman

Overall, I think this graph is good. Everything is labeled, making the message clear, but there are several minor problems with the details.

  • The rotated year labels are hard to read, especially since they’re not on multiples of 5 or 10.
  • The data points and connecting lines are fighting for attention and saying the same thing. Either de-emphasize/remove the points or replace the connected lines with a smoother.
  • The grid lines are too bold — competing with the data marks.
  • The labels use inconsistent capitalization (and there appears to be a missing space in “classAmerica”). It’d be nice if all the labels were within the graph frame, too.

I don’t know enough about economics to comment on the currency of the graph’s message. One commenter suggested that the exclusion of capital gains diminishes the value of the data. I found the original paper [pdf], and the absence of capital gains seems to stem from the way the data was collected from income tax records, though the paper justifies it by calling capital gains “lumpy” and “volatile” and so presumably independent of long-term trends.

Here are a couple of attempts I made at reproducing the graph to fix the minor problems. I used GraphClick (great product) to get the data. (The original paper’s data only goes through 1998.)

Income Share (BW)

My first graph leaves off the labels, uses fainter data points and gridlines and adds a spline smoother to show trends. Labels can be added in a variety of ways to highlight sections of interest.

Income Share (annotated)

In my second graph I experiment with labeling ranges more specifically than with just a single arrow. With an arrow, it’s unclear whether it’s pointing to a single event or a section. The shading fixes that problem but adds other distractions that probably aren’t worth the price.

Petraeus Chart

Sunday, September 16th, 2007

I can’t decide if this final chart from General Petraeus’s report is the worst ever or the best ever.

Petraeus Recommended Force Reductions Chart

What makes it the worst chart? Mainly the axes.

  • unlabeled Y axis — you have to rely on the text of the report to know the Y axis is brigades.
  • no Y origin — it’s standard for bar charts to start at 0 since their values are encoded as lengths. Here, not only do the bars float, but the axis origin is not even labeled. Is 0 at the base of the bars or at the blue line or elsewhere?
  • nonlinear Y axis — the distance between 15 and 20 is noticeably smaller than the distance from 10 to 15 and from 5 to 10. The distance from the presumed 0 to 5 is also very different.
  • irregular Y axis labels — it’s unusual to have labels at 5, 7, and 10 instead of evenly spaced labels such as 5, 7.5, and 10, but at least the 7 is closer to 5 than 10.
  • star images around brigade ranges add useless clutter.

What makes it the best chart? The purpose of a visualization is to communicate a message, and this chart somehow communicates the paradoxical message, “we have a detailed plan, but we just don’t know the details.” The definite bars and dated call-outs show certainty while the fuzzy axes and question marks show uncertainty. So in a sense, the chart communicates its message perfectly.

However, there’s something unsettling about the duplicity of the chart. The normal way of showing uncertainty is with confidence intervals, showing the lower and upper limits with 95% confidence. That’s hard to do with stacked bar charts like these, so I would split the roles into two kinds, Overwatch and Combat (Partnering and Leading). These two summary roles could be plotted separately, each with confidence bands.

Problem 156 Graphs

Sunday, June 17th, 2007

Problem 156 at Project Euler is one of the few without a specified limit. That is, most problems might ask for something like the sum of all solutions to an equation that are less than a billion. This one just asked for the sum of all solutions. When I first solved it, I just used the sum of solutions up to a trillion, which was good enough. Later I made some plots to help understand why there are no higher solutions.

Mild spoilers below if you’re thinking about trying the problem.

The problem is to find the solutions to f(x) = x where f(x) is the number of times a given digit, say 4, appears in all the numerals corresponding to the numbers from 1 to x. The equivalent problem is to find where f(x) – x = 0. The plots below show f(x) – x at different scales.
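As a rough illustration of the function being plotted, here is a minimal Python sketch of f(x) for a nonzero digit. It is not my original implementation; the names `count_digit` and `count_digit_brute` are made up for this example. It counts position by position how often the digit appears in the numerals from 1 to x, and a brute-force version is included to check it on small inputs.

```python
def count_digit(n, d):
    """Count how many times digit d (1-9) appears in the numerals 1..n."""
    if not 1 <= d <= 9:
        raise ValueError("this sketch only handles digits 1 through 9")
    count = 0
    power = 1  # place value being examined: 1, 10, 100, ...
    while power <= n:
        higher = n // (power * 10)    # digits to the left of this position
        current = (n // power) % 10   # digit at this position
        lower = n % power             # digits to the right of this position
        if current > d:
            count += (higher + 1) * power
        elif current == d:
            count += higher * power + lower + 1
        else:
            count += higher * power
        power *= 10
    return count


def count_digit_brute(n, d):
    """Slow reference version: write out every numeral and count."""
    return sum(str(k).count(str(d)) for k in range(1, n + 1))


def g(x, d=4):
    """The plotted quantity f(x) - x; solutions are where this is 0."""
    return count_digit(x, d) - x
```

The position-by-position version takes time proportional to the number of digits in x, which is what makes it practical to evaluate f(x) – x out to very large x.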

The first scale shows f(x) – x for the whole range I looked at.

math156-1.png

The second scale shows where the function really takes off and why. In each range of 1e10 there seem to be about as many 4s as numbers (the function keeps dipping to around 0) until x reaches 4e10. From 4e10 to 5e10 every new number has at least one 4 and sometimes more, so the function climbs steeply, especially from 4.4e10.

math156-2.png

Zooming in more shows a fractal-like nature to the function.

math156-3.png

The varying density of the points is because my implementation makes fewer evaluations when further away from the origin.

math156-4.png

Casualties Graph

Wednesday, April 25th, 2007

Today’s Raleigh News & Observer ran an article about increased casualties within the 82nd Airborne Division, which is based here in North Carolina. There was a graph showing casualties by year, with the 2007 total being incomplete. A reader objected to the incomplete year being shown next to the complete years, and editor Ted Vaden thoughtfully opened a Reader’s Corner blog item to discuss ways the paper might have done a better job with the visualization.

I made a suggestion and later found some data at icasualties.org (which doesn’t quite agree with the N&O data from the Associated Press) to use to try out a few ideas.

First, this was my suggested idea, with one set of bars for the partial-year values and another set of bars for the full-year values. Here I overlaid the two since one is a subset of the other. You can at least see that the partial-year total is not indicative of the full-year total.

casualties1.png

Next is the same thing with the bars side-by-side.

casualties2.png

Another commenter suggested a view by month, which is revealing in showing clumps of dangerous periods for this division. The graph still has the problem that the bar for the final partial month carries the same weight as the other months. And the graph is showing a lot of data just to support a newspaper story.

casualties3.png

Given the sparseness of the data, I tried clumping the data into trimesters of 4 months each. Not bad but a trimester is not a common year division, adding a little work to decipher the date axis.

casualties4.png
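The trimester grouping above amounts to mapping months 1 to 4, 5 to 8, and 9 to 12 onto three bins per year. A minimal Python sketch follows; the function names and the monthly counts are invented for illustration and are not the icasualties.org data.

```python
from collections import Counter


def trimester_label(year, month):
    """Map a calendar month (1-12) to a label such as '2006-T2'."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return f"{year}-T{(month - 1) // 4 + 1}"


def bin_by_trimester(monthly_counts):
    """Sum {(year, month): count} into {trimester label: total}."""
    totals = Counter()
    for (year, month), count in monthly_counts.items():
        totals[trimester_label(year, month)] += count
    return dict(totals)


# Invented counts, only to show the shape of the input.
example = {(2006, 1): 2, (2006, 4): 1, (2006, 5): 3, (2006, 12): 4, (2007, 1): 5}
print(bin_by_trimester(example))
```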

Another idea is to look at average casualties per month for each year, but this has a more subtle partial data issue since 2007 has fewer values to average and so is less representative of the year. This is really equivalent to projecting the partial 2007 values for the full year, which the editor intentionally avoided.

casualties5.png
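To see why the per-month average is equivalent to a full-year projection, note that the projection is just the average multiplied by 12, so the two views rank the years identically. A small Python sketch, using made-up numbers rather than the real counts:

```python
def monthly_average(total, months_observed):
    """Average casualties per observed month."""
    return total / months_observed


def projected_full_year(total, months_observed):
    """Naive projection: scale the partial total up to 12 months."""
    return total * 12 / months_observed


# Made-up partial year: 20 casualties over the first 4 months.
avg = monthly_average(20, 4)
projection = projected_full_year(20, 4)
print(avg, projection, projection == 12 * avg)
```

Either way, a year with only a few months of data is being compared to complete years on the strength of those few months.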

Finally, here’s a more scientific view, showing the monthly values with a trend line (spline smoother) overlaid.

casualties6.png

Thanks, Steve, for pointing me to this issue.

Heat Map Valentine

Wednesday, February 14th, 2007

Heat Map Valentine Heart

This is a heat map graph of a mathematically generated data set with a little noise added for artistic effect. Next year I’ll work on anti-aliasing the edges.

Automobile Maker Market Share Chart

Monday, November 13th, 2006

A week or so ago, Junk Charts featured a discussion (Rip Tide) of a New York Times chart of how auto-maker market share distribution in the US is becoming more like it is in Europe. The original chart showed lots of information in a pleasant way, but as usual folks want to do better — either to look better or to make the point better.

I scraped the data (csv) from the chart (thanks GraphClick), and provided a rough alternative to the graph.

auto market share distributions

My graph aims to show only enough information to support the text of the original chart. I chart ordered market share histograms for three different years so one can get a sense of how the US and European market share distributions are changing. I’m not sure how well the data supports the thesis though — it looks like both distributions are becoming more like the other rather than just the US becoming more like Europe.

I just found out today that Junk Charts actually made use of the data I posted and provided yet another alternate view (Calming the Rip Tide). Interesting, but I don’t think the boxplots work since they don’t show a trend.

From Hayseed to Ubergeek

Tuesday, September 12th, 2006

What a journey! From being labeled a “hay seed” [sic] by an anonymous blog commenter to being recognized as an “ubergeek” in print by the Raleigh News & Observer. The G.D. Gearino column in today’s paper traces his steps to track down my gender-neutral first name analysis that a fellow reporter somehow got wind of at a bar or party.

I don’t know who leaked the exercise, but now it’s out there. I used the Wake County registered voter database to analyze gender distributions of various first names to see which one was the most gender neutral. Of course, there are lots of ways to measure neutral, but I used the statistical definition of independence, looking for the name whose female/male ratio was most similar to that of the population (53% female) with the smallest confidence interval. Casey was the top name followed by Carey.
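One way to sketch that kind of ranking in Python is shown below. This is not the exact method I used; it assumes a Wilson score interval for each name’s female share and scores a name by how far the interval’s farther endpoint sits from the population share, so a name does well only if its share is close to 53% and its interval is narrow. The function names and the counts are hypothetical, not taken from the voter database.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def neutrality_score(female, total, population_share=0.53):
    """Smaller is more neutral: worst-case distance from the population share.

    This scoring rule is an assumption made for this sketch.
    """
    low, high = wilson_interval(female, total)
    return max(abs(low - population_share), abs(high - population_share))


def rank_names(counts, population_share=0.53):
    """Sort names from most to least gender neutral under the score above."""
    return sorted(
        counts,
        key=lambda name: neutrality_score(*counts[name], population_share),
    )


# Hypothetical (female count, total count) pairs.
counts = {"NameA": (530, 1000), "NameB": (53, 100), "NameC": (900, 1000)}
print(rank_names(counts))
```

NameA and NameB have the same female share, but NameA has ten times the voters and therefore a tighter interval, so it ranks first.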

I explored the time component, but didn’t factor it into my analysis. Just as names go in and out of favor they also change genders over time. For instance, Morgan was more male, but these days it’s more female. That is, an older voter named Morgan is likely to be male, and a young voter named Morgan is likely to be female. Most neutralish names move toward female. The only names that I remember going from female to male over time were Frankie and Robbie.

Orange County doesn’t seem to have voter names on-line — just summary statistics by precinct.