Makeover Monday 2020/W1

It’s been a year since my last official MakeoverMonday entry. I’m finally realizing that most of the action is on Sunday, so maybe I’ll do more this year. Week 1 of 2020 looks simple, but it’s already confusing me. The task is to make over this Vox chart from 2014:

The chart shows Gallup poll responses for favorite sports to watch over 70+ years for the three most popular sports (US only). However, the Makeover Monday data covers responses for 19 sports but only seven polls spanning 14 years. So perhaps I’m already breaking the rules, but I’m going to use the full data since it’s available on the same source page at Gallup. The full data includes only seven sports, but the omitted ones are tiny and can be ignored for this makeover.

Review of the original

I like the main design decisions of the original:

  • Showing trends over time
  • Dropping less popular sports to focus on the main sports
  • Smoothing the trend lines
  • Labeling the lines directly instead of with a separate legend
  • Trying to use semantic colors — I didn’t realize it until I tried to pick semantic colors myself: football fields are green; basketballs are orange; baseball bats are yellow.
  • Abbreviating the years so the x axis is not so crowded.

Oddities of the original:

  • Uneven trend line smoothness
  • Y axis labels and gridlines are at multiples of 10%, except the top one, which is at 13%.
  • Label colors don’t match the lines and are not quite aligned with the ends of the lines.
  • Putting the y axis labels above the ticks/gridlines instead of inline with them is not that uncommon, but it still takes me longer to parse the positions.

The uneven smoothness was the most prominent feature for me. At first, I read it as saying the change had been steady for decades before starting to fluctuate in the internet era. However, I realized it was more likely that the poll was conducted less frequently in the past, which is indeed the case.

Data exploration

Continuing that thought, let’s look at all the data values for those sports. Here’s a remake using the same technique as the original (a connected line with smooth connections) but also showing the data points.

This matches pretty well, except for 1972 and 1994, when the polls were conducted twice each year. It looks like the Vox author ignored one of the polls in each of those years. Also, the data I retrieved has an additional year of data (2017) after the Vox article came out in 2014.
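For the curious, here’s a minimal sketch of that kind of remake in Python (my actual chart was made in JMP); the file name and column names are hypothetical, with one row per poll for a single sport:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline

# Hypothetical input: one sport's poll results with 'date' and 'pct' columns
polls = pd.read_csv("gallup_football.csv", parse_dates=["date"])
polls = polls.sort_values("date").drop_duplicates("date")   # the spline needs increasing x

# Decimal years make the x axis easy to work with
x = (polls["date"].dt.year + (polls["date"].dt.month - 0.5) / 12).to_numpy()
y = polls["pct"].to_numpy()

curve = make_interp_spline(x, y, k=3)        # smooth curve through every data point
grid = np.linspace(x.min(), x.max(), 400)

plt.plot(grid, curve(grid))
plt.plot(x, y, "o")                          # also show the actual poll values
plt.xlabel("Year")
plt.ylabel("Percent naming the sport")
plt.show()
```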

Beyond the granularity, the shared data includes seven sports instead of three, adding ice hockey, soccer, auto racing and figure skating. Of those, only ice hockey and soccer had more than 2% of responses.

The dates are given as month/year values, which distinguishes the multiple polls taken in some years. In one year, 1997, the poll question was “What is your favorite sport to follow?” instead of “to watch.” The results weren’t that different, but I can imagine quite different interpretations.

Though I didn’t use the official 19-sport data set, which only goes back to 2004, I noticed it also tallied responses such as “other” (about 5%) and “none” (about 13%). I can dream that ultimate frisbee accounts for a decent chunk of “other,” but it’s unlikely. I’m sure it would be up there for a question on “your favorite sport to play.”

Graph makeover

I do think the long-term trends for the main sports are a good message, so I sought to show them while minimizing the oddities noted above. The most straightforward approach is a scatterplot with a real smoother (in this case a spline regression):

The data marks help communicate the irregular polling and the variation, but they also add a bit of visual noise. I didn’t try abbreviating the years, and I didn’t put a lot of effort into lining up the line labels. One downside of attaching my labels to data points in the graph is that I had to expand the graph, which means 2020 and beyond are now visible on the date axis. Not terrible, but it seems like a negative.
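The smoother itself is easy to reproduce; here’s a minimal Python sketch (not my JMP steps), assuming a hypothetical long-format CSV with sport, date, and pct columns:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: columns 'sport', 'date', 'pct'
polls = pd.read_csv("gallup_favorite_sport.csv", parse_dates=["date"])
polls["year"] = polls["date"].dt.year + (polls["date"].dt.month - 0.5) / 12

football = polls[polls["sport"] == "Football"]

# A cubic B-spline basis with a handful of degrees of freedom acts as the smoother
fit = smf.ols("pct ~ bs(year, df=5)", data=football).fit()

grid = pd.DataFrame({"year": np.linspace(football["year"].min(), football["year"].max(), 200)})
grid["smooth"] = fit.predict(grid)
```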

I added the next two sports for a fuller story, and because soccer seems to be really gaining this past decade. And for some pedantic reason, I dropped the 1997 responses, when the poll question had slightly different wording. I didn’t want to have to add an asterisk to the chart title.

Another way to show the irregular polling would be to show vertical lines on the polling dates.

Not bad — I hadn’t really noticed how the frequency had dropped off in the last 10 years. We’ve lost any indication of the variation in the responses, though. We can get an estimate by adding a bootstrap confidence interval to the spline regression.
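Here’s a rough Python sketch of that bootstrap idea under the same hypothetical data frame; the spline’s boundary knots are pinned so each resampled fit can be evaluated over the full date range:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

polls = pd.read_csv("gallup_favorite_sport.csv", parse_dates=["date"])   # hypothetical file
polls["year"] = polls["date"].dt.year + (polls["date"].dt.month - 0.5) / 12
football = polls[polls["sport"] == "Football"]

lo, hi = football["year"].min(), football["year"].max()
grid = pd.DataFrame({"year": np.linspace(lo, hi, 200)})
formula = "pct ~ bs(year, df=5, lower_bound=lo, upper_bound=hi)"

curves = []
for _ in range(500):                                    # refit the smoother on resampled polls
    boot = football.sample(len(football), replace=True)
    curves.append(smf.ols(formula, data=boot).fit().predict(grid))

band = np.percentile(np.column_stack(curves), [2.5, 97.5], axis=1)   # pointwise 95% interval
lower, upper = band[0], band[1]
```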

There’s some argument for only showing the confidence band.

Not sure I like that, but maybe I’m just not used to it. I’ll compromise and go with a thinner trend line.

For this last graph, I put a little more effort into lining up the line labels without extending the axis.

More data exploration

Though my chart has a small text summary in the subtitle, I don’t speculate on the why of the trends. The Vox article suggests the creation of the Super Bowl and modern NFL were catalysts for the shift from baseball to football. I imagine the rise of TV viewing was a factor, since football may be more accessible or more fun to watch on TV. And the recent rise of soccer in the US could be related to the rising Hispanic population, the success of the women’s national team, or just more internationalization in general.

The Gallup data also includes the month of the poll, which I showed in my charts as the 15th of the given month. One might also wonder whether the popularity of a sport depends on whether the sport is in season during the poll. Unfortunately, there’s not enough data and month variation to read too much into it. Most of the recent polls have been done in December, in the thick of football season. I did try a linear model with month as a separate factor, and a few month-sport interactions had p-values less than 0.001. For instance, the effect on football of polling around March is about negative three percentage points.
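For the record, the model had roughly this shape (a Python sketch with hypothetical column names, not my exact fit):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per poll x sport with 'pct', 'sport', 'year', 'month'
polls = pd.read_csv("gallup_favorite_sport.csv")

# Year carries the trend; month enters as a factor, with sport-month interactions
fit = smf.ols("pct ~ C(sport) * year + C(sport) * C(month)", data=polls).fit()
print(fit.summary())   # look for small p-values on the sport:month terms
```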

The end

I’m not even sure my work qualifies since I didn’t use the official data set, but I think I’ll submit the last chart above as my entry.

Graphs tweeted in 2019

Here are 166 graphs I made and tweeted in 2019. There’s minimal commentary, and I copied them from my tweets rather than tracking down the originals, so I’m not sure about the image quality. I omitted a few near-duplicates. All were made with JMP except where noted. Each image should be hyperlinked to its tweet (but I probably missed a few).

January

I started out 2019 with some data I collected from the Radio Paradise online playlist. This packed bar chart of artists is not a great fit for packed bars, which is a good sign for the radio station: it means the artist distribution is not very skewed.

I also collected the song ratings, hoping to understand why they keep playing Joni Mitchell.

I tried a couple variations on hockey attendance data for #MakeoverMonday. I still haven’t figured out the right way to participate there.

More Radio Paradise data.

Trying a few alternative ways to compare two small distributions.

Still milking that Radio Paradise data.

These next four views of the data led to collaborating with Nick Desbarats on the A Friendlier Dot Plot blog post.

A fun chart I made with Google Trends.

A couple alternative views of some tea survey data.

Four alternatives to some political radar charts I saw on 538.com. I don’t find radar charts useful in general, but maybe everything else is worse.

February

Recreating a W. E. B. Du Bois map, first mimicking his colors and then coloring by a data value.

I turned the dot plot alternatives into an Observable notebook. Here are some results that I shared.

Here’s my first plot of sign-ups for the local Analytics>Forward unconference.

Charting a colleague’s movie ticket price history.

March

I tried this month’s Story Telling With Data challenge, looking at country-to-country financial aid.

Playing with force-directed dodging on Observable.

A couple final Analytics>Forward attendance charts.

My full submission for the #SWDchallenge.

Remade a pie chart grid as a heatmap.

Trying a static view of one of the bar chart race data sets.

April

I discovered I could download my water usage data.

I made an Observable notebook for making packed bar charts.

Trying to reproduce a strange regression line in a NYT graphic. This proved to be a rich data set and would later become the example data for my JSM interactivity presentation.

May

Another radar makeover.

Trying to remake a suspicious journal plot.

Makeover of some soccer league rankings.

Carbon-dioxide emissions are so skewed you can see the top five and all the other countries packed in the same graph.

I made and remade a few charts from this paper on gender and the effect of temperature on cognitive performance.

With some effort, I was able to collect my crossword puzzle solving times. I later made good use of the data in my JMP 15 keynote segments in Tucson and Tokyo.

June

Found the Global Power Plant Database.

More crossword graphs. This time comparing ways of comparing distributions.

College majors for data scientists from a Twitter poll.

Started looking at Greenland ice melt data. The first graph just verifies that I was reading the gridded data values correctly, but I ended up switching to a different source with summary values for each day.

A word cloud from Apple keynote transcripts.

Another journal chart makeover.

July

Testing the limits of packed bars on audiobook counts. Is 26 too few items? Is 69,321 too many?

Some results I graphed from a salary survey of statisticians.

I also make and share graphs on the Cross Validated Q&A site. Here’s one I also tweeted, simulating overlapping bars.

A makeover of a bar-mekko chart.

Demonstrating bars with labels inside the bars.

A way to show gains and losses along with the net result.

A makeover of a questionable ISOTYPE graph from UNC.

Teaser graph for the blog post to go along with my JSM talk on interactivity modes in data visualization.

August

My one graph creation from attending JSM: sponsor booth space.

Another #MakeoverMonday data set: Britain’s power generation.

Not including them here, but I shared a dozen animated GIFs showing the nine interactivity techniques from my JSM talk.

Another #MakeoverMonday data set: clinical trials counts.

Trying to break down UK suicide data. I thought it might be a baby boom effect, but there seems to have been less of a baby boom in the UK.

I made this comparison of early box plot forms.

A zoomed-in axis inset to explore showing both long and short time scales.

I took the coal production data Tukey used for the smoothers in his book and tried a few other smoothers.

Violins versus Highest Density Region plots:

September

Updated Greenland ice melt cumulative view.

Verifying that the bars in a NYT Notre Dame graphic were sized on a square root scale instead of linear.

Checking default Y axis scaling for a line plot in JMP.

Exploring how packing affects aggregate size.

Looking at school district diversity data from a Washington Post graphic.

October

Discovered livestock data at UN’s Food and Agriculture Organization site. Showing 4 of 11 packed bar charts here.

Comparing the effects of aggregation on regression after seeing an odd analysis in Significance magazine. They later published this graph in their correspondence section.

Tourism in Portugal

Some alternatives to a truncated bar chart I saw in a paper.

A mini-gallery I made to show some new JMP 15 features.

I remade an emoji pie chart as packed bars and then someone suggested packed circles, so I tried that, too.

November

Looking at UK Conservative votes versus deprivation measures.

After a marriage statistics paper shared their data and details, I was able to reproduce the results.

Here’s a sampling of several graphs I posted in a study of a chord diagram (see also my previous blog post).

Makeover of a study paper, removing a dubious log scale.

Answering a question about histogram binning.

This underground water leak took a while to find and fix, but it was nice to see the data, at least.

Making an example geographic scatter plot.

Trying out a shading idea from Len Kiefer.

Data from Steam gaming usage.

Looking at data from an Economist graphic about earning for college graduates versus the colleges’ admission rates.

December

Despite knowing next to nothing about UK politics, I tried some graphical reproductions and explorations based on an Economist ternary chart.

Comparing stats masters programs. I still don’t know why Columbia is so far above the others in number of students.

Remake of an animated ozone chart.

Makeovers of a Reuters bar-mekko chart.

Restyling a journal’s scatter plot.

Alaska area comparison.

Relationship chord diagram study

I saw this “relationship chart” in 10 Visualizations Every Data Scientist Should Know by Jorge Castañón and was intrigued. I’ve never found these diagrams very valuable, but I was eager to learn where they are useful. Maybe I just needed the right data. In this case, the data consists of patient attributes in a drug study.

Each node is an attribute value and each curved line between two nodes represents patients having both attributes, with line thickness corresponding to the number of such patients. The article listed three insights:

  1. All patients with high blood pressure were prescribed Drug A.
  2. All patients with low blood pressure and high cholesterol level were prescribed Drug C.
  3. None of the patients prescribed Drug X showed high blood pressure.

How does the diagram support these statements? Not very well. It turns out some of these “insights” are not even true, let alone easy to discern. Claim #1 is false because there is a line from high blood pressure to Drug Y. Claim #2 describes a three-way relationship, which is not generally represented in the chart; I downloaded the nicely provided raw data and found that the claim was actually false. Claim #3 is true because there is no line connecting Drug X and high blood pressure.

Likely the errors were editing mistakes or draft mix-ups, but the fact that an advocate for the usefulness of the chart didn’t notice the errors suggests the charts aren’t that insightful after all.

When I pointed out the errors on Twitter, the author immediately corrected them in the article, which was great to see. Now the claims read:

  1. All patients with high blood pressure were about equally prescribed Drug A and Y.
  2. Drug C is only prescribed to low blood pressure patients.
  3. None of the patients prescribed Drug X showed high blood pressure.

(I just realized that insight #3 is redundant given insight #1.)

Even if the chart type is not that effective, it could be that all the other options are worse. That is, maybe seeing relationships among eleven attributes is too much for quick graphical understanding. So let’s try some alternatives.

The most obvious alternative is a graphical adjacency matrix since the original is a node-link graph and any node-link graph can be represented as an adjacency matrix. Here each square represents the number of patients with the X and Y axis values in common.
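Tallying that matrix is just counting attribute-value pairs within each patient. A minimal Python sketch, assuming a hypothetical patient table with one row per patient and one column per variable:

```python
import itertools
import pandas as pd

# Hypothetical file: columns like 'Drug', 'BP', 'Cholesterol', 'Na_to_K'
patients = pd.read_csv("drug_study.csv")

# Label each cell as "Variable/VALUE", then count every pair of labels within a patient
labels = patients.apply(lambda row: sorted(f"{c}/{row[c]}" for c in patients.columns), axis=1)
pairs = pd.DataFrame(
    [pair for row in labels for pair in itertools.combinations(row, 2)],
    columns=["x", "y"],
)
matrix = pairs.groupby(["x", "y"]).size().unstack(fill_value=0)   # co-occurrence counts
```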

The missing squares certainly pop out better than the missing lines of the original for claim #3. To test claim #1, find BP/HIGH on the Y axis and scan across for the drug values. Drug A and Drug Y have about equal sized rectangles.

Since the data size is relatively small, we can replace the rectangles with grids containing one dot per patient.

The take-aways are generally the same but with a little extra precision since you can count dots if you like.

These eleven attributes are not all independent — they represent four variables with two to four values each. I’ve taken advantage of that in the axis layouts above, and can go further by using a different chart type, parallel sets, a generalization of parallel coordinates for categorical data.

Is that useful? Less so, I think. The lines do help support the connection concept, but the usefulness depends on the arrangement. Relationships between adjacent axes can be discerned but others can’t. However, it can be useful if you interact with it.

To test claim #1 we can click on the BP/HIGH value and see that those patients got both Drug A and Drug Y.

To test claim #2, select the combination of BP/LOW and Cholesterol/HIGH to see that both Drug C and Drug Y were included.

To test claim #3, select Drug X and see that none of the BP/HIGH group is highlighted.

I’m still not sold on radial relationship charts and prefer the matrix as a static view, perhaps adding marginal indications of the size of each group which would correspond to the circle sizes in the original. But the radial charts are so popular I feel like I must be missing something and will keep studying.

Bar-mekko chart study

The September edition of Andy Kirk’s Best of the Visualisation Web includes a Reuters Graphics article, India is running out of water, which explores India’s water sources. The main sources are groundwater (from wells) and surface water (from lakes and rivers). The article shows how some regions are using more groundwater than is being replenished. I feel like I learned a lot about the water supply in India, so I consider it a successful article.

However, one chart puzzled me: this graphic comparing India with select other countries.

The form is called “bar-mekko” as a hybrid between a (stacked) bar chart and a marimekko or mosaic chart. That is, the bars have variable breadth according to some other variable. (I’ll use breadth and length as the rectangle dimensions, thinking height and width are more ambiguous for horizontal bars. For this chart, breadth is along the Y axis and length is along the X axis.)

I’m not sure bar-mekko is a good chart form in general, but I found this one particularly troublesome. Ignoring the color stacking as an orthogonal feature, three different quantities are visually represented by each rectangle: the breadth, the length and the area. But in this case, the area has no meaningful interpretation. Area is population × water use. The viewer sees area differences but has to ignore them, which is extra decoding work. For instance, the USA is not far behind India and China in total usage (bar length), but you might think otherwise at first since it’s much smaller by area.

On Twitter, I suggested this rule for bar-mekko charts:

Area must have a meaningful interpretation.

Or more generally, for all charts:

All visual encoding channels in use must have meaningful interpretations.

To better understand the bar-mekko form, I set out to make a bar-mekko chart with this data where breadth, length and area all had meaningful interpretations.

Getting the data

The article cites several data sources, and I found most of the country-level data from the UN’s FAO AQUASTAT database. For some reason India was not there, so I read those values off the original chart. In the process of understanding the data, I found two errors.

I downloaded recent water use data for all available countries and checked whether there were any other major groundwater users beyond those in the article. India and the USA were missing, but I was surprised to see that the Republic of Moldova was the top withdrawer of groundwater.

For a moment, I thought Moldova might have massive wells supplying all of Eastern Europe, but looking closer at the data, I saw that Moldova’s previous groundwater usage figure was about 1000x smaller: 0.129 versus 126. So it was likely a misplaced decimal point. I reported the issue to the AQUASTAT contact, and they responded quickly, confirmed the problem, and corrected the online database. Yea!

The other data error turned up while reading the India values off the original chart. Its X axis is in millions of liters, and 600 million liters per year doesn’t seem like much for a country with over a billion people; even as daily usage that wouldn’t be much. I now think it should be trillions of liters instead. The AQUASTAT data is in billions of cubic meters, and I suspect the author divided by 1,000 instead of multiplying by 1,000 to convert cubic meters to liters. However, I haven’t heard back after sending a message to that effect.
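A quick order-of-magnitude check of the suspected slip, assuming the chart’s reading of about 600 corresponds to roughly 600 billion cubic meters in AQUASTAT:

```python
total_m3 = 600e9           # ~600 billion cubic meters per year (assumed reading)
print(total_m3 * 1000)     # correct conversion: 1 m^3 = 1000 L -> 6e14, about 600 trillion liters
print(total_m3 / 1000)     # dividing instead: 6e8, about 600 million liters, matching the chart
```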

Remake as bar-mekko

To follow the meaningful area rule, I sought to keep population as the rectangle breadth and use per-capita water use as the length, which would make area correspond to total usage.
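Drawing variable-breadth bars is mostly bookkeeping for the cumulative positions. Here’s a minimal Python/matplotlib sketch (my chart was made in JMP), assuming a hypothetical table with country, population, and per_capita_use columns:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical columns: 'country', 'population', 'per_capita_use'
df = pd.read_csv("aquastat_water_use.csv")
df["total_use"] = df["population"] * df["per_capita_use"]
df = df.sort_values("total_use")                       # rank order along the Y axis

breadth = df["population"].to_numpy()
edges = np.concatenate([[0.0], np.cumsum(breadth)])    # stack the bars with no gaps
centers = edges[:-1] + breadth / 2

fig, ax = plt.subplots()
ax.barh(centers, df["per_capita_use"], height=breadth, edgecolor="white")
for name, y in zip(df["country"], centers):
    ax.text(0, y, " " + name, va="center")             # crude labels; the real challenge remains
ax.set_xlabel("Per-capita water use")
ax.set_ylabel("Population (stacked)")
plt.show()
```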

It works, but the main message of the article, comparing India’s total water usage, is no longer so prominent. It’s still there, both in the area sizes and in the rank order of the Y axis, but the length of the bar along the X axis is most noticeable and easiest to compare. So the main message is the per-capita usage.

Other features of note: I’m not sure why this subset of countries was chosen for the original article, but I added Pakistan since it also has a lot of water usage and is near India. Oddly, Pakistan’s per-capita usage is more like the USA’s. Also, the unlabeled bar is Spain. Labeling is a challenge with variable-breadth bars; I could have used a tiny font like the original did, or squeezed in a special label, but I was too lazy.

Remake as mosaic

A traditional mosaic chart is more about proportions.

The breadths correspond to water usage for the entire country, the lengths correspond to the relative breakdown within country, and the areas correspond to the water usage for that country and source. Now it’s the proportion of each water source that’s easiest to compare.
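For what it’s worth, statsmodels can draw a weighted mosaic directly from a table of values; a sketch under the same hypothetical data, this time with one row per country and water source:

```python
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic

# Hypothetical long-format table: columns 'country', 'source' (groundwater/surface), 'use'
df = pd.read_csv("aquastat_by_source.csv")

# mosaic() accepts values indexed by the category combinations, here weighted by water use
weights = df.set_index(["country", "source"])["use"]
fig, rects = mosaic(weights, gap=0.01, title="Water use by country and source")
```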

Remake as bars

Regarding the general value of bar-mekko charts, Dan Zvinca noted on Twitter:

I believe that it is always easier (and uniform) to compare components and the results of any math calculation (that includes basic arithmetic operations as well) than encoding the components and the results in one view.

That’s the thinking behind my bar chart version.

Now it’s easy to compare both the total water usage and the groundwater usage among countries without the area distraction or the tight labeling challenges.

Crossword solving times

I’ve been doing crossword puzzles on my iPad over breakfast (usually oatmeal) for the past few years. I got a subscription to the weekly American Values Club crosswords and would also do some free puzzles. Last year, I went in for a subscription to the New York Times crossword puzzles when they were having a discount, and I’ve been doing those almost daily.

The NYT crossword app is a bit stricter than my general iPad Crosswords app. For instance, the NYT app doesn’t make it easy to look up clues via Google, which is odd since their solving guide touts “It’s not cheating, it’s learning” and quotes a former editor saying, “It’s your puzzle. Solve it any way you like.” And you can’t check your answers until the next day.

After each solved puzzle, the NYT app tells you your time and solving streak (only counting same-day solves), which does provide a little extra motivation. I’ve often wanted a download button for access to my past data, to measure the day-of-week difficulty increase and to track progress over time. Only recently did I realize that each day’s puzzle has a unique URL, so I could visit past puzzles, solved or still in progress. That made me look for a way to automate those visits to collect my timing data and solved status.

My usual technique of running a JMP script to get the raw HTML for the page didn’t work here, because it didn’t carry my authentication for the NYT site. Perhaps there is a way to pass that info along in the URL or the payload, but I wasn’t hopeful. Instead, I looked at automating Chrome to load each page for me. It turns out AppleScript is still around for automating macOS apps, and it even supports a JavaScript syntax now. Not that I know JavaScript that well, but there’s plenty of help for it online.

I uploaded my resulting AppleScript/JavaScript as crossword_times.js on GitHub in case anyone is curious. Getting Chrome to load a URL was fairly straightforward, but getting the data was a bit trickier. I had to study the HTML page to find the timing text and done status, and then get Chrome to execute an XPath query to retrieve the info I wanted. That meant having my script send another script to Chrome for execution, which required turning on an option to allow that in Chrome settings. I put all that in a loop and dumped the data as JSON for import into JMP.

Looking at the entire 16 months of data doesn’t show much, except that there were more puzzles at the beginning that I didn’t even start (those red x marks along the bottom). Zooming in on recent times, you can get a sense of the daily difficulty pattern.

Breaking out the weekdays, you can see both how the difficulty increased throughout the week and maybe that I’ve been improving over time.

The trend lines are based only on the solved puzzles. There are statistical techniques for incorporating the unsolved times into the trend by treating them as censored values, but I didn’t have much luck with them, apparently because I often gave up early and put in irregular amounts of effort in general.

The weekend puzzles are harder and/or bigger, so I put them on a separate scale. Looks like I’m not improving much on these.

I shared these images and more on Twitter last month, and my data is available as crossword_times.csv on GitHub and in interactive form on JMP Public, where you can filter by day of the week, for instance.

Code Jam 2019 Round 1B/1C

After just missing advancing out of Google Code Jam Round 1A, I tried again in Round 1B and again in Round 1C, but with no success. I guess I haven’t adapted well to the new interactive problems. In both cases, I got the first problem solved and then got stuck on the second problem, which was interactive both times. The interactive problems require your submission to exchange messages with another program to work out the solution, which makes them a bit different to debug.

In the interest of advancing, I should have moved on to the third problem or just done the easier subset of the interactive problem. But I’m only doing this for the fun of the challenge, and it was fun to eventually work out the solution, even though it took me a little longer than the allotted time.

Code Jam 2019 Round 1A

I tried the 2.5-hour Round 1A Friday night and just missed the cut-off for advancing. The round started a few minutes late as the problems site was overloaded at first. When it did respond, I got the problems in a different order, with the last problem listed first. Thinking it was the first (and therefore easiest) problem, I started on it without noticing the difficulty scores. That problem, “Alien Rhyme,” involved finding pairs of words with common suffixes. There were a few gotchas with the greedy approaches, so I was lucky to work out a correct algorithm, but it took me a few tries to get a decent data structure for organizing the words by common suffixes. I ended up with a vector of maps, one per suffix length, with each map itself mapping a suffix to a set of words (sketched below). Amazingly, it worked.
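Here’s roughly what that structure looks like, sketched in Python rather than the language I used for the contest (just the indexing, not the full pairing algorithm):

```python
from collections import defaultdict

def build_suffix_index(words):
    """Sketch of the indexing step: index[k] maps each length-k suffix to the words ending with it."""
    longest = max(len(w) for w in words)
    index = [defaultdict(set) for _ in range(longest + 1)]   # one map per suffix length
    for w in words:
        for k in range(1, len(w) + 1):
            index[k][w[-k:]].add(w)
    return index

# Example: words grouped by their 2-letter suffixes
idx = build_suffix_index(["cat", "hat", "dog", "frog"])
print(idx[2]["at"])   # {'cat', 'hat'}
print(idx[2]["og"])   # {'dog', 'frog'}
```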

Next I tried the Pylons problem, which boiled down to finding a complete path through a graph. I couldn’t think of a good approach for the large 20×20 constraint, so I went for partial credit with a brute-force solution that would work for the small 5×5 constraint. After that was submitted, I noticed that the large graphs were dense enough to have many complete paths, so maybe a randomized brute-force approach would work. I added a little randomization and resubmitted. It turned out I had the right idea, but I didn’t randomize enough (only the starting points, not the node visitation order). So I still only got partial credit, plus a penalty for making a second submission.
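The fix amounts to randomizing the whole visitation order with restarts, not just the starting cell. A hedged Python sketch of that idea (the compatibility rule is my reconstruction of the problem’s constraint that consecutive cells can’t share a row, column, or diagonal):

```python
import random

def compatible(a, b):
    """Consecutive cells must not share a row, column, or diagonal."""
    (r1, c1), (r2, c2) = a, b
    return r1 != r2 and c1 != c2 and r1 - c1 != r2 - c2 and r1 + c1 != r2 + c2

def random_complete_path(rows, cols, tries=10000):
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    for _ in range(tries):
        remaining = cells[:]
        random.shuffle(remaining)                  # random start AND random visitation order
        path = [remaining.pop()]
        while remaining:
            options = [c for c in remaining if compatible(path[-1], c)]
            if not options:
                break                              # dead end: restart with a new shuffle
            nxt = random.choice(options)
            remaining.remove(nxt)
            path.append(nxt)
        if not remaining:
            return path
    return None

print(random_complete_path(10, 10))                # dense grids usually succeed quickly
```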

The remaining problem was an interactive one, and I had too little time left for the overhead of testing it locally, so I didn’t attempt it. I could see that the solution might require solving a Chinese Remainder problem, and I did solve it the next day. It’s nice that the site enters practice mode when the competition is over, so we can test out ideas later.

The good news is that I get to compete in Round 1B!