Relationship chord diagram study

I saw this “relationship chart” in 10 Visualizations Every Data Scientist Should Know by Jorge Castañón and was intrigued. I’ve never found these diagrams very valuable, but I was eager to learn where they are useful. Maybe I just needed the right data. In this case, the data consists of patient attributes in a drug study.

Each node is an attribute value and each curved line between two nodes represents patients having both attributes, with line thickness corresponding to the number of such patients. The article listed three insights:

  1. All patients with high blood pressure were prescribed Drug A.
  2. All patients with low blood pressure and high cholesterol level were prescribed Drug C.
  3. None of the patients prescribed Drug X showed high blood pressure.

How does the diagram support these statements? Not very well. It turns out some of these “insights” are not even true, let alone easy to discern. Claim #1 is false because there is a line from high blood pressure to Drug Y. Claim #2 describes a three-way relationship, which is not generally represented in the chart. I downloaded the helpfully provided raw data and found that the claim was actually false. Claim #3 is true because there is no line connecting Drug X and high blood pressure.

Likely the errors were editing mistakes or draft mix-ups, but the fact that an advocate for the usefulness of the chart didn’t notice the errors suggests the charts aren’t that insightful after all.

When I pointed out the errors on Twitter, the author immediately corrected them in the article, which was great to see. Now the claims read:

  1. All patients with high blood pressure were about equally prescribed Drug A and Y.
  2. Drug C is only prescribed to low blood pressure patients.
  3. None of the patients prescribed Drug X showed high blood pressure.

(I just realized that insight #3 is redundant given insight #1.)

Even if the chart type is not that effective, it could be that all other options are worse. That is, maybe seeing relationships between eleven attributes is too much for quick graphical understanding. So let’s try some alternatives.

The most obvious alternative is a graphical adjacency matrix since the original is a node-link graph and any node-link graph can be represented as an adjacency matrix. Here each square represents the number of patients with the X and Y axis values in common.
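The counts behind such a matrix are just pairwise co-occurrences. Here is a minimal sketch in Python, assuming a handful of made-up patient records; the field names and values are hypothetical and are not the study's actual data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical patient records, invented for illustration only.
patients = [
    {"BP": "HIGH", "Cholesterol": "HIGH", "Drug": "A"},
    {"BP": "HIGH", "Cholesterol": "NORMAL", "Drug": "Y"},
    {"BP": "LOW", "Cholesterol": "HIGH", "Drug": "C"},
    {"BP": "LOW", "Cholesterol": "HIGH", "Drug": "Y"},
    {"BP": "NORMAL", "Cholesterol": "NORMAL", "Drug": "X"},
]


def cooccurrence(records):
    """Count the patients who share each pair of attribute values."""
    counts = Counter()
    for rec in records:
        # One node per attribute value, e.g. "BP/HIGH"; sorting keeps
        # each pair in a single, predictable order.
        nodes = sorted(f"{key}/{value}" for key, value in rec.items())
        for a, b in combinations(nodes, 2):
            counts[(a, b)] += 1
    return counts


counts = cooccurrence(patients)
print(counts[("BP/HIGH", "Drug/A")])  # a filled square in the matrix
print(counts[("BP/HIGH", "Drug/X")])  # a missing square (count of zero)
```

Each nonzero count corresponds to one square in the matrix, or one line in the original node-link chart, and a zero count is a missing square.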

The missing squares certainly pop out better than the missing lines of the original for claim #3. To test claim #1, find BP/HIGH on the Y axis and scan across for the drug values. Drug A and Drug Y have about equal sized rectangles.

Since the data size is relatively small, we can replace the rectangles with grids containing one dot per patient.

The take-aways are generally the same but with a little extra precision since you can count dots if you like.

These eleven attributes are not all independent — they represent four variables with two to four values each. I’ve taken advantage of that in the axis layouts above, and can go further by using a different chart type, parallel sets, a generalization of parallel coordinates for categorical data.

Is that useful? Less so, I think. The lines do help support the connection concept, but the usefulness depends on the arrangement. Relationships between adjacent axes can be discerned but others can’t. However, it can be useful if you interact with it.

To test claim #1 we can click on the BP/HIGH value and see that those patients got both Drug A and Drug Y.

To test claim #2, select the combination of BP/LOW and Cholesterol/HIGH to see that both Drug C and Drug Y were included.

To test claim #3, select Drug X and see that none of the BP/HIGH group is highlighted.
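Those three selections amount to simple filters on the patient records. A rough sketch in Python, again using made-up records with hypothetical field names rather than the real study data, and a helper `values_of` that I am inventing for illustration:

```python
# Hypothetical patient records, invented for illustration only.
patients = [
    {"BP": "HIGH", "Cholesterol": "HIGH", "Drug": "A"},
    {"BP": "HIGH", "Cholesterol": "NORMAL", "Drug": "Y"},
    {"BP": "LOW", "Cholesterol": "HIGH", "Drug": "C"},
    {"BP": "LOW", "Cholesterol": "HIGH", "Drug": "Y"},
    {"BP": "NORMAL", "Cholesterol": "NORMAL", "Drug": "X"},
]


def values_of(records, field, **conditions):
    """Return the distinct values of `field` among records matching all conditions."""
    return {
        rec[field]
        for rec in records
        if all(rec[key] == wanted for key, wanted in conditions.items())
    }


# Claim #1: which drugs did the high blood pressure patients get?
claim1 = values_of(patients, "Drug", BP="HIGH")

# Claim #2: a three-way question, low BP and high cholesterol together.
claim2 = values_of(patients, "Drug", BP="LOW", Cholesterol="HIGH")

# Claim #3: which BP groups appear among Drug X patients?
claim3 = values_of(patients, "BP", Drug="X")

print(sorted(claim1), sorted(claim2), sorted(claim3))
```

The three-way question needs two conditions at once, which is exactly what the pairwise lines of the original chart cannot show.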

I’m still not sold on radial relationship charts and prefer the matrix as a static view, perhaps adding marginal indications of the size of each group which would correspond to the circle sizes in the original. But the radial charts are so popular I feel like I must be missing something and will keep studying.

Crossword solving times

I’ve been doing crossword puzzles on my iPad over breakfast (usually oatmeal) for the past few years. I got a subscription to the weekly American Values Club crosswords and would also do some free puzzles. Last year, I went in for a subscription to the New York Times crossword puzzles when they were having a discount, and I’ve been doing those almost daily.

The NYT crossword is a bit stricter than my general iPad Crosswords app. For instance, the NYT app doesn’t make it easy to look up clues via Google, which is odd since their solving guide touts “It’s not cheating, it’s learning” and quotes a former editor saying, “It’s your puzzle. Solve it any way you like.” And you can’t check your answers until the next day.

After each puzzle solved, the NYT app will tell you your time and solving streak (only counting same-day solves), which does provide a little extra motivation. I’ve often wanted a download button to get access to my past data to measure the day-of-week difficulty increase and to track progress over time. Only recently I realized that each day’s puzzle has a unique URL and I could visit past puzzles, solved or still in progress. And that made me look for a way to automate those visits to collect my timing data and solved status.

My usual technique of running a JMP script to get the raw HTML for the page didn’t work here, because the request didn’t carry my authentication to identify me to the NYT site. Perhaps there is a way to pass along that info in the URL or the payload, but I wasn’t hopeful. Instead I looked at automating Chrome to load each page for me. It turns out AppleScript is still around for automating macOS apps, and it even supports a JavaScript syntax now. Not that I know JavaScript that well, but there’s plenty of help for it online.

I uploaded my resulting AppleScript/JavaScript as crossword_times.js on GitHub in case anyone is curious. Getting Chrome to load a URL was fairly straightforward, but getting the data was a bit trickier. I had to study the HTML page to find the timing text and done status, and then get Chrome to execute an XPath query to retrieve the info I wanted. That meant having my script send another script to Chrome for execution, which required turning on an option to allow that in Chrome settings. I put all that in a loop and dumped the data as JSON for import into JMP.
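The last step, turning scraped timer text into numbers and dumping JSON, is easy to sketch. This is not the script on GitHub, just a small Python illustration, and it assumes the timer text looks like "m:ss" or "h:mm:ss"; the record fields, dates, and values are hypothetical.

```python
import json


def parse_timer(text):
    """Convert timer text such as '7:00' or '1:02:03' into seconds.

    Assumes colon-separated fields, most significant first.
    """
    seconds = 0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Hypothetical scraped rows; the real page structure may differ.
rows = [
    {"date": "2019-01-07", "timer": "7:00", "solved": True},
    {"date": "2019-01-12", "timer": "1:02:03", "solved": True},
]

for row in rows:
    row["seconds"] = parse_timer(row["timer"])

# A JSON string that can be saved to a file and imported elsewhere.
payload = json.dumps(rows, indent=2)
print(payload)
```

Storing plain seconds alongside the original text keeps the data easy to plot while leaving the raw value available for checking.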

Looking at the entire 16 months of data doesn’t show much, except that there were more puzzles at the beginning that I didn’t even start (those red x marks along the bottom). Zooming in on recent times, you can get a sense of the daily difficulty pattern.

Breaking out the weekdays, you can see both how the difficulty increased throughout the week and maybe that I’ve been improving over time.
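The same weekday breakdown can be approximated outside JMP. Here is a minimal Python sketch that takes the median solved time per weekday; the records are made up for illustration, and the tuple layout (date, seconds, solved flag) is my assumption, not the layout of the real data file.

```python
from collections import defaultdict
from datetime import date
from statistics import median

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical records: (ISO date, solve time in seconds, solved flag).
records = [
    ("2019-01-07", 420, True),   # Monday
    ("2019-01-08", 510, True),   # Tuesday
    ("2019-01-12", 1500, True),  # Saturday
    ("2019-01-14", 380, True),   # Monday
    ("2019-01-15", 0, False),    # Tuesday, never finished
]


def median_by_weekday(rows):
    """Median solve time per weekday, using solved puzzles only."""
    by_day = defaultdict(list)
    for iso, seconds, solved in rows:
        if solved:
            # date.weekday() returns 0 for Monday through 6 for Sunday.
            by_day[DAYS[date.fromisoformat(iso).weekday()]].append(seconds)
    return {day: median(times) for day, times in by_day.items()}


medians = median_by_weekday(records)
for day in DAYS:
    if day in medians:
        print(day, medians[day])
```

Like the trend lines, this ignores unsolved puzzles, so it describes only the days I actually finished.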

The trend lines are only based on the solved puzzles. There exist statistical techniques to incorporate the unsolved times into the trend by treating them as censored values, but I didn’t have much luck with it, apparently because I often gave up early and with irregular amounts of effort in general.

The weekend puzzles are harder and/or bigger, so I put them on a separate scale. Looks like I’m not improving too much on these.

I shared these images and more on Twitter last month, and my data is available as crossword_times.csv on GitHub and in interactive form on JMP Public, where you can filter by day of the week, for instance.