Remaking ProPublica’s Blocked News Graph

Earlier this month, investigative newsroom ProPublica published an interactive graph showing the availability of various news web sites within China.

original

My initial reaction was that the missing (gray) and inconclusive (yellow) data are dominating the view without providing much information. Of course, the graph is precisely reflecting the data available, but I’d prefer a view that helps me see larger patterns. Which news sites were likely blocked all year? The Huffington Post stands out with a lot of red, but it’s not so obvious what other news sites were also never measured as available throughout the year.

Here’s a close-up of three sites with lots of inconclusive measurements that dominate the coloring. They have three different mixes of conclusive measurements, but that difference doesn’t stand out until you focus on it.

originalclose

So my main “improvement” is to interpolate the absent/inconclusive data. For instance, if a site is measured as blocked, then unmeasured, then measured as blocked again, relabel the middle no data region as presumed blocked. Same for open, and for mixed endpoints relabel no data as in transition. Inconclusive is a really tricky category; I leave it as different from no data but similarly tint inconclusive measurements according to the surrounding measurements.

My second improvement is to avoid the red/green colorblindness issue. I switched the open color from green to blue. The stoplight colors have strong connotations which help with the interpretation of the original graph, and so it’s a trade-off to balance a small benefit to the 97% with normal vision versus a large benefit to the few. Still, here’s what the close-up above looks like with the most common red/green colorblindness (using Color Oracle):

originalblind

My third improvement is to reduce the white space between rows. I didn’t realize it until I tried it, but I think the strong white banding is distracting.

Finally, here’s my view (data through December 19):

remakeshort

Is my improvement better? In its current form it looks a bit busy, but I think if a designer picked better colors, it would be quite functional, assuming you buy in to the interpolation idea.

With my remake, it’s easier to see the three sites that were never seen as blocked (CNN, ProPublica, and Washington Post), the six news sites that were never seen as open and especially the transitions, which might be the most interesting parts. Presumably, the late September transitions are related to the Hong Kong protests.

I haven’t tried to reproduce the nice axis labeling, reference lines or the interactivity of the ProPublica version, all of which I like.

Toronto Discs

Bonnie and I spent last week in Toronto, and the week ended with a Frisbee theme. Though most of the trip was unplanned tourism, I did preview the area Ultimate offerings before leaving and found a Friday evening pick-up game that was only a 30 minute walk from the hotel.

torontoultimate1

Here’s a pic of the game at Sir Winston Churchill Reservoir with the ever-present CN tower in the background. The game was recreational, which is good because my legs were a little out of sync from walking around the city all week.

By complete coincidence, our hotel was across from the home field of the Toronto Rush semi-pro Ultimate team and they had a play-off game that Saturday night. Here’s a pic from the hotel room of the team warming up for the game.

torontoultimaterush1I attended most of the game despite the steady light rain. The rules were a little different with active referees assessing yardage penalties for violations. Unfortunately the game was a little sloppy due to the rain, but there was still some nice play amplified by three on-field video cameras and a giant replay screen on the scoreboard.

The next day, we ran into a third disc connection: a full disc golf course on the tiny islands in Lake Ontario. I don’t think I’ve ever seen a disc golf warning sign.

torontodiscgolf1

Not sure how active the community is. I saw this flyer for a tournament on the bulletin board and thought I might enter until I noticed the date.

torontodiscgolf2

Benchmarking resizable arrays versus linked lists in C++

The second half of Herb Sutter’s recent talk, “Modern C++: What You Need to Know,” explores the question of how many items do you need before it’s better to use a std::list instead of a std::vector for an insertion sort. The conventional wisdom is that eventually the cost of moving half or so of the vector to make room for a new element will overtake the overhead of the linked list traversal. However, testing showed the answer is never and underscored the importance of memory locality in modern computers (with lots of prefetched cache data).

I decided to try it out on my own machine, but before I got too far with the code I looked around to see if someone already had code I could use. Soon enough, I found C++ benchmark – std::vector VS std::list VS std::deque on Baptiste Wicht’s blog. He ran lots of tests (not just insertion and removal)  on lots of data types (not just integers). Like Herb Sutter, Baptiste also found arrays to be faster than linked lists for small objects, but lists were better for large objects. The meaning of large varied depending on the algorithm. For sorting, large was 16 bytes, for insertion, large was more like 64 bytes. He didn’t explore arrays of pointers to large objects.

In trying out his year-old code on my Mac, I encountered a couple compile errors and one logic error. Since the code was on GitHub, I thought it would be a good chance to try it out. I set up a GitHub account, forked his code into my account, fixed the issues, and made a pull request for my changes to be merged back into Baptiste’s original code. A couple days later, he accepted the changes — the internet works!

Here are some results from my runs. I went up to n = 1,000,000 for two operations and two data sizes.

vectlist

 

Looks like for my PowerMac, large is somewhere between 8 and 32. The real hero is deque, which is segmented vector that can easily grow at both ends.

 

Hacker Cup 2014 Round 1

I was both disappointed and happy with my Round 1 results in the Facebook Hacker Cup (scoreboard and problems). I was disappointed that I got two of the four problems wrong, but happy I did well enough to advance to round 2 (and came close on the two I missed). The round lasted 24 hours, so I should have taken the time to study and test the problems more that I did.

Problem 1 was essentially about representing a number in a given base with a given digit alphabet. The only trick not covered in the test set was the case where the “base” was 1. Fortunately, I caught that before submission.

Problem 2 involved finding the optimal strategy for a coin game. Not much programming involved, but I was off in my analysis of what the optimal strategy was and missed this one.

Problem 3 required a search for the longest path through a grid with a couple constraints, including being limited to one “backward” string of moves. I used a dynamic programming solution with a 4D array to represent the longest path from a given position (2D), backward-usage (1D) and incoming direction (1D).

Problem 4 required the most analysis, which I mostly got right, but overlooked one detail. The problem boiled down to adjusting a set of up to 20 numbers so that the sum is minimal and every pair has the same greatest common divisor (GCD). My optimized search of the problem space seemed to be good enough, but the detail I missed was in handling of GCD when one of the numbers was 0.

The moral (at least for such lengthly rounds) is to augment the test cases before declaring victory and submitting. Doing so saved me for problem 1 and could have saved me for problem 2. However, the next round lasts only 3 hours, so there is more cost to doing more testing.

Round 2 Update: I completed one problem and thought I completed a second, but had a problem submitting the solution. I accidentally ran my code twice and then got an “invalid character” error after trying to clean up the outputs. However, I had no time for the others and wouldn’t have advanced anyway.

Hacker Cup 2014 Qualification Round

I passed the Facebook Hacker Cup qualification round last week-end (problems). Unlike other rounds, there’s no real time pressure in the qualification round. I normally use Java for these competitions since the most productive IDE I have is JetBrains’ IntelliJ IDEA. However, since I’m trying to become more adept at the new idioms of C++11, I also tried that (using Xcode) for two of the problems.

The first problem involved identifying a square of ‘#’ characters in a field of ‘.’ characters. After some messy Java code that analyzed the grid as it was read in, my C++ effort read in the whole grid into memory first. The following C++ felt decent for finding the first and last rows not consisting of all ‘.’ characters:

string blank(n, '.');
auto isntBlank = [&blank](const string & row) { return row != blank; };
auto firstRow = find_if(grid.begin(), grid.end(), isntBlank);
auto lastRow = find_if(grid.rbegin(), grid.rend(), isntBlank);

However, I couldn’t subtract those variables to find the difference because the types are different (one is a forward iterator and the other is a reverse iterator). The base() method converts a reverse iterator to the underlying forward iterator, so the count is determined by

  lastRow.base() - firstRow

A few applications of similar find logic provided a solution that contains no explicit loops, which is a nice result.

I implemented the second problem dealing with basketball substitution patterns using a pair of priority queues for each team: one queue for the players on the court and another for the players on the bench. I don’t think there’s a clean way to allocate an array of generic collections in Java, and I ended up with the following for the bench of the two teams:

@SuppressWarnings("unchecked")
PriorityQueue[] bench = new PriorityQueue[2];
...
bench[t] = new PriorityQueue<>(n, new Comparator() {
   @Override
   public int compare(Student a, Student b) {
      if (a.minutes < b.minutes)
         return -1;
      if (a.minutes > b.minutes)
         return 1;
      if (a.draft < b.draft)
         return -1;
      if (a.draft > b.draft)
         return 1;
      return 0;
   }
});

C++ has a built-in pair<> template class that makes the comparison part more compact, but I never found a clean way to declare the array. For each type of queue, the players and the bench, requires a different comparison function, so it can’t be built-in to the object class itself. There’s a constructor to supply a separate comparison function for that case, but I couldn’t get the constructor argument to match the generic comparison type so I ended up specifying the type in the template argument list. It seems to violate the C++ DRY goal, but it’s not too bad using a typedef.

auto benchLess = [](Student const & a, Student const & b)
   { return make_pair(a.minutes, a.draft) > make_pair(b.minutes, b.draft);};
typedef priority_queue<Student, vector, decltype(benchLess)> BenchType;
BenchType bench[]{ BenchType(benchLess), BenchType(benchLess)};

The third problem involved a dynamic programming style probability calculation and contained the statement, “It is guaranteed that the output is unaffected by deviations as large as 10-8. ” I’m now not sure what that means. I took it to mean that conditional probabilities could be rounded to 10-8 and values less than 10-8 could be ignored in order to get the required 6 digits on accuracy. As an optimization I ignored values less than 10-10. However, that led to one of my answers having just less than the required accuracy, so there must be some other meaning to the statement.

Hoping to try round 1 next week.

 

Volunteering at the Ultimate Frisbee US Open

I spent the long Independence Day week-end volunteering at the USA Ultimate US Open, held in Raleigh this year. My main job was scorekeeper, though I also did some field set-up. The hardest part was just standing in the sun all day, but that was better than rainstorms like we had before the tournament began. The volunteers kept the score and some basic stats like which players scored or made turnovers, but some teams had their own stat-keepers using iPad apps that tracked every throw.

The US Open is an invitation-only tournament of eight top team in three divisions, Women’s, Mixed and Men’s, including some international teams. Team Columbia was fun to watch but hardest to keep stats for since their uniform numbers were small and at the bottom of their shirts which were sometimes tucked in. I worked games in all three divisions. In my small sample, it seemed that play quality improved as I went from Women’s to Mixed to Men’s. That guess is born out by a look at turn-over to goal ratio from the US Open basic stats page.

usopen1

However, there is significant variation within divisions. All the teams:

usopen2

These stats are through pool play (before the semifinals), and interestingly none of the eventual winners (Revolver, Odyssée, and Fury) had the best TO/Goal ratios for their respective divisions.

Youtube has some highlights, including the winning catch of the Men’s finals.

 

Tie-dye Sportcoat

I’ve wanted to try tie-dying a sportcoat for a while, occasionally looking for white cotton jackets in thrift shops. Finally I broke down and bought the cheapest new white cotton jacket I could find, a $60 G by GUESS blazer. The low-water immersion technique seemed best since it might be difficult to fold and tie something like a jacket. I thought I would keep it a light colored jacket by using diluted periwinkle for the color, but it still came out with strong color (though it’s not as blue as it looks in the photo).

IMG_0363

Apparently the styling is meant for teenagers with long skinny arms. I haven’t figured out any occasion where it might be appropriate to wear it, but I did wear it to work one day with mixed reactions. I’m told if I wear it unbuttoned, it looks less like a woman’s jacket. Maybe I’ll try again with a duller color like brown, gray or dark blue.

I also did a shirt using a variation of the low-water immersion where I packed the shirt into a tight container and sprinkled dye powder over it and then added water. I put the purple dye around the edge and the yellow dye in the middle. I think it came out very good.

IMG_0357