2017 Highlights

Just to avoid the embarrassment of going a calendar year without a blog post, here is a summary of some of my 2017 highlights. It’s not that I’ve been silent, but I’ve mostly switched to microblogging on Twitter (@xangregg) for sharing updates.

Tie-dye marathon

Some summers I keep track of a tie-dye “marathon” where I see how many days in a row I can go wearing a different tie-dye each day. This year I made it 32 days. Most of the images are on Twitter. Here are days 13 – 24.

Google Code Jam 2017

I tried both the code jam and the distributed code jam this year. In the regular code jam I made it to the round of 3000 before bombing out with a tied-for-last score. I keep forgetting that the later rounds often require ready-made advanced optimization algorithms. I did better in the distributed code jam, advancing to the round of 500, which was enough to win my third code jam T-shirt. The distributed code jam is tricky since you submit code that is run on a distributed system of 100 CPUs. This year it helped that I built a test harness that used 100 threads for a better simulation of the actual process-communication issues.

Packed Bars

Trying to find a decent way to show data with many categories in a skewed distribution, I created a new chart type called packed bars. Here’s an example showing costs of billion-dollar disasters before this year.

I presented packed bars in a poster at the IEEE VIS conference in Phoenix, and there are now implementations in JMP, R, Excel and D3.js. Ironically, the JMP script is the weakest implementation pending the arrival of JMP 14 in March 2018. Packed bars are only great for such data and have some learning curve, but skewed data is pretty common, especially in quality control (defect counts) and finance. I hope packed bars can become useful to others.

Low Water Immersion Tie-Dyes

As a follow-up to last week’s test dye run, I tried a few shirts with my newly certified dyes. I also wanted to try out a possible button-down shirt supply. It’s hard to find dress shirts that are both all cotton and cheap enough to experiment with. However, I found a 97% cotton dress shirt for $23 on Amazon and decided to give it a try.

I’m using the low water immersion technique from Paula Burch’s site. Basically, you cram the shirt into a jar, pour dye(s) on it, wait an hour, add concentrated fixer, wait another hour or so, and rinse. Very simple if you’re happy with a random pattern (which has a greater risk of flopping).

Here’s the dress shirt.

navyshirt - 1

I tried mixing two dye colors, Navy Blue and Camel (light brown), but I don’t see any trace of the Camel. Nonetheless, the shirt turned out pretty well. The 3% spandex doesn’t seem to have caused any dyeing issues. The seams are apparently nylon and didn’t take up any dye.

Each jar had about four cups of water and about eight teaspoons of dye, which seems to have been too much. The short-sleeve shirt was all Rust and doesn’t have as much variation as I was expecting from the test sample.

rustshirt - 1

This mix of Jade Green and Deep Yellow is my least favorite. Looks more like a laundry accident.

greenshirt - 1

Oh well, at least the dress shirt looks promising. Will order more of those.

Tie-dye Test Shirt

I haven’t been tie-dyeing much lately, and last time I tried, my expected green shirt came out brownish yellow. I was afraid all my dyes had gone bad, which apparently happens over time as they are exposed to moisture in the air. This week, I set out to get a sample of every dye I have. Not wanting to actually mix up 50 bottles of dye, I used an old shirt and put small smudges of dry dye on it in a grid.

dyesamples - 1 copy

The grid was set to match a wireframe shelf which I set down over the shirt to help keep each dye within the lines. So far so good. Then I poured soda ash water over the shirt, and things got screwed up a bit. I should have been gentler with the water application or placed another layer of fabric on top. Instead I got some splatter, but most of the color swatches can still be made out.

dyesamples - 2 copy

So it looks like most of the dyes are good. Below are labeled excerpts of each dye. It’s about the same layout as the shirt but rotated 90° clockwise.



A few colors are duplicated, reflecting how the collection has grown by “inheriting” dyes from retired tie-dyers.

I guess I was unlucky last year in using Moss Green, which is about the only bad color in the bunch. Chartreuse also looks a bit brownish.

How old is that globe?

Bonnie gave me this old globe for my 50th birthday last year, hoping it was also about 50 years old. There’s no date on it, but the seller thought it was from the 1960s.

globe - 4a

It wasn’t until I saw this recent XKCD flow chart for dating maps that I sought to get a better estimate of the globe’s age.


Simple as it is, I made a couple errors trying to traverse the diagram, beginning with the very first choice.

diagram closeup 1

My globe has Istanbul, but I went off on the left path, not noticing that the order of the labels was reversed from the order in the question. It didn’t take long to realize the error, but I almost didn’t catch my second error.

map closeup2

I was trying to follow the “no” branch from the Bangladesh node, and I mistakenly ended up at the United Arab Republic node. Apparently following a line is harder than I thought.

Once I had my errors fixed, the diagram dated my globe to the range 1961 to 1964. In the course of doing a little more exploring, I discovered the manufacturer of the globe, Replogle Globes, has a How Old is Your Globe? page. It’s a list of 100+ countries that have changed names since 1939. It’s not as fun as the XKCD chart, but it’s more complete, especially in the period of my globe, though I still had to supplement it with Wikipedia to get the exact dates of the name changes.

Interestingly, the results are both precise and slightly conflicting. My globe has Zambia, which was Northern Rhodesia until October 24, 1964, and it has the United Republic of Tanganyika and Zanzibar which only existed for six months ending on October 29, 1964. That gives an amazingly precise origin window of only five days. However, the globe also denotes Malta as being a British territory even though Malta gained independence on September 21, 1964. I’m guessing the globe is from late 1964 or early 1965 with the Malta indication (and possibly Tanzania) being out of date. Regardless, 50 years is about right for the age.

globeclose - 1 (2)

The globe is in pretty good condition and has two axes of rotation, making it easy to get a good look at Antarctica, for instance.

globe antarctica

Google Code Jam 2016

Another year, another Google Code Jam. Similar result this year: made it through qualification and Round 1, but got stymied in Round 2. The competition seemed tougher this year in both quantity and quality. Though the bar for qualification was raised, 22,000 coders made it to Round 1, up from 12,000 last year. There are three time slots for Round 1 with 1000 advancing from each slot. In the first slot, over 1000 coders had a perfect score (not me), and as a result 40+ perfect scores didn’t advance from that slot (time was the tie breaker).

Fortunately, the second slot was a little more forgiving and I was able to advance with a score of 56/100, good for position 933. The problem I missed, Technobabble, needed a Maximum Flow algorithm for an efficient solution, and I’ve since added one to my toolbox.
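I won’t reproduce the toolbox routine itself here, but a minimal max-flow in the Edmonds-Karp style (BFS augmenting paths, adjacency-matrix form for brevity; a real contest version would use adjacency lists) looks roughly like this:

```cpp
#include <algorithm>
#include <climits>
#include <queue>
#include <vector>

// Edmonds-Karp max flow: repeatedly find a shortest augmenting path with BFS
// and push the bottleneck capacity along it. cap[u][v] is the capacity of the
// edge u -> v; returns the maximum flow from s to t. Takes cap by value since
// it is consumed as the residual graph.
int max_flow(std::vector<std::vector<int>> cap, int s, int t) {
    int n = (int)cap.size(), flow = 0;
    for (;;) {
        std::vector<int> parent(n, -1);      // BFS tree for path recovery
        parent[s] = s;
        std::queue<int> q;
        q.push(s);
        while (!q.empty() && parent[t] == -1) {
            int u = q.front(); q.pop();
            for (int v = 0; v < n; ++v)
                if (parent[v] == -1 && cap[u][v] > 0) {
                    parent[v] = u;
                    q.push(v);
                }
        }
        if (parent[t] == -1) return flow;    // no augmenting path remains
        int add = INT_MAX;                   // bottleneck along the path
        for (int v = t; v != s; v = parent[v])
            add = std::min(add, cap[parent[v]][v]);
        for (int v = t; v != s; v = parent[v]) {
            cap[parent[v]][v] -= add;        // spend forward capacity
            cap[v][parent[v]] += add;        // grow the residual back-edge
        }
        flow += add;
    }
}
```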

In Round 2, I got stuck on the first problem which involved a giant rock-paper-scissors tournament. I did the small version with brute force, and tried multiple algorithms to increase efficiency for the large problem, mostly relying on memoization and symmetry. However, I needed to step back and rethink the problem instead of blasting ahead with code. I eventually did figure out the efficient solution just after time expired. The time pressure of the competition has an effect on thinking that needs to be overcome.

I’ve been using a C++ parallel processing harness for these problems, and that may have made me less concerned about finding the very best algorithm. When you have the right solution, it usually takes less than a second to execute even though you’re given eight minutes to submit results. If a subpar algorithm takes, say, 30 minutes for all the test cases, there is a chance it could still finish in time with parallel solving of the test cases. But this time, my subpar algorithm would take many hours to execute, so the parallelization couldn’t save me.

With that ringing endorsement, I’ve put my parallel solver harness on GitHub as parallelcodejam. I’ve used it a few years, and I think it was only beneficial once.

The parallel code does have one nice idea, I think, which is to use an atomic integer as an index into a work queue. That way, the queue itself doesn’t need to be thread-safe, and the synchronization overhead is minimal.
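A rough sketch of the idea (illustrative names, not the actual harness code): the only shared mutable state is an atomic counter handing out case indices, so neither the input vector nor the results vector needs a lock; each worker writes only to the slots it claimed.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Solve all test cases on nthreads workers. fetch_add on the atomic counter
// atomically claims the next unsolved case, so the "queue" is just a vector
// plus an integer, with minimal synchronization overhead.
std::vector<int> solve_all(const std::vector<int>& cases,
                           std::function<int(int)> solve,
                           int nthreads) {
    std::vector<int> results(cases.size());
    std::atomic<size_t> next{0};            // next unclaimed case index
    auto worker = [&]() {
        for (;;) {
            size_t i = next.fetch_add(1);   // claim one case atomically
            if (i >= cases.size()) break;   // all cases claimed; done
            results[i] = solve(cases[i]);
        }
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return results;
}
```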

Google Code Jam 2015 Round 1

I’ve made it through the Qualifying Round and Round 1 of this year’s Google Code Jam. You get three chances at Round 1, and this is the first year for me to advance after only one try. This was also the first time I used C++ instead of Java for the contest.

The most interesting problem to me in the qualifying round was Infinite House of Pancakes, where you essentially have to reduce a set of integers to 0 over a sequence of turns with two options at each turn: reduce every number by 1 or split any single number into two numbers (with the same sum). I correctly figured out you could do all the splitting up front, and coded an algorithm that would split the biggest number in half at each turn until the gain wasn’t worth it.

Seemed reasonable, but it failed to pass the small test set for the single number {9}. My algorithm would split it into {5, 4} and then reduce them down in 5 more steps, but it would be better to split it into {6, 3} followed by {3, 3, 3} and then reduce down in 3 more steps. So my halving algorithm had to be rethought. Fortunately there’s no real time pressure in the qualifying round, and I was able to work out a better algorithm that passed the small and the large test cases.
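One clean way to get this right, which agrees with the {9} example above (this is a sketch, not my contest code): try every target maximum pile size t, split every pile into pieces of size at most t up front (a pile of d needs ceil(d/t) − 1 splits), then eat for t turns.

```cpp
#include <algorithm>
#include <vector>

// Minimum number of turns to eat all piles: for each candidate target maximum
// pile size t, total turns = (splits to get every pile <= t) + t eating turns.
int min_turns(const std::vector<int>& piles) {
    int maxp = *std::max_element(piles.begin(), piles.end());
    int best = maxp;                            // t = maxp means no splits
    for (int t = 1; t < maxp; ++t) {
        int turns = t;
        for (int d : piles)
            turns += (d + t - 1) / t - 1;       // ceil(d/t) - 1 splits for pile d
        best = std::min(best, turns);
    }
    return best;
}
```

For {9}, t = 3 gives 2 splits plus 3 eating turns, matching the 5-turn answer.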

For round 1, the first problem, Mushroom Madness, was straightforward, and I had it completely done in under 15 minutes. I also figured out how to do the third problem, Logging, pretty quickly, but coding it was tricky and took most of my time. The problem involved convex hulls over point sets and I didn’t have any geometry routines handy to speed up the process. I had to get lucky that my code worked without too much debugging.

The second problem, Haircut, was the most interesting for me even though I couldn’t figure it out completely during the contest time limit (150 minutes). Given B barbers operating at different constant speeds and a queue of N customers, you had to figure out which barber the Nth customer would get. Seemed like a straightforward use of a priority queue to simulate all N haircuts. However, just as I was starting to code I noticed the problem said that N could be up to 1 billion, even for the small problem. A priority queue simulation would be about O(N * log(B)) — great for lots of barbers but not for lots of customers.

I coded the simulation anyway hoping it would either trigger a better idea or be fast enough for the 4-minute time limit (after all, my Mac Pro can do billions of operations every second). It wasn’t fast enough, but I thought of looking for a cycle and noticed the limits for the small problem were 5 barbers taking at most 25 minutes each. That meant I could afford to simulate the least common multiple (LCM) of all the barber times, after which the pattern would repeat, and that worked for the small test case.
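The repeat period is just the LCM of the barber times (a hypothetical helper, using C++17’s std::lcm), which stays tiny under the small limits of at most 5 barbers and times up to 25:

```cpp
#include <numeric>
#include <vector>

// Period after which the pattern of barber availability repeats.
// With <= 5 barbers and times <= 25, this is small enough to simulate.
long long times_lcm(const std::vector<long long>& times) {
    long long l = 1;
    for (long long t : times) l = std::lcm(l, t);
    return l;
}
```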

The large test case allowed 10,000 barbers, and I knew the LCM would be way too big to simulate, so I didn’t even attempt it. Not having any better ideas, I moved on to the last problem, which was fortunate because I needed almost all of the remaining time for the Logging code.

After sleeping on it (did I mention the contest started at 9pm local time on a Friday night?), I worked out a solution to the large Haircut problem the next day. Though I couldn’t compute b(n) in time, I could compute the inverse, n(b), fairly quickly. Since n(b) is monotonic, I could use a binary search to find where n(b) became the n I was looking for.

Implementing the binary search was a different story because part of my reason for using C++ was to force myself to learn more of the STL and idiomatic C++11. STL has a binary search function called lower_bound which expects to operate on iterators over a container. I couldn’t find a good way for it to operate on an indirect evaluation of a function. I tried defining a custom iterator, but it felt wrong, partly because each iterator needs its own reference to the function. I’ve since discovered Boost ranges which might do the job.

Instead I wrote my own indirect version of lower_bound that took an integer range and a function object to call.

// returns lowest number, n, for which f(n) >= value
// f must be non-decreasing
template<typename I, class T, class F>
I lower_bound_eval(I first, I last, const T& value, F f) {
   I count = last - first;
   while (count > 0) {
      I step = count / 2;
      I i = first + step;
      if (f(i) < value) {
         first = i + 1;
         count -= step + 1;
      } else {
         count = step;
      }
   }
   return first;
}

That worked, though it took me many hours more than the problem allowed. Without the 22 points for that problem, my final score was 78/100 which put me at #337 out of about 6000 in that round with the top 1000 advancing.

Data Science Specialization at Coursera

Last year, I took the nine online courses in the Data Science Specialization offered by Johns Hopkins University via Coursera. It could have been called “Data Science with R” since one whole course and a good part of the other courses were more about R programming than data analysis. I certainly learned more about R and the R ecosystem than about statistics and data science, but I already had some knowledge of the latter. The nine courses are:

  1. The Data Scientist’s Toolbox
  2. R Programming
  3. Getting and Cleaning Data
  4. Exploratory Data Analysis
  5. Reproducible Research
  6. Statistical Inference
  7. Regression Models
  8. Practical Machine Learning
  9. Developing Data Products

There’s also a “capstone” course/project, but it’s not offered as often, and the timing didn’t work for me.

Course Structure

Each course was four weeks, and the pattern for most weeks was about 30 minutes of video lectures and a multiple choice quiz. Some courses also had a project involving some data analysis, and those were the most educational parts of the series. There were also practice assignments and online discussion forums, but I found the Coursera site too sluggish to really use those casually.

As would be expected from courses like this, what you get out depends on the extra effort you put in. Someone completely new to the material would have to spend a lot of self-learning time because they’re not going to learn technical material like programming and statistics from a few video lectures. Unfortunately for me, the multiple choice quizzes made it a little too easy to get by with little effort. That’s partly a necessity of an online course, but being able to take each quiz three times seems to undermine any rigor. If you can eliminate just one of the four choices, you’re guaranteed to get each question correct by the third try.


Instructors. One thing that kept me in the series (besides my stubbornness to finish what I started) was the quality of the instructors: Brian Caffo, Jeff Leek and Roger D. Peng. They obviously have an enthusiasm for the subject and did a great job organizing the material. I especially liked that they would often do live R coding during the lectures (I’m sure there were some edits, but still…). Sometimes that’s where the best R tips were learned.

Material. I liked that they spent a good amount of time just on programming and just on reproducible results, two topics that could be ignored or brushed over and still pass for data science. For the modeling, they made a point of avoiding linear algebra, which was a nice change from the standard approach.

Projects. While still constrained by the time available, these longer assignments forced you to apply the material to perform some basic analysis and publish the result. The projects used real data sets which often required some clean-up/preprocessing, which was a good lesson in itself.


Peer grading. The longer assignments were better measures, but due to the compressed timing, they had to be kept pretty simple, and due to the large number of students, they had to be peer graded. For peer grading, each student had to grade four others. I can only hope they required some sort of agreement among duplicate graders because the grading guidance was minimal and coarse. (Is there a model provided? Yes/No. Is the model correct? Yes/No…)

The one course I didn’t get high marks in was the course I expected to be my best: Exploratory Data Analysis. I didn’t get good scores on the long assignment from the peer grading. I knew I wasn’t strictly following the assignment guidelines, but I hoped the graders would be more flexible. The assignment called for a report with something like three pages of text and an appendix with two pages of graphs. I prefer to put the graphs inline with the text, but some graders stuck with the literal guidelines and gave me zeros for that part of the grading (there was often no way of giving partial credit).

Coursera. Besides the sluggish forums and grading limitations already mentioned, I would like to go back to review parts of the material (partly so I could be more specific in this review), but the archives are no longer available. Last month at least some of the course archives were available, but now none are. Given the scope of the material, providing later access seems essential.

Bottom Line

For someone new to the field, I think these courses will be too brief to learn the material well, but they provide a great tour from some great tour guides and will help frame further study.