Exploring Insurance Pricing Data

One of the first articles on the new investigative journalism site, The Markup, is about Allstate’s proposed changes to Maryland’s insurance premiums, supposedly driven by an “advanced algorithm.” The Markup got the data and found a simple and embarrassing model behind the new proposals. However, the most interesting parts for me were in the companion “show your work” article. Not only are there deeper details about the analysis (with graphs!) but also a link to all the raw data and analysis files.

The main data file has information on premiums for 93,000 Allstate customers, and I’m going to focus on three variables, using names from the data file:

  • Current premium: what the customer was originally paying
  • Indicated premium: Allstate’s calculation of what the premium should be, based on a risk assessment. Called “ideal price” in the article.
  • Selected premium: what Allstate was asking to have approved as the new premium, based on the secret algorithm that would take many variables into account and move the new premium in the direction of the indicated premium. Called “transition price” in the article.

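To make the comparisons below concrete, here is a minimal pandas sketch of the two differences the rest of the post explores. The column names and dollar values are made up for illustration; this is not the author's actual workflow (the graphs were made in JMP) and the real data file's column names may differ:

```python
import pandas as pd

# Toy stand-in for the three premium columns described above (values invented)
df = pd.DataFrame({
    "current":   [800.0, 1200.0, 950.0],   # what the customer was paying
    "indicated": [700.0, 1500.0, 900.0],   # risk-based "ideal" price
    "selected":  [799.5, 1260.0, 950.0],   # proposed "transition" price
})

# The two differences examined below: risk model vs. current, and proposal vs. current
df["indicated_change"] = df["indicated"] - df["current"]
df["selected_change"] = df["selected"] - df["current"]
```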

At first glance all the premium variables look pretty innocent. Here are the distributions.

I’ve truncated the axis so that a few outliers don’t distract from the core group. An important take-away to keep in mind is that most of the premiums (62%) are under $1000.
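The "share under $1,000" figure is a one-liner in pandas; the premiums here are invented, but the same expression on the real file should reproduce the 62% reported above:

```python
import pandas as pd

premiums = pd.Series([450, 820, 1500, 600, 2300, 990])  # toy current premiums

# Fraction of customers whose premium is below $1,000
share_under_1000 = (premiums < 1000).mean()
```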

The difference between the indicated and current premiums also looks normal – about half the time the indicated values are less than the current values.

It’s when you compare the selected premium against the current premium that things start to look a little suspicious.

A couple of things stand out: the differences are much smaller overall, and they are skewed, with the negative differences being even smaller. Many of the negative differences (about 10,000) are less than one dollar. Apparently (read the article) the small increases are for the sake of customer retention, and the nominal decreases are to meet the promise of moving in the “general direction of the new risk model,” aka the indicated premium.
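Counting those sub-dollar "decreases" is a simple filter. A sketch with invented dollar changes (the real file would use the selected-minus-current column):

```python
import pandas as pd

# Toy selected-minus-current dollar changes (invented values)
selected_change = pd.Series([-0.40, -0.75, 12.0, -5.0, 48.0])

# Nominal decreases: negative, but by less than one dollar
nominal_decreases = ((selected_change < 0) & (selected_change > -1)).sum()
```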

Looking at the selected change as a percentage of the current premium is where things start to get really fishy.

Almost all of the changes are around 0%, 5% or 20%. I think the 20% group is what the article refers to as the “suckers list,” but, as we’ll see next, those customers are actually paying less than their indicated rate suggests and the customers in the 0% group are actually the ones losing out.
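One way to surface those clusters without a histogram is to round the percentage change and tabulate it. Again a toy sketch, with values contrived to land on the three clusters:

```python
import pandas as pd

current = pd.Series([1000.0, 800.0, 2000.0, 500.0])   # invented
selected = pd.Series([1000.5, 840.0, 2400.0, 500.0])  # invented

# Selected change as a percentage of the current premium
pct_change = (selected - current) / current * 100

# Round to the nearest percent to reveal the 0% / 5% / 20% clusters
clusters = pct_change.round().value_counts()
```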

Scatterplot make-over

We’ve already gained quite a bit of insight just from looking at distributions of the three variables and their differences, but what I was really after when I started was a better multi-dimensional view of the data. Even though there are only three variables, it’s not easy. I think the terminology is part of it, which may be why The Markup invented their own terms. In their “show your work” article, they modeled the selected premium as a function of the other two variables. Their model looked really good from a p-value perspective, but that only means it was definitely better than nothing. They shared this graph using the residuals from their model.

I can tell it’s showing something important, but I’m not sure what. (Interestingly, it almost looks like a smoking gun.) I know the residuals shouldn’t have any pattern, but I find it hard to translate the pattern I’m seeing back to the original data in my mind.

I started looking for more direct representations (without involving the model) to get the equivalent insight in plainer terms. I ended up with this graph, which still needs some explaining.

Compared to the original, I’ve replaced residuals on the x axis with the indicated change, as a percentage of the current price. In theory, the selected change would be some function of the indicated change. I’ve added a gray diagonal line to show where the selected change equals the indicated change. So we see the clusters at 0%, 5% and 20% as we did in the original scatterplot and in the earlier histogram, but now we can see how they relate to the indicated change, including for the values not in those clusters.

I’ve also colored the points by the current premium, partly to link the relative values to absolute dollar amounts and partly to confirm The Markup’s finding that mostly high-dollar customers were put into the 20% group.

The pattern is so stark, you may not believe it’s a scatterplot, so I’ll zoom in.

And again, showing the 0% and 5% connection.

And once more, right around the 0% bend. It’s interesting how the points look randomly jittered within 0.5% of the target, but these are the actual values. Also, I have no explanation for the little jig around 0%.

What does this scatterplot show us? Here’s an annotated version, zoomed out.

Dots above the diagonal line have a selected premium more expensive than the indicated premium. That is, those customers would be paying more than they merit (according to the indicated risk model). The annotations call out four groups:

  1. The left-side group at 0% (“decreases”) are missing out on a larger discount. This is the largest group.
  2. The right-side group at 5% (“small increases”) have a selected premium with smaller increase than is merited, for the sake of retention.
  3. The right-side group at 20% (“large increases”) have a selected premium with a smaller increase than is merited, for the sake of retention. But because their current premiums are higher than those in group 2, they apparently have a higher retention tolerance.
  4. The group along the diagonal (not categorized in original article) have an indicated increase less than their retention threshold and have a matching selected premium.
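The above-the-diagonal test behind that grouping is just a comparison of the two percentage changes. A toy sketch (percent values invented, one per group, in order):

```python
import pandas as pd

# Invented percent changes, one point per annotated group (1-4)
indicated_pct = pd.Series([-15.0, 8.0, 30.0, 3.0])
selected_pct = pd.Series([0.0, 5.0, 20.0, 3.0])

# Above the diagonal: paying more than the risk model says they merit
above_diagonal = selected_pct > indicated_pct
```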

You might look at this plot and think Allstate is being generous in charging less to all the customers in those long right-side groups. However, the number of dots doesn’t convey well when you’ve got 93,000 dots in a small space, even with some transparency enabled as I have done. The annotations include the counts to help a little. Here is a heatmap version, colored by the number of customers in each cell.

You can barely see some of the faint green cells that have under 1000 customers in them.
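The heatmap’s two-dimensional binning can be approximated with numpy’s histogram2d. The bin width (5 percentage points) and the data points here are arbitrary choices for illustration:

```python
import numpy as np

# Invented percent-change pairs (indicated vs. selected)
indicated_pct = np.array([-15.0, 8.0, 30.0, 3.0, 8.5])
selected_pct = np.array([0.0, 5.0, 20.0, 3.0, 5.0])

# Count customers in each (indicated, selected) cell on a 5-point grid
counts, xedges, yedges = np.histogram2d(
    indicated_pct, selected_pct,
    bins=[np.arange(-20, 41, 5)] * 2,
)
```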

How to combine the count info and the current premium for each position? We could bin the x and y coordinates to get an aggregated group for each combination of rate changes and draw them as bubbles colored by the mean current premium and sized by the count.
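That aggregation step can be sketched with a pandas groupby; rounding to the nearest percent stands in for the binning, and all the values are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "indicated_pct": [8.2, 8.4, -15.0, 30.1],       # invented
    "selected_pct":  [5.0, 5.0, 0.0, 20.0],         # invented
    "current":       [600.0, 700.0, 900.0, 2500.0], # invented
})

# Bin both rate changes to the nearest percent, then aggregate each cell:
# count -> bubble size, mean current premium -> bubble color
binned = df.assign(x=df["indicated_pct"].round(), y=df["selected_pct"].round())
bubbles = (binned.groupby(["x", "y"])
                 .agg(count=("current", "size"),
                      mean_current=("current", "mean"))
                 .reset_index())
```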

Not bad.

Wrap up

There are plenty more angles covered in the article, including breakdowns by age, gender and location. I haven’t even gotten into sums. Interestingly, the sums of the increases and decreases almost exactly balance out. The more I think about it, it seems the article got it wrong by saying the big spenders were on a “suckers list” because they’re getting a 20% increase instead of a 5% increase. They’re still paying less than they should, and the savings are being offset by the real suckers, those who deserve a reduction they aren’t getting.
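Checking that balance is a matter of summing the positive and negative dollar changes separately. A toy sketch with values contrived to nearly cancel:

```python
import pandas as pd

# Invented selected-minus-current dollar changes
selected_change = pd.Series([120.0, -80.0, 60.0, -99.0])

increases = selected_change[selected_change > 0].sum()
decreases = selected_change[selected_change < 0].sum()
net = increases + decreases  # near zero when increases offset decreases
```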

I hope others have taken or will take the opportunity to explore this data.


I have Allstate car insurance. I am not a statistician or a journalist or an actuary. All my graphs were made in JMP, which I help develop.