Calibrating Lax-ELO
The off-season is a great time to take stock of the year and think about what could have been done better. Unless you are a coach, in which case you have to go find the next generation of players to stock your roster.
But since I am not a coach, reflection is the order of the day. And one area where this can be very useful is in calibrating some of the models we use during the course of the season. I want to make sure everything is running in tip-top shape heading into 2020.
Nate Silver would be proud
Calibration is the idea of going back to look at your “predictions” and understand how the underlying models did in predicting outcomes. If there are obvious areas where the model was under- or over-confident, calibration highlights them.
To calibrate, you need a set of definitive outcomes to compare your predictions against. For me, the only modeling I do that fits that bill is the game-by-game Lax-ELO predictions and the Bracketology predictions.
Now that said, even if I did want to calibrate the Bracketology model, it would be of limited use. For starters, I made a relevant prediction on only around 4 or 5 teams. (I'm not going to claim credit for predicting, for example, that Duke would get in or that Bellarmine would be left out.) That is too small a sample size to be meaningful.
On the other hand, Bracketology makes some assumptions about how the committee will compare teams. For the most part, we assume they will look at RPI very closely (and I’ll continue to do so until they show something different). So if the model was inaccurate, it might be time to scrap the RPI-based model, but it wouldn’t help me much in terms of calibrating it.
So that leaves the game-by-game predictions, which often come to you courtesy of our trusty robot LaxRefBot. Every game produces a prediction, based on the Lax-ELO rating of the two teams and any home-field advantage that is involved. It’s a super simple model; 2 ELO ratings can be converted into a win probability pretty easily.
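For the curious, here is a minimal sketch of that conversion, using the standard logistic ELO formula. The 400-point scale is the conventional chess-style constant, and the 50-point home-field bump is a placeholder of my choosing, not necessarily the value LaxRefBot actually uses.

```python
# Sketch of a standard logistic ELO win-probability conversion.
# The 400-point divisor is the conventional constant; the 50-point
# home-field bump is a hypothetical value, not LaxRefBot's exact setting.

HOME_FIELD_BUMP = 50  # hypothetical rating bonus for the home team

def win_probability(elo_a, elo_b, a_is_home=False):
    """Probability that team A beats team B, given their ELO ratings."""
    if a_is_home:
        elo_a += HOME_FIELD_BUMP
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

# Example: a 1600-rated home team hosting a 1500-rated visitor
print(round(win_probability(1600, 1500, a_is_home=True), 3))  # ~0.703
```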
Report Card time
And as a result, we end up with a fairly rich data set of 2019 predictions (541 in all). This is plenty to assess how well the ELO model did. So let’s.

Not too shabby. The model was a bit under-confident when it came to heavy favorites (i.e., teams given a greater than 85% chance of winning). In other words, those teams ended up winning more often than the model predicted.
It certainly wasn’t perfect: in the 14 games where a team was given between a 20 and 25% chance to win, they actually came through with a victory in 6 of those contests (43%).
On the other hand, in the 41 games where a team was given between a 45 and 50% chance to win, they actually won 48% of the time.
Distilling accuracy down into one metric is a fairly complex calculation that I don’t want to do right now, so I’ll rely on the fit of the chart. With a correlation metric of .94, the ELO model was fairly accurate.
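If you want to reproduce a chart like this, the basic recipe is to group predictions into probability buckets and compare each bucket's average predicted win probability against its actual win rate. The 5-point bucket width and the use of a Pearson correlation below are my assumptions about the setup, not a description of the exact calculation behind the .94 figure.

```python
# Sketch of a simple calibration check: bucket predictions by predicted
# win probability, then correlate predicted vs. actual rates per bucket.
# Bucket width and Pearson correlation are assumptions, not the exact method.
from statistics import correlation, mean  # correlation requires Python 3.10+

def calibration_table(predictions, bucket_width=0.05):
    """predictions: list of (predicted_win_prob, won) tuples, won is 0 or 1."""
    buckets = {}
    for prob, won in predictions:
        key = int(prob // bucket_width)
        buckets.setdefault(key, []).append((prob, won))
    rows = []
    for key in sorted(buckets):
        games = buckets[key]
        rows.append((mean(p for p, _ in games),   # average predicted probability
                     mean(w for _, w in games),   # actual win rate in the bucket
                     len(games)))                 # number of games in the bucket
    return rows

def calibration_correlation(rows):
    """Correlation between predicted and actual win rates across buckets."""
    return correlation([r[0] for r in rows], [r[1] for r in rows])
```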
In praise of incrementalism
Compare that with 2018’s predictions, when the same chart looked like this:

Clearly, the 2018 model was a bit less consistent. The errors in each bucket tended to be a bit more pronounced. And if we go back to that correlation metric, 2018 scored just .84.
To tinker or not to tinker?
In truth, this comparison was all I was really concerned with. Could the 2019 predictions have been more accurate if I changed the settings for how the ELO model calculates ratings for each team? Perhaps.
But it’s not worth the effort to try and improve it…yet. And there are 2 reasons for that.
First of all, we had 2 new teams in Division 1 last year and they both started the season with league-average Lax-ELO ratings of 1500. By the end of the season, Utah had a rating of 1373 (56th) and St. Bonaventure had sunk to 1219 (64th). It’s very likely that their early season ratings overstated their team strength, which could easily have resulted in some games being mis-called.
Could I have set their ratings lower to start? Perhaps, but what is the right level? Maybe if we had more data about new teams coming into Division 1 and their approximate average team strength, then we could come up with a better “new-team-rating.” But until that happens, I’d rather not add any extra bias to the system. Their ratings will normalize soon enough.
The second issue is similar to the first point, but on a league-wide scale. With only 6 years of good data, the Lax-ELO model is just starting to round into shape. We started 2013 with every team at the same starting point (1500). Over the years, the good teams have risen to the top and the bad teams have dropped to the bottom.
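To make that concrete, here is a sketch of a standard ELO rating update after a single game, plus a quick look at how a mis-seeded 1500 rating drifts toward its true level. The K-factor of 32 is a conventional placeholder; the actual Lax-ELO settings (K-factor, margin-of-victory adjustments, and so on) may differ.

```python
# Sketch of a standard ELO rating update after one game. The K-factor of 32
# is a conventional placeholder, not necessarily the Lax-ELO setting.

K_FACTOR = 32  # hypothetical; controls how fast ratings move after each result

def update_ratings(elo_winner, elo_loser):
    """Return updated (winner, loser) ratings after a game."""
    expected_win = 1.0 / (1.0 + 10 ** ((elo_loser - elo_winner) / 400.0))
    shift = K_FACTOR * (1.0 - expected_win)
    return elo_winner + shift, elo_loser - shift

# A new team seeded at 1500 that keeps losing to 1500-rated opponents
# drops ~16 points per loss at first, then more slowly as the gap grows.
rating = 1500.0
for _ in range(5):
    _, rating = update_ratings(1500.0, rating)
print(round(rating, 1))  # roughly 1427 after five straight losses
```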
This gradual self-calibration of the ELO system should naturally produce more accurate predictions year over year. The challenge is to understand how much of the improvement comes from that and how much comes from random chance.
So we’ll be checking back eagerly next year to see if the prediction accuracies have improved again. Eventually, it will be clear that the ratings have stabilized because we won’t see any more year over year improvement.
At that point, it will be time to start to consider enhancements to our predictions, which will likely still be based on ELO as a foundation.
The shot clock?
You’ll notice that I haven’t even mentioned the shot-clock rules. Should we adjust the predictions based on stylistic differences between teams?
Who knows?
Calibration exercises do not tell you why your predictive model is working or not. They just help to understand if there are glaring issues that need to be addressed.
The shot clock may have caused the pre-season ELO ratings to be less accurate than normal (i.e., a better-prepared team may have been able to adapt to the new rules more quickly). But if you backed me into a corner, I’d say that the effect of another year of stabilization would dwarf any effect from the new rules.
Once we hit stabilization though, I do suspect that pacing will be one of the areas that we can look at to try and enhance the model.