Play-by-Play to EGA (a recipe)
But this way of thinking about stats leaves a lot of gaps in our understanding of what is really going on. Take points for example: a player can have more points for several reasons. Maybe they take all their team’s shots. Maybe they are just a really good shooter who takes full advantage of limited chances. Maybe their ability to force slides means that their teammates are always open for dunks on the doorstep.
And this is fine, except for the fact that points is the most common way that you’ll see players compared. It is not an apples-to-apples comparison even within position groups. Now try comparing the skills of a faceoff specialist vs an attacker.
If you want a way to compare players that accounts for the varying contexts, you need a better metric. For me, this was one of the earliest tasks in the LacrosseReference project, and it resulted in a metric that I call expected-goals-added (EGA). EGA takes note any time a player appears in the play-by-play and aggregates all those little contributions into a single score.
But before we can do this, we need to understand how each type of play adds value to (or takes value away from) a player’s team.
Fundamentally, if a certain type of play more often results in a player’s team scoring a goal than their opponent scoring a goal, we can say that it has a positive expected value.
In the simplest example, suppose Maryland commits a penalty and Notre Dame gets a 60-second man-up opportunity. Over the next minute, would you expect Maryland or Notre Dame to score more goals on average? Notre Dame, obviously, so we can say that the penalty committed by Maryland has a negative expected value. I’ll go out on a limb and say that this is not a stretch for those familiar with lacrosse…or averages.
The Good (raw) Stuff
In the list below, assume that Maryland is the one initiating each play. The Diff_per column shows the net goals per play over the next 60 seconds: the goals Maryland would be expected to score minus the goals Notre Dame would be expected to score. For example, a saved shot here means that Maryland shot on goal and Notre Dame saved it; we would expect Maryland to score .17 goals in the next 60 seconds to .12 for Notre Dame, a net of +.05 for Maryland.
These are the values I use for D1 MLAX. See below for the D1 WLAX data.
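For readers who want to see the mechanics, here is a minimal sketch of how values like these could be computed: count each play, count the goals scored for and against in the following 60 seconds, and divide by the number of times the play occurred. The event format (seconds elapsed, team, play type), the play labels, and the function name are all assumptions made for illustration; this is not the actual LacrosseReference code, and it ignores details such as period boundaries.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # look-ahead window used in the post
# Hypothetical labels for the two goal entries in the play-by-play.
GOAL_PLAYS = {"assisted_goal", "unassisted_goal"}

def play_values(games):
    """Return {play_type: (for_per, against_per, diff_per)}.

    `games` is a list of games; each game is a list of
    (seconds_elapsed, team, play_type) tuples sorted by time.
    """
    counts = defaultdict(int)
    goals_for = defaultdict(int)
    goals_against = defaultdict(int)
    for events in games:
        for i, (t, team, play) in enumerate(events):
            counts[play] += 1
            # Scan forward until we leave the 60-second window.
            for t2, team2, play2 in events[i + 1:]:
                if t2 - t > WINDOW_SECONDS:
                    break
                if play2 in GOAL_PLAYS:
                    if team2 == team:
                        goals_for[play] += 1
                    else:
                        goals_against[play] += 1
    return {
        p: (goals_for[p] / n, goals_against[p] / n,
            (goals_for[p] - goals_against[p]) / n)
        for p, n in counts.items()
    }

# Toy game, made up for illustration: three Maryland penalties,
# two of them followed by a Notre Dame goal within 60 seconds.
game = [
    (0, "MD", "penalty"),
    (30, "ND", "unassisted_goal"),
    (100, "MD", "penalty"),
    (200, "MD", "penalty"),
    (240, "ND", "assisted_goal"),
]
values = play_values([game])
print(values["penalty"])
```

In this toy game the penalty comes out with a negative net value, which matches the intuition from the Maryland and Notre Dame example.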
Once these values have been calculated, we aggregate each player’s contributions. I have developed a process that converts a game’s play-by-play log into a discrete list of “events”. These events are tagged to each player based on the play-by-play entry. Once this data set exists, it’s a simple matter of adding up the “value” of every event tagged to a given player. The end result: their EGA.
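The adding-up step can be sketched in a few lines. This is only an illustration under stated assumptions: the (player, play type) event format and the function name are hypothetical, and apart from the saved shot (.17 minus .12, from the example above) the play values are placeholders rather than the real numbers.

```python
from collections import defaultdict

# Net expected goals per play. Only saved_shot comes from the post
# (.17 for minus .12 against); the others are illustrative placeholders.
PLAY_VALUES = {
    "saved_shot": 0.05,
    "turnover": -0.04,   # placeholder
    "penalty": -0.10,    # placeholder
}

def ega_by_player(events, play_values):
    """Sum the value of every event tagged to each player.

    `events` is an iterable of (player, play_type) pairs. Play types
    without a known value contribute nothing (an assumption here).
    """
    totals = defaultdict(float)
    for player, play in events:
        totals[player] += play_values.get(play, 0.0)
    return dict(totals)

# Made-up events for two hypothetical players.
events = [
    ("Player A", "saved_shot"),
    ("Player A", "turnover"),
    ("Player B", "penalty"),
    ("Player A", "saved_shot"),
]
ega = ega_by_player(events, PLAY_VALUES)
for player, total in sorted(ega.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{player}: {total:+.2f}")
```

Sorting the totals from highest to lowest gives the kind of leaderboard that a running list of EGA leaders would show.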
I keep a running list of the current EGA leaders in the stats section of the site. I have dubbed it the Statistical Tewaaraton and you can use it to keep track of the leaders on both the men’s and women’s side.
Stats can be positive or negative. Credit should be too.
When I look at the raw data, a few things stand out, some expected (and helpful for confirming the validity of the model) and some not. First, if you commit a penalty, then you should expect to give up more goals than you score over the next 60 seconds. Makes sense. Same goes for turnovers and shot clock violations: if you give up possession, good luck scoring more than you give up.
On the flip side, the two goal entries are interesting. It’s minimal, but both assisted and unassisted goals have a positive expected value, meaning that in addition to the goal scored, the model says you should expect .03 to .04 goals in the next 60 seconds. It’s tiny, but does this speak to the value of momentum? Or perhaps it is less causal, just an implication of better faceoff men leading to more goals as well as more runs. Most likely, it’s just the fact that a team scoring a given goal is minutely more likely to be the better team, and so should score more goals. (Note to self: worth another post.)
I was surprised to see Missed Shot as the highest net goal getter in this list. (Mind you, our model differentiates between blocked shots, pipe shots, saved shots, and missed shots.) After thinking more about it, it probably makes sense; if you have a completely missed shot, it’s probably getting backed up, so you still have possession. Contrast this with a pipe shot, which could go anywhere. A missed shot also means you are in attacking position with, presumably, enough offensive action to warrant a shot. Contrast that with a ground ball, which means you have possession, but it may not be in the attacking end.
Ok, so what?
So what do we do with this information? Well, for one, we don’t start whipping shots over the goal from midfield. This is a purely descriptive analysis; if someone in a wacko-universe adjusted strategy on account of this, it would be like the snake eating its own tail.
For starters, it would be very interesting to see whether there are certain teams for whom these values are starkly different. Are there certain styles for which a missed shot is not in fact a good thing? Are certain teams better able to capitalize quickly on the change in momentum from a turnover? All interesting questions that speak to style, I suspect.
In addition, we posted earlier this month on our win-expectancy model, and these play-specific values are integral to that model. Time, score, and possession create a very effective framework for determining win expectancy. By looking at historical games, you can create a pretty accurate model based on the percentage of historical games in each situation that have gone to either team. In fact, our model borrows this framework almost exactly.
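To make the time-score-possession idea concrete, here is a minimal sketch of a lookup built from historical games. It rests on several assumptions that are not from our actual model: the snapshot format, the 60-second time buckets, and the 0.5 fallback for situations never seen before are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def bucket(seconds_left, score_diff, has_possession, time_step=60):
    """Group a game situation into a time-score-possession cohort.

    The 60-second time step is an assumption for this sketch.
    """
    return (seconds_left // time_step, score_diff, has_possession)

def build_table(snapshots):
    """Return {cohort: share of historical snapshots whose team went on to win}.

    `snapshots` holds (seconds_left, score_diff, has_possession, won)
    tuples, all from the perspective of one team.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for seconds_left, score_diff, has_possession, won in snapshots:
        key = bucket(seconds_left, score_diff, has_possession)
        totals[key] += 1
        wins[key] += int(won)
    return {key: wins[key] / totals[key] for key in totals}

def win_expectancy(table, seconds_left, score_diff, has_possession, default=0.5):
    """Look up the cohort; fall back to `default` if it was never observed."""
    return table.get(bucket(seconds_left, score_diff, has_possession), default)

# Made-up history: teams down one goal inside the final minute.
history = [
    (45, -1, True, True),
    (50, -1, True, False),
    (30, -1, True, False),
    (45, -1, False, False),
]
table = build_table(history)
print(win_expectancy(table, 40, -1, True))
```

With only a handful of made-up snapshots the percentages mean nothing; the point is the shape of the lookup, which only becomes useful with thousands of historical games behind it.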
The small tweak, made possible by our play-by-play data, is to adjust those numbers based on the flow of the game. For example, if a team is down by one goal with 1 minute left, I’d be immensely more confident in my team’s chances if they’ve been whipping high-quality shots just over the upper corners for the last 90 seconds than if they’d just recovered the ball in their own end and had to complete a clear and then get into the offense. In other words, the most recent plays have some power to predict what comes next. We think that this is very useful if you are publishing a real-time win expectancy score.
Think of it this way: time-score-possession allows you to identify which historical cohort of games your current situation is most like. And again, that is an incredibly useful benchmark when estimating who is more likely to win a given game. But in our estimation, that approach does not take into account the teams and game flow of the specific game you are looking at. Time-score-possession gives a plodding team the same odds of coming back from 4 goals down as a lightning-fast team. But we know intuitively that a team with a more quick-strike style of play is more likely to come back from that deficit. (They are also more likely to lose by even more than the current score, but if you are just trying to predict who wins, this is irrelevant.)
And in the future, this approach also opens the door to more sophisticated tweaks to the model. Currently, we don’t look at the plays a team has used to get to where they are. For example, a team that just expended a ton of energy to cut a 6-goal deficit to 2 is probably less likely to win than a team that was up 1 and just gave up a quick three-goal run to go down two. Knowing the plays that a team executed to get to where they are now could really help to quantify that disparity. An analysis for another day though.
As of Feb 2021, our model contains 8,180 D1 WLAX and D1 MLAX games going back to 2015. Play values are calculated by counting the number of times that each play occurs, then counting the number of goals that are scored within the next 60 seconds, for and against, and then dividing by the number of times each play occurred. I cited the NCAA D1 Men’s play values above. The women’s play values look like this:
February 7, 2023 @ 10:52 pm
Play values are listed for DI Men and Women. Are They also available for D3 schools?
thanks, Jim O’Connell
February 8, 2023 @ 4:02 pm
They are indeed. Just sent you an email with the raw data from ’22.