Estimating the goal value of each type of play
With enough data points, it becomes possible to calculate the expected effect of a given play on goal outcomes. We can argue about how many occurrences you need for a model to converge on an expected value, but fundamentally, if a certain type of play more often results in a goal for that team than for their opponent, we can say that it has a positive expected value.
In the simplest example, suppose Maryland commits a penalty and Notre Dame gets a 60-second man-up opportunity. Within the next minute, would you expect Maryland or Notre Dame to score more goals on average? Notre Dame, obviously, so we can say that the penalty committed by Maryland has a negative expected value. I’ll go out on a limb and say that this is not a stretch for those familiar with lacrosse…or averages.
In the list below, assume that Maryland is the one initiating each play. The For_per column shows how many goals Maryland would be expected to score in the next 60 seconds, Against_per shows the same for Notre Dame, and Diff_per is the difference. (For example, a saved shot here means that Maryland shot on goal and Notre Dame saved it; we would expect Maryland to score .17 goals in the next 60 seconds to .12 for Notre Dame.)
| Play | For_per | Against_per | Diff_per |
| --- | --- | --- | --- |
| Shot Clock On | 0.301 | 0.090 | 0.211 |
| Penalty – 0 sec | 0.029 | 0.143 | -0.114 |
| Penalty – 3 min | 0.107 | 0.250 | -0.143 |
| Shot Clock Violation | 0.040 | 0.184 | -0.144 |
| Penalty – 2 min | 0.106 | 0.394 | -0.288 |
| Penalty – 30 sec | 0.052 | 0.396 | -0.344 |
| Penalty – 1 min | 0.053 | 0.427 | -0.374 |
A few things stand out, some expected (and helpful for confirming the validity of the model) and some not. First, if you commit a penalty, then you should expect to give up more goals than you score over the next 60 seconds. Makes sense. Same goes for turnovers and shot clock violations: if you give up possession, good luck scoring more than you give up.
On the flip side, the two goal entries are interesting. It’s minimal, but both assisted and unassisted goals have a positive expected value, meaning that in addition to the goal scored, the model says you should expect another .03 to .04 goals in the next 60 seconds. It’s tiny, but does this speak to the value of momentum? Or perhaps it is less causal: just an implication of better face-off men leading to more goals as well as more runs. Most likely, it’s just the fact that the team scoring a given goal is minutely more likely to be the better team, and better teams score more goals. (Note to self: worth another post.)
I was surprised to see Missed Shot as the highest net goal-getter in this list, though after thinking more about it, it makes sense. (Mind you, our model differentiates between blocked shots, pipe shots, saved shots, and missed shots.) If a shot misses the cage completely, it’s probably getting backed up, so you still have possession. Contrast this with a pipe shot, which could go anywhere. A missed shot also means you are in attacking position with, presumably, enough offensive action to warrant a shot. Contrast that with a ground ball, which means you have possession, but it may not be in the attacking end.
So what do we do with this information? Well for one, we don’t start whipping shots over the goal from midfield. This is a purely descriptive analysis; if someone in a wacko-universe adjusted strategy on account of this, it would be like the snake eating its own tail.
For starters, it would be very interesting to see whether there are certain teams for whom these values are starkly different. Are there certain styles for which a missed shot is not in fact a good thing? Are certain teams better able to capitalize quickly on the change in momentum from a turnover? All interesting questions that speak to style, I suspect.
In addition, we posted earlier this month on our win-expectancy model, and these play-specific values are integral to that model. Time, score, and possession create a very effective framework for determining win expectancy. By looking at historical games, you can create a pretty accurate model based on the percentage of historical games in each situation that have gone to either team. In fact, our model borrows this framework almost exactly.
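The historical-cohort lookup described above can be sketched quickly. This is a hypothetical illustration, not the authors' actual model: bucket each historical game state by time remaining, score differential, and possession, then look up the fraction of games in that bucket that the team in that state went on to win. The `bucket` helper and its granularity are assumptions for the sketch.

```python
from collections import defaultdict

def bucket(seconds_left, score_diff, has_possession):
    # Coarse buckets: one-minute time bins, score differential capped at +/-5.
    # Granularity here is an assumption, not the actual model's.
    return (seconds_left // 60, max(-5, min(5, score_diff)), has_possession)

def build_win_table(historical_states):
    # historical_states: iterable of (seconds_left, score_diff, possession, won)
    counts = defaultdict(lambda: [0, 0])  # bucket -> [wins, total]
    for sec, diff, poss, won in historical_states:
        key = bucket(sec, diff, poss)
        counts[key][0] += int(won)
        counts[key][1] += 1
    # Win expectancy = share of historical games in this bucket that were won
    return {k: wins / total for k, (wins, total) in counts.items()}

# Toy data: a team up one goal with possession late wins 9 times out of 10.
states = [(45, 1, True, True)] * 9 + [(45, 1, True, False)]
table = build_win_table(states)
print(table[bucket(45, 1, True)])  # -> 0.9
```

The flow-of-game adjustment discussed next would then nudge these baseline percentages up or down based on recent plays.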
The small tweak, made possible by our play by play data, is to adjust those numbers based on the flow of the game. For example, if a team is down by one goal with 1 minute left, I’d be immensely more confident in my team’s chances if they’ve been whipping high quality shots just over the upper corners for the last 90 seconds than if they’d just recovered the ball in their own end and had to complete a clear and then get into the offense. In other words, the most recent plays have some predictive power in what comes next. We think that this is very useful if you are publishing a real-time win expectancy score.
Think of it this way: time-score-possession allows you to identify which historical cohort of games your current situation is most like. And again, that is an incredibly useful benchmark when dealing with who is more likely to win a given game. But in our estimation, that approach does not take into account the teams and game flow of the specific game you are looking at. Time-score-possession gives a plodding team the same odds of coming back from 4 goals down as a lightning-fast team. But we know intuitively that a team with a more quick-strike style of play is more likely to come back from that deficit. (They are also more likely to lose by even more than the current score, but if you are just trying to predict who wins, this is irrelevant.)
And in the future, this approach also opens the door to more sophisticated tweaks to the model. Currently, we don’t look at the plays a team has used to get to where they are. For example, a team that just expended a ton of energy to cut a 6-goal lead to 2 is probably less likely to win than a team that was up 1 and just gave up a quick three-goal run to go down one. Knowing the plays that a team executed to get to where they are now could really help to quantify that disparity. An analysis for another day though.
As of November 2016, our model contains 888 NCAA DI Men’s lacrosse games from the 2015 and 2016 seasons. Play values are calculated by counting the number of times that each play occurs, then counting the number of goals that are scored within the next 60 seconds, for and against, and then dividing by the number of times each play occurred. Full data is here:
| Play | Count | Goals For | Goals Against | Diff | For_per | Against_per | Diff_per |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Shot Clock On | 1493 | 449 | 134 | 315 | 0.301 | 0.090 | 0.211 |
| Penalty – 0 sec | 35 | 1 | 5 | -4 | 0.029 | 0.143 | -0.114 |
| Penalty – 3 min | 28 | 3 | 7 | -4 | 0.107 | 0.250 | -0.143 |
| Shot Clock Violation | 174 | 7 | 32 | -25 | 0.040 | 0.184 | -0.144 |
| Penalty – 2 min | 66 | 7 | 26 | -19 | 0.106 | 0.394 | -0.288 |
| Penalty – 30 sec | 2443 | 127 | 967 | -840 | 0.052 | 0.396 | -0.344 |
| Penalty – 1 min | 3151 | 167 | 1347 | -1180 | 0.053 | 0.427 | -0.374 |
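The per-play rates in the table follow directly from the raw counts, as described above: goals for and goals against within 60 seconds of each play, each divided by the number of occurrences. A minimal sketch, using two rows from the table:

```python
# Raw counts from the full-data table: (occurrences, goals for, goals against)
raw = {
    "Shot Clock On":   (1493, 449, 134),
    "Penalty - 1 min": (3151, 167, 1347),
}

def play_values(count, goals_for, goals_against):
    # For_per and Against_per are per-occurrence rates; Diff_per is their difference.
    for_per = goals_for / count
    against_per = goals_against / count
    return round(for_per, 3), round(against_per, 3), round(for_per - against_per, 3)

for play, (n, gf, ga) in raw.items():
    print(play, play_values(n, gf, ga))
# Shot Clock On   -> (0.301, 0.09, 0.211)
# Penalty - 1 min -> (0.053, 0.427, -0.374)
```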
For the sake of double checking, we ran the data on the 408 games from 2015 vs the 480 from 2016 to make sure that the values were consistent. The R² value was .99, so yes, these stand up across years.
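A consistency check in this spirit can be done by computing the squared Pearson correlation between the per-play values from the two seasons. The numbers below are toy values for illustration, not the actual 2015/2016 splits:

```python
def r_squared(xs, ys):
    # Squared Pearson correlation: cov(x, y)^2 / (var(x) * var(y))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

# Hypothetical Diff_per values for the same plays in two seasons
vals_2015 = [0.30, -0.11, -0.15, -0.29, -0.34, -0.37]
vals_2016 = [0.31, -0.12, -0.14, -0.28, -0.35, -0.38]
r2 = r_squared(vals_2015, vals_2016)
print(round(r2, 3))
```

A value near 1 indicates the play values line up across the two seasons.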
March 23, 2017 @ 8:55 pm
“but both assisted and unassisted goals have a positive expected value, meaning in addition to the goal scored, the model says you should expect .03 to .04 goals in the next 60 seconds. It’s tiny, but this speaks to the value of momentum?”
Or it’s a statistical fluctuation? What’s the estimated error on this number?
March 23, 2017 @ 9:12 pm
What kind of error metric did you have in mind? In terms of the raw data (table also included at the bottom of the post), there were 9,819 assisted goals and 7,565 unassisted goals in the data set, so the sample size is substantial.
February 6, 2018 @ 12:46 pm
Let’s take the 9819 assisted goals. If I’m understanding your table correctly, the team that scored went on to score another goal in the next minute 1479 times, and the team that was scored on went on to score a goal in the next minute 1115 times. Assuming goal-scoring is a Poisson process, the standard-deviation error on each number is the square root of each: 38.46 for 1479, and 33.39 for 1115, and these errors are independent. The difference 1479-1115=364 then has an error which is the square root of the sum of the squares of the errors on each number, which is the square root of the sum 1479+1115=2594; this is 50.93. The difference 364 is 7.15 times the standard deviation 50.93, and is therefore very unlikely to be a statistical fluctuation.
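The arithmetic in the comment above works through as follows, using the quoted counts:

```python
import math

# Counts quoted in the comment: goals scored within a minute of an assisted
# goal, by the scoring team and by the team scored on.
goals_for, goals_against = 1479, 1115

diff = goals_for - goals_against           # 364
# Independent Poisson errors (sqrt of each count) add in quadrature,
# so the error on the difference is sqrt(1479 + 1115) = sqrt(2594).
err = math.sqrt(goals_for + goals_against)
sigmas = diff / err

print(round(err, 2), round(sigmas, 2))  # -> 50.93 7.15
```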
This is the sort of error analysis I wish you would do. Merely saying “the sample size is substantial” is not enough to draw any actual conclusions, you have to work through the numbers.