Predicting 4-year value from Freshman stats
We’ve been doing a series of posts on geography in lacrosse, but today we’ll take a quick detour off that route. Today, we’ll talk about projecting production levels. In reality, this is a necessary detour in our geography series. But whether you’ve been following that or not, this post works as a standalone as well.
The reason that it’s a necessary detour is simple: to continue our geography series, we need to look at the effect of player talent, program pedigree, and geography on recruiting decisions. We can estimate program pedigree using historical records. We already know the geography piece. That leaves the player talent piece.
We talked about ideas for the model in a previous post, but the basic idea is to try and figure out whether the talent of the player affects which school they end up at and how geography plays a role. Do top tier talents pick the best pedigree’d schools, regardless of how close they are to home? In order to answer that, we need to have a way to approximate the talent level of each player. And since we want to include all 4 years of data in our analysis, that means we’d need to include last year’s freshmen.
How to go about projecting production levels
It would be simple to use our play value expectations to total up each player’s contribution for their career. But to have an apples to apples comparison, we’d need to separate the four classes into their own groups so that we’d be comparing seniors only to seniors, juniors to juniors, etc. If you didn’t, you’d conclude that a mediocre senior is way better than the best freshman, simply because he has had three more seasons to pile up plays. You could divide each player’s total output by the number of seasons played, but you’d still end up skewing things because production isn’t linear across a career.
So instead, we decided to see if it would be possible to use freshman year production stats to project career production. Going into this post, I had no idea whether the model would prove predictive at all. There are so many factors that affect production, not least injuries, so it wouldn’t have been surprising to see that there was no reliable way to use one year to project out a whole career. But hey, science is trial and error, so if it didn’t work out, we could publish and save some poor schmuck the same pain and suffering at some point in the future.
So what did we find…
You can use freshman year stats to project 4-yr production!
It’s by no means a perfect model. And I will be the first to call out the lack of sophistication. But it does seem like freshman year stats are a good enough input to be able to project the full production of a player over 4 years. For any one player, it could obviously be way off, but for the purposes of building a projection estimation model, it’s passable.
I won’t go into a ton of detail on the model here (more at the end of the post), but the basic steps were:
- Count up the various play-specific contributions (e.g. shots, goals, penalties) for each player in each season
- Use a Monte Carlo simulation on seniors with full data to identify which play types increase predictive power (yay feature engineering!)
- Use a basic linear regression model to assign coefficients to each play type (using an 80%/20% training/test split)
- Apply the coefficients to the freshman class from ’17 to project production, giving us an apples to apples comparison of underclassmen and seniors
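For readers who want to see the shape of those steps in code, here is a minimal sketch. It is not the actual model: the senior data is randomly generated, the three play types and the weights in `TRUE_WEIGHTS` are made up for illustration, and the least-squares fit is a hand-rolled stand-in for whatever regression library you prefer.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations.
    Returns [intercept, coef_1, coef_2, ...]."""
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Apply fitted coefficients to one player's freshman counts."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Made-up seniors: (goals, shots, ground balls) as freshmen, plus career production.
# TRUE_WEIGHTS is an assumption used only to generate fake data.
TRUE_WEIGHTS = (3.0, 0.5, 0.2)
rng = random.Random(0)
seniors = []
for _ in range(385):
    goals = rng.randint(0, 40)
    shots = goals + rng.randint(0, 60)
    ground_balls = rng.randint(0, 50)
    freshman_counts = (goals, shots, ground_balls)
    career = sum(w * v for w, v in zip(TRUE_WEIGHTS, freshman_counts)) + rng.gauss(0, 5)
    seniors.append((freshman_counts, career))

# 80/20 training/test split.
rng.shuffle(seniors)
cut = len(seniors) * 4 // 5
train, test = seniors[:cut], seniors[cut:]

# Calibrate on the training set, then check the held-out seniors.
coefs = fit_linear([f for f, _ in train], [y for _, y in train])
mean_abs_error = sum(abs(predict(coefs, f) - y) for f, y in test) / len(test)

# Project a (hypothetical) freshman: 20 goals, 60 shots, 15 ground balls.
projection = predict(coefs, (20, 60, 15))
```

Because the fake data really is linear, the fit recovers the made-up weights almost exactly. Real player data will be much noisier than this.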
Again, more detail on the model and the limitations of the model at the end of the post. But for now, here’s how it did. The chart below shows the results of the model calibration.
There were 385 seniors that had sufficient play data in all four years to build a projection. I split the population 80/20 into a training and test set. The model was calibrated on the training set, and the resulting coefficients were used to “project” the career production of the remaining 75 players who were not included in the calibration.
The chart plots the projected production against the actual production. Again, these are all seniors, so we know what they did as freshmen, and we know what they did over 4 years. The chart demonstrates that directionally, a linear regression model can approximate career production reasonably well using just freshman-year stats.
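If you wanted to put a number on "reasonably well," two common summaries of a projected-vs-actual scatter are the Pearson correlation and R². The sketch below computes both. The five data points are invented placeholders, not values from the chart.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def r_squared(projected, actual):
    """Share of the variance in actual production explained by the projection."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(projected, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Placeholder test-set values (hypothetical, for illustration only).
projected = [40.0, 85.0, 120.0, 150.0, 210.0]
actual = [55.0, 70.0, 130.0, 140.0, 230.0]

correlation = pearson(projected, actual)
fit_quality = r_squared(projected, actual)
```

A correlation near 1 only says the ordering is right; R² also penalizes projections that are consistently too high or too low.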
And when you consider that we didn’t account for roster construction or injuries, I think that is a pretty solid result. Enough to pass the sniff-test so that we can use our simple model to project the current crop of rising sophomores.
All Hail King Walker
Based on our model, we would project Ethan Walker to have the most productive career of any 2017 freshman. Of course, he’ll be asked to shoulder a larger load for the Pios after the graduation of Connor Cannizzaro, so an uptick in production wouldn’t be a surprise. As a reminder, the value in the chart is expected goals added. Expected goals are earned by doing anything that leads to a goal (e.g. shots, ground balls). They are lost by doing things that tend to lead to the other team scoring (e.g. turnovers).
When you apply the model developed in prior steps to this past year’s freshman class, the expected career production list looks like this:
|Player|School|Projected career expected goals added|Rank|Closest senior comp|
|---|---|---|---|---|
|Ethan Walker|Denver|318.25|1|Dylan Molloy|
|Tre Leclaire|Ohio State|273.32|3|Connor Cannizzaro|
|Jeff Teat|Cornell|258.24|4|Matt Rambo|
|Mac OKeefe|Penn State|229.81|5|Tucker James|
|Michael Kraus|Virginia|206.41|6|Jack Bruckner|
|Luke Mccaleb|Brown|186.05|8|Jack Curran|
|Aaron Forster|NJIT|171.44|10|Luke Goldstock|
|Joey Manown|Duke|166.56|13|Joe Seider|
|Dox Aitken|Virginia|151.64|15|Ian King|
For a bit of context, 318.25 career expected goals works out to roughly 5 expected goals added per game if you assume 16 games per season and 4 seasons in a career. In the quarterfinal round, Connor Kelly amassed 5 expected goals added for the Terps in their win over Albany. He scored 5 goals on 9 shots, picked up 2 ground balls, and had 2 turnovers. There are many ways to achieve 5 expected goals in a single game, but that’s one example. The model projects Ethan Walker to average a stat line like that.
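The per-game conversion is just the career total divided by the number of games. Here it is as a tiny helper, using the same 16-game, 4-season assumption as above (the function name is mine):

```python
def per_game(career_total, games_per_season=16, seasons=4):
    """Convert a career expected-goals total into a per-game average."""
    return career_total / (games_per_season * seasons)

walker_per_game = per_game(318.25)
print(round(walker_per_game, 2))  # 4.97
```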
Anyway, if you think back to last season, these were most of the names that you heard announcers fawning over. But there are also names from some of the more lightly-regarded teams, including NJIT, Delaware, Lehigh, etc. If you’ve got a guy coming off a freshman season that put him on this list, you’ve got to like your chances over the next several years. That probably goes double for Stony Brook, who managed to get 2 guys onto this list.
For fun, we’ve also listed the closest comp for each player when you compare their projected career production to the class of seniors from this past season. Blanks indicate that the same comp would be assigned as above; we just didn’t want to list the same names over and over. The interesting thing about this list of freshmen is how dominant they were, relative to the pace set by the top seniors. Dylan Molloy was the most productive senior of this year’s class, generating 287 expected goals in his 4 years. Both Walker and Michael Sowers are projected to top that mark.
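Finding a comp this way is a nearest-neighbor lookup on a single number: pick the senior whose actual career total is closest to the freshman's projected total. A minimal sketch follows. Only Molloy's 287 comes from the numbers above; the other two seniors are placeholders with invented totals.

```python
def closest_comp(projected_total, senior_totals):
    """Return the senior whose actual career total is nearest the projection."""
    return min(senior_totals, key=lambda name: abs(senior_totals[name] - projected_total))

senior_totals = {
    "Dylan Molloy": 287.0,            # from the post
    "Senior B (placeholder)": 240.0,  # invented for illustration
    "Senior C (placeholder)": 180.0,  # invented for illustration
}

walker_comp = closest_comp(318.25, senior_totals)
print(walker_comp)  # Dylan Molloy
```

Since no senior topped Molloy, any freshman projected above 287 gets Molloy as his comp by default.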
Granted, there are many things that could happen over the next three years. Hopefully no significant injuries to any of these players. You could even make a case that, for some of these guys, improved talent around them may actually limit their production. And that could be a good thing. A star often gets limited while his teammates make hay.
But all in all, it’s helpful to see what the projection model thinks about the current crop of rising sophomores.
Next step is to use this model to give each D1 player a “talent” level
If we want to continue our geography analysis, the next phase is to try and assess the degree to which “talent” affects the likelihood that a player will end up at a school that is close to home. You could imagine that the top players, with offers from the best pedigree’d schools might be more likely to go far from home. And perhaps players not as high on the recruiting radar may end up in their own backyards.
I have no idea if either of those is the case, but until we had a way to understand each player’s “talent” level in an apples to apples way, we couldn’t go there. Now we can.
Now again, there are lots of limitations inherent in that approach. Namely, the fanfare that a player received out of high school may have little to do with his eventual career production. There are tons of stories of late bloomers out there. And in theory, it would be better to use recruiting attention as the proxy for talent level since we care about a decision made before they even suit up. Sadly, that’s not possible.
So for the time being, we’ll assume that the best players were the most sought-after recruits. Suck it up readers, we are in an information-poor sport here…
Down the road, there may be enhancements to this model
As I mentioned at the beginning, this is an admittedly simplistic model. It just doesn’t make a ton of sense to spend weeks perfecting a model to project production when it’s not even the best metric to use in this analysis. All we are really trying to do is frame geography and recruiting in a unique way.
That said, any good data scientist has experienced that nagging feeling that their model’s predictive power could be enhanced with one more variable or a new algorithm. We are no different here at lacrossereference.com. As I was writing up this analysis, a few ideas popped into my head.
We could add play shares as an independent variable. For every player on every team, we have already calculated the total share of plays that each player contributed. This is done by looking at the number of plays a player made divided by the total number of plays for the team.
If a player has an extremely high share of their team’s plays, that could suggest a relatively thin roster. Perhaps if we included that as a variable, we’d be able to refine the model to account for the fact that some freshmen are forced into a high production role whereas others just rise above the rest of their teammates on sheer talent. Might require a shift to a decision-tree model and away from the basic linear regression approach, but it could be interesting.
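The play-share calculation described above (a player's plays divided by his team's total plays) is simple to sketch. The input format here, one `(team, player)` tuple per play, is an assumption for illustration and not how our data is actually stored.

```python
from collections import Counter

def play_shares(plays):
    """plays: list of (team, player) tuples, one entry per play made.
    Returns {(team, player): share of that team's total plays}."""
    team_totals = Counter(team for team, _ in plays)
    player_totals = Counter(plays)
    return {key: count / team_totals[key[0]] for key, count in player_totals.items()}

# Hypothetical play log: one team, two players.
example_plays = [("Team A", "Player 1")] * 3 + [("Team A", "Player 2")]
shares = play_shares(example_plays)
print(shares[("Team A", "Player 1")])  # 0.75
```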
Perhaps it would be worth it to go through the data and try to identify when players missed games or time with injury. If we adjusted our senior player data to account for obvious gaps in their production (e.g. 4 straight games with no plays made), we could conceivably have a more accurate estimate of career production on a per-game basis.
Of course, how does one tell when a player (especially one without a large volume of plays) was injured versus benched for non-production or other reasons? It could be that introducing that adjustment without a way to confirm each case would reduce the model’s accuracy. And there’s no way I’m going through 4 years of articles to identify injured vs benched.
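For what it's worth, the gap-flagging heuristic itself is easy to write down. The sketch below finds runs of consecutive games with zero plays and leaves them out of the per-game denominator. The 4-game threshold comes from the example above; the function names and the sample season are hypothetical. As noted, it has no way of telling an injury from a benching.

```python
def find_gaps(plays_per_game, min_length=4):
    """Return (start_index, length) for each run of consecutive
    zero-play games that is at least min_length games long."""
    gaps = []
    run_start = None
    for i, plays in enumerate(plays_per_game):
        if plays == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_length:
                gaps.append((run_start, i - run_start))
            run_start = None
    # Handle a run that lasts through the final game.
    if run_start is not None and len(plays_per_game) - run_start >= min_length:
        gaps.append((run_start, len(plays_per_game) - run_start))
    return gaps

def per_game_excluding_gaps(total_production, plays_per_game, min_length=4):
    """Per-game production, not counting games inside flagged gaps."""
    missed = sum(length for _, length in find_gaps(plays_per_game, min_length))
    games_played = len(plays_per_game) - missed
    return total_production / games_played if games_played else 0.0

# Hypothetical 9-game stretch with a 4-game absence in the middle.
season = [5, 3, 0, 0, 0, 0, 4, 0, 2]
gaps = find_gaps(season)
adjusted = per_game_excluding_gaps(45.0, season)
```

The lone zero-play game near the end is not flagged, because a single quiet game is more likely a bad day than an absence.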
Bottom line, there are probably a dozen ways that this model could be made more predictive. Of course, “Done is better than perfect,” so for the time being, we’ll keep this model in the “Done” category. Lots more interesting lacrosse analytics concepts to explore in the interim.
Update: we messed up
In the original version of this article, we accidentally included Frank Brown, a recent graduate of Hobart, on our list of top freshmen. The list has been updated and Stony Brook’s Alex Corpolongo has rolled into the 15th slot. Thanks to the LaxPower community for pointing out our mistake.
Oops we did it again
You won’t believe it, but Alex Corpolongo was a senior last year as well! Thanks to another LaxPower reader for pointing out our mistake. I went through and checked everyone else on the list and we should be good to go now.
Appendix: Additional Commentary on the Model
One of the things that stood out in the model calibration step was how much the model weighted goals scored as a freshman when projecting future production. To some degree this makes sense; goals are the highest value thing you can do on a lacrosse field. If we want to project one year of stats to a career of production, we’d expect to see a roughly 3 to 4x increase from expected goals to our model coefficients.
This chart shows, for the play types included in the model, the on-field expected goals value against the weight the model gives each play type in projecting 4-year production. The y-axis is basically a measure of expected goals. In blue, the play values are the actual on-field estimates of how many goals result from each play type. The green bars are how much career production we’d expect to see for each instance of each play type. (I didn’t put numbers on the y-axis because the point here was to show the relative difference.)
To be perfectly clear, there is no reason that the coefficients should necessarily be similar to the expected-goals values. Expected on-field play values are taken from all players in every game and aggregated. They are a backward-looking measure. The coefficients you see in green represent a translation of those expected values into a future production estimate. And the model says that to accurately do that, you need to weight goals more heavily.
Surprisingly, the factor that is weighted far less than the others is ground balls. In other words, the number of ground balls that a freshman picks up is not nearly as predictive of their overall career production as the other inputs. It would be interesting to dig into why that is the case. My gut reaction is that ground balls are inherently fairly random, so a skill in picking up ground balls may not translate from year to year in the same way that scoring goals would.
Another point that deserves a bit more explanation is the feature engineering step of this process. There are 7 counting statistics that we could have used for this model: goals, shots, assists, turnovers, penalties, ground balls, and total plays. The Monte Carlo step of the process was to determine which combination of metrics produced the most accurate models. We ran through thousands of scenarios, randomly shuffling which players were part of the training and test set and randomly selecting combinations of metrics to include. Eventually we had enough aggregate data to identify which metrics were and were not helpful.
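To make that step a little more concrete, here is a rough sketch of one way a random-subset search like this could be set up. It is an illustration, not the code we ran: the players are randomly generated, the fake career totals depend only on goals and shots by construction, and the scoring rule (average test error without a metric minus average test error with it) is my assumption about how to aggregate the scenarios.

```python
import random
from statistics import mean

FEATURES = ["goals", "shots", "assists", "turnovers",
            "penalties", "ground_balls", "total_plays"]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_linear(rows, targets):
    """Least squares via the normal equations: [intercept, coef_1, ...]."""
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def run_trial(players, rng):
    """One scenario: random metric subset, random 80/20 split, test error."""
    subset = [f for f in FEATURES if rng.random() < 0.5] or [rng.choice(FEATURES)]
    shuffled = players[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    train, test = shuffled[:cut], shuffled[cut:]
    coefs = fit_linear([[p[f] for f in subset] for p in train],
                       [p["career"] for p in train])
    errors = [abs(predict(coefs, [p[f] for f in subset]) - p["career"]) for p in test]
    return subset, mean(errors)

def feature_scores(players, n_trials=200, seed=0):
    """Positive score: scenarios including the metric had lower test error on average."""
    rng = random.Random(seed)
    with_f = {f: [] for f in FEATURES}
    without_f = {f: [] for f in FEATURES}
    for _ in range(n_trials):
        subset, err = run_trial(players, rng)
        for f in FEATURES:
            (with_f if f in subset else without_f)[f].append(err)
    return {f: mean(without_f[f]) - mean(with_f[f]) for f in FEATURES}

def make_fake_players(n, seed=1):
    """Invented data: career production depends only on goals and shots."""
    rng = random.Random(seed)
    players = []
    for _ in range(n):
        p = {"goals": rng.randint(0, 40), "assists": rng.randint(0, 30),
             "turnovers": rng.randint(0, 40), "penalties": rng.randint(0, 10),
             "ground_balls": rng.randint(0, 50)}
        p["shots"] = p["goals"] + rng.randint(0, 60)
        p["total_plays"] = sum(p.values()) + rng.randint(0, 30)
        p["career"] = 3.0 * p["goals"] + 0.5 * p["shots"] + rng.gauss(0, 5)
        players.append(p)
    return players

scores = feature_scores(make_fake_players(120))
```

On this fake data, goals should come out with by far the largest score, and metrics that played no part in generating the career totals should hover near zero. Real data is messier, which is why it takes thousands of scenarios to see which metrics help.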
As a result, assists and turnovers were removed from the model. Curiously, the optimal models did include total plays, which I’ve marked as Other. Since the coefficient on that was negative, I tended to think of it as “Turnovers+”. But going through the feature engineering step was important to remove superfluous variables that didn’t improve the predictive power of the model.