R code for Probability of Hit Given Three Variables
Back in a January post, I shared some graphs showing how the probability of a hit depends on launch angle, exit velocity, and spray angle. Recently a reader asked about the R code that produced these graphs, so I will provide the code here and give an outline of the R work.
- First, I read in the Statcast data from the 2017 season (I had earlier scraped this data using Bill Petti’s baseballr package). I only consider balls put into play, and I define the binary variable Hit to be equal to 0 or 1.
- As before, adjusting for the batting side, I define an adjusted spray angle phi1 that is equal to the negative of the spray angle (in degrees) if the batter is left-handed; otherwise phi1 is the spray angle. So a negative adjusted spray angle corresponds to a batted ball that is pulled, and a positive adjusted spray angle is a batted ball hit to the opposite side.
- Now I fit a generalized additive model of the form
logit(prob(Hit)) = s(launch angle, exit velocity, adjusted spray angle)
where s() is a smooth function of the three variables.
- Fitting this model takes a little while on my laptop. Once it is done, I can predict the probability of a hit for any values of the three variables. For example, what is the chance that a ball hit at 90 mph with a launch angle of 10 degrees, directly towards second base, is a hit?
predict(fit, data.frame(launch_speed = 90, launch_angle = 10, phi1 = 0))
1.55029
You might ask how the probability can be 1.55029. Actually, this is only the predicted logit. If you back-transform with the inverse logit, you get a predicted probability of 0.825, so this batted ball is likely to be a hit.
- Now I construct a graph to show the effects of the three variables: I set up a grid of values of launch angle, launch speed, and adjusted spray angle, and compute the predictions on that grid. Here is one of the graphs I constructed.
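The steps above can be sketched in a few lines of R. This is only a minimal illustration, not the code from my Gist: it uses simulated data in place of the Statcast file, the column names spray_angle and stand and the grid values are assumptions made for the example, and the model is fit with gam() from the mgcv package using its default settings.

```r
library(mgcv)  # provides gam() and s()

# Simulated stand-in for the 2017 balls in play (NOT real Statcast data).
set.seed(1)
n <- 1500
sc <- data.frame(
  launch_angle = rnorm(n, 12, 25),
  launch_speed = rnorm(n, 88, 12),
  spray_angle  = runif(n, -45, 45),                 # assumed column name
  stand        = sample(c("L", "R"), n, replace = TRUE)
)

# Made-up hit probabilities so the model has some signal to find.
true_logit <- -1 + 0.04 * (sc$launch_speed - 88) -
  0.002 * (sc$launch_angle - 12)^2
sc$Hit <- rbinom(n, size = 1, prob = plogis(true_logit))

# Adjusted spray angle: flip the sign for left-handed batters.
sc$phi1 <- ifelse(sc$stand == "L", -sc$spray_angle, sc$spray_angle)

# Logistic GAM with a single smooth of all three variables.
fit <- gam(Hit ~ s(launch_angle, launch_speed, phi1),
           family = binomial, data = sc)

# predict() returns the logit by default; plogis() is the inverse logit.
new_ball <- data.frame(launch_speed = 90, launch_angle = 10, phi1 = 0)
logit_hat <- predict(fit, new_ball)
prob_from_logit <- plogis(logit_hat)
prob_direct <- predict(fit, new_ball, type = "response")

# Predicted hit probabilities over a grid of the three variables.
grid <- expand.grid(launch_angle = seq(-20, 50, by = 10),
                    launch_speed = seq(70, 110, by = 10),
                    phi1 = c(-30, 0, 30))
grid$prob <- as.numeric(predict(fit, grid, type = "response"))
```

Asking predict() for type = "response" gives the probability directly, which is the same as applying plogis() to the default logit prediction. The grid data frame can then be passed to whatever plotting function you prefer.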
Obviously there are many uses for these predicted probabilities. For example, I’d be interested in exploring how different teams are reducing (or maybe increasing) these hit probabilities through their defensive alignments. That is, we’d look at the residuals from these predictions and see which variables (like defense) contribute to the variation in these residuals.
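As a rough sketch of that idea, one could take the raw residual (the observed Hit minus the predicted probability) and average it by fielding team. Everything in this example is hypothetical: the data values are invented, and the column names p_hat and field_team are placeholders rather than Statcast variables.

```r
# Hypothetical balls in play: outcome, model-predicted hit probability,
# and the team in the field (all values invented for illustration).
bip <- data.frame(
  Hit        = c(1, 0, 0, 1, 0, 1, 0, 0),
  p_hat      = c(0.80, 0.30, 0.10, 0.55, 0.60, 0.20, 0.05, 0.40),
  field_team = c("A", "A", "A", "A", "B", "B", "B", "B")
)

# Raw residual: observed outcome minus predicted probability.
bip$resid <- bip$Hit - bip$p_hat

# Average residual for each fielding team.
team_resid <- aggregate(resid ~ field_team, data = bip, FUN = mean)
team_resid
```

A negative average residual means the team allowed fewer hits than the model predicted from the batted-ball measurements alone, which is the kind of pattern a defensive effect might produce.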
But for now, I wanted to provide the details of the R work, which can be found on my GitHub Gist site.