Generating Hit Probabilities from Statcast Data

Introduction

Statcast introduced hit probabilities this season. They address the question: “Based on the exit velocity and launch angle of the batted ball, how likely is a ball to land for a hit?” In this post, I’ll explain how one can use Statcast data together with a generalized additive model (GAM for short) in R to estimate these hit probabilities. I’ll produce a graph showing the “hit zone” as a function of launch angle and exit velocity, and later discuss a simple way of assessing the goodness of fit of the fitted model.

The Data

Using Bill Petti’s baseballr package, it was straightforward to download Statcast data for all hitters for all pitches in the 2017 season. The result is a data frame with over 735,000 rows and 79 variables, including the outcome of each pitch and the exit velocity and launch angle for the balls put into play. I consider only the batted balls, which reduces the dataset to over 129,000 rows.
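The filtering step can be sketched on a toy data frame. This is a minimal illustration, not my actual script: the rows are made up, and the column names (type, events, launch_angle, launch_speed) and the "X" code for a ball in play are assumptions about how a Statcast download is laid out.

```r
# Toy rows shaped like a pitch-level Statcast download (values are made up)
pitches <- data.frame(
  type         = c("B", "S", "X", "X", "X", "S", "X"),
  events       = c(NA, NA, "single", "field_out", "home_run", NA, "double_play"),
  launch_angle = c(NA, NA, 12, 35, 28, NA, -8),
  launch_speed = c(NA, NA, 95.2, 88.1, 107.4, NA, 91.0),
  stringsAsFactors = FALSE
)

# Keep only balls put into play that have both measurements recorded
in_play <- subset(pitches,
                  type == "X" & !is.na(launch_angle) & !is.na(launch_speed))

# Code the outcome as 1 for a base hit and 0 for anything else
hit_events <- c("single", "double", "triple", "home_run")
in_play$H <- as.integer(in_play$events %in% hit_events)

print(in_play)
```

The 0/1 column H is the response that a hit-probability model needs.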

An Exploratory Graph

To get some sense of how launch angle and exit velocity are related to hits and outs, I graph data for a sample of 2000 balls in play. The black dots correspond to outs and the blue dots correspond to hits. Generally we see that balls hit at higher velocities tend to be hits. Also launch angles between 0 and 25 degrees seem desirable. But the relationship between launch angle and exit velocity for the hits is not completely clear. This motivates considering a nonparametric regression model like a GAM, which allows the probability of a hit to be a smooth function of launch angle and exit velocity.

The GAM Modeling

I divided the data frame of 129,000 balls in play into two groups — I will fit a model to 60,000 observations and then I will test the model on the remaining data.  Let p denote the probability that a batted ball falls as a hit.  The specific generalized additive model says that the logarithm of the odds (p divided by 1 – p) is a smooth function (called s) of the two inputs — that is

log(p / (1 – p)) = s(launch_angle, exit_velocity)
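As a small numerical aside, the log odds scale can be explored directly in base R, where qlogis() maps a probability to its log odds and plogis() maps back. The probabilities below are arbitrary illustrative values, not model output.

```r
# Arbitrary example probabilities (not taken from the fitted model)
p <- c(0.10, 0.33, 0.58, 0.90)

# Log odds computed by hand and with the built-in logit function
log_odds_manual <- log(p / (1 - p))
log_odds_builtin <- qlogis(p)

# plogis() inverts the transformation and recovers the probabilities
p_back <- plogis(log_odds_builtin)

print(data.frame(p, log_odds = log_odds_builtin, p_back))
```

Because the smooth function s lives on the log odds scale, it can take any real value while the implied probability stays between 0 and 1.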

This model is easy to fit using the gam function in the mgcv package. (It is no harder to fit this using gam than using the popular function lm in fitting a regression model.) I could say more about the GAM algorithm, but I’ll just say that it is a popular method of fitting a regression model when one is unsure how the input variables influence the response variable (it allows for complicated relationships between the inputs, such as the interesting relationship between launch angle and exit velocity seen above).

Once we have fit this model, we can use it for prediction — for example, if the launch angle is 20 degrees and the exit velocity is 100 mph, what’s the chance of a hit?   I computed the estimated probability in this case using this model to be 0.58.
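Here is a minimal sketch of the fit and the prediction step. It runs on simulated data, so the numbers it prints will not match the 0.58 quoted above. The simulated outcome model, the column name H for the 0/1 hit indicator, and the ranges of the inputs are all assumptions made for illustration. The same predict() call also works over a whole grid of inputs.

```r
library(mgcv)  # gam() and s() come from mgcv

set.seed(2017)

# Synthetic stand-in for the batted-ball data (made up, NOT Statcast values)
n <- 5000
bb <- data.frame(
  launch_angle  = runif(n, -40, 60),   # degrees
  exit_velocity = runif(n, 50, 115)    # mph
)
# Invented "true" hit probability, used only to generate 0/1 outcomes
eta <- 0.8 + 0.04 * (bb$exit_velocity - 88) - 0.002 * (bb$launch_angle - 15)^2
bb$H <- rbinom(n, 1, plogis(eta))

# Logistic GAM: the log odds of a hit is a smooth surface in the two inputs
fit <- gam(H ~ s(launch_angle, exit_velocity), family = binomial, data = bb)

# Estimated hit probability at a single (launch angle, exit velocity) pair
new_ball <- data.frame(launch_angle = 20, exit_velocity = 100)
p_single <- as.numeric(predict(fit, newdata = new_ball, type = "response"))
print(p_single)

# The same call over a 50 by 50 grid of inputs
angles <- seq(-40, 60, length.out = 50)
speeds <- seq(50, 115, length.out = 50)
grid <- expand.grid(launch_angle = angles, exit_velocity = speeds)
grid$p_hit <- as.numeric(predict(fit, newdata = grid, type = "response"))

# expand.grid varies launch_angle fastest, so rows index angle, columns speed
z <- matrix(grid$p_hit, nrow = length(angles))
contour(angles, speeds, z,
        xlab = "Launch angle (degrees)", ylab = "Exit velocity (mph)")
```

Setting type = "response" asks predict() for probabilities rather than values on the log odds scale.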

To understand the pattern of predictions using this method, we set up a 50 by 50 grid of values of (launch angle, exit velocity) and estimate the probabilities of a hit over this grid. Here is a contour graph of the predictions: the hot (yellow) area corresponds to the region where the estimated hit probability is high. I’ve added a black vertical line; the region to the left of the line corresponds to groundballs. There are two big sweet spots. The region where the launch speed exceeds 110 mph corresponds (primarily) to home runs, and there is a second sweet spot around 80 mph and 12-20 degrees corresponding to singles.

Is this a Reasonable Method?

I suspect that Statcast uses a different fitting algorithm for estimating their hit probabilities.  That’s fine — we have a lot of data and I suspect that there are a number of algorithms that will give similar results.  But how can we assess the goodness of fit of our fitted model?  A simple way is to use this fitted model to predict the outcome (hit or out) on the test data (the data that was not used to fit the model) and compute the proportion of successful predictions.

Here are some illustrations using this criterion.

1. Suppose we predict that every batted ball will be an out. (This is a dumb method, but it is a simple prediction method.) Then the proportion of successful predictions will be 0.67, the fraction of batted balls that are outs. Certainly we can improve over this prediction method.
2. Instead suppose we predict that batted balls hit over 100 mph will be hits, and batted balls hit under 100 mph will be outs. The proportion of successful predictions is 0.724, better than the “all batted balls are outs” method.
3. How about our gam method (illustrated using the contour graph)?  We compute that the rate of successful predictions is 0.801.  I don’t know the Statcast algorithm for determining the probability of a hit (given the launch angle and exit velocity), but I would think that their algorithm would have a similar correct classification rate.
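The three prediction rules can be compared with a few lines of R. This is a sketch on simulated data, so the rates it prints will differ from the 0.67, 0.724, and 0.801 reported above. The simulated outcome model, the even train/test split, and the choice to call a batted ball a hit when its fitted probability exceeds 0.5 are assumptions made for illustration.

```r
library(mgcv)

set.seed(42)

# Synthetic batted balls (made up, NOT Statcast values)
n <- 6000
bb <- data.frame(
  launch_angle  = runif(n, -40, 60),
  exit_velocity = runif(n, 50, 115)
)
eta <- 0.8 + 0.04 * (bb$exit_velocity - 88) - 0.002 * (bb$launch_angle - 15)^2
bb$H <- rbinom(n, 1, plogis(eta))

# Fit on one random half, evaluate on the held-out half
train_rows <- sample(n, n / 2)
train <- bb[train_rows, ]
test  <- bb[-train_rows, ]

fit <- gam(H ~ s(launch_angle, exit_velocity), family = binomial, data = train)

# Rule 1: every batted ball is an out
pred_all_out <- rep(0, nrow(test))
# Rule 2: a hit if and only if the exit velocity is over 100 mph
pred_cutoff <- as.numeric(test$exit_velocity > 100)
# Rule 3: a hit if the GAM's estimated probability exceeds 0.5 (assumed cutoff)
p_hat <- as.numeric(predict(fit, newdata = test, type = "response"))
pred_gam <- as.numeric(p_hat > 0.5)

# Proportion of correct predictions on the test data for each rule
rates <- c(
  all_out    = mean(pred_all_out == test$H),
  cutoff_100 = mean(pred_cutoff == test$H),
  gam        = mean(pred_gam == test$H)
)
print(round(rates, 3))
```

Evaluating on rows that were not used in the fit keeps the comparison from rewarding a model that simply memorizes the training data.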

Going Further

As usual, I have posted all of my R code on my GitHub gist site.

Although I have a basic understanding of the importance of launch angle and exit velocity in terms of hits and outs, there are many questions going forward — I’ll just list a few questions that may motivate a good study by the interested reader.

• We have focused on averages, but certainly hitters have different talents for getting base hits.  For example, what is the impact of launch angle and exit velocity for a hitter who specializes in getting infield hits or putting the batted ball outside the range of the defense?  What is the launch angle and exit velocity effect for a player who is especially good in getting on base?
• How does a team’s defense impact the relationship of launch angle and exit velocity with base hits?
• What are other relevant inputs? Given the current prevalence of defensive shifts, teams must think that batters’ tendencies in the direction of the batted ball are important. What are the pitcher effects?