Monthly Archives: March, 2020

BA on Balls in Play, Launch Conditions, and Random Effects


If you have been reading my blog over the years, you know that I talk a fair amount about random effects models, where one shrinks measures of performance (such as a batting average) towards a common value. These shrunken estimates are much better predictors of future performance than standard measures that don’t make this adjustment. Also, with the availability of Statcast data, I’ve devoted a lot of space on this blog to the use and interpretation of launch condition measurements, such as launch angle and exit velocity, in predicting positive outcomes like getting a base hit. I recently read an interesting paper on models that combine general smooth functions of fixed effects with random effects. These models are easy to fit using the mgcv package in R. I’ll review some basic models and then show how one can construct and interpret these more sophisticated models.

Estimating Batting Average on Balls in Play from Launch Conditions

I am going to focus on a random sample of 100 players who had at least 300 balls in play during the 2019 season. Since we are focusing only on balls in play, we exclude home runs. We want to estimate the probability p that a ball in play is a hit. We’ll use a generalized additive model of the form

logit(p) = s(LA, LS)

where logit is the function log(p/(1-p)) and s(LA, LS) represents a smooth function of launch angle and launch speed. This is easy to fit using the gam() function in the mgcv package in R. Below I show contours of the fitted probability of a hit as a function of the two launch variables. (Remember I am excluding home runs, but hard-hit balls over 100 mph are still likely to be base hits.) I think this display clearly indicates the sweet spot in terms of LA and LS for obtaining a base hit.
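A minimal sketch of this fit, using simulated data as a stand-in for Statcast balls in play (the column names launch_angle, launch_speed, and hit, and the hit-probability surface, are my own assumptions, not the post’s actual data):

```r
# Sketch of fitting logit(p) = s(LA, LS) with mgcv on simulated data
library(mgcv)

set.seed(123)
n <- 2000
d <- data.frame(launch_angle = runif(n, -40, 50),
                launch_speed = runif(n, 60, 110))
# made-up hit-probability surface peaking near LA = 15, LS = 105
lp <- -2 + 2.5 * exp(-((d$launch_angle - 15) / 12)^2) *
      (d$launch_speed - 60) / 50
d$hit <- rbinom(n, 1, plogis(lp))

# bivariate smooth of launch angle and launch speed, logit link
fit <- gam(hit ~ s(launch_angle, launch_speed),
           family = binomial, data = d)

# fitted probability of a hit for a hard, well-elevated ball in play
p_hat <- predict(fit,
                 newdata = data.frame(launch_angle = 15,
                                      launch_speed = 105),
                 type = "response")
```

Contours of the fitted surface like those in the post can then be drawn with vis.gam(fit, plot.type = "contour").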

A Random Effects Model

A different perspective focuses on estimating the probabilities of a hit (on balls in play) for these 100 batters. One can quickly fit a random effects model where one assumes that, on the logit scale, the hitters’ hit probabilities follow a normal distribution with mean m and standard deviation s. One estimates m and s from the data, and these estimates are used to obtain shrunken estimates of the hit probabilities. Here I graph the estimates (on the probability scale) against the raw BA on balls in play. We notice several things. First, the observed BABIP values are shrunk about 50% towards the common value of 0.300. Second, there are some interesting outliers among my group of 100: Tim Anderson and Bryan Reynolds on the high side and Jurickson Profar on the low side.
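A hedged sketch of this random effects fit, again on simulated data (20 hypothetical batters rather than the 100 in the post), using mgcv’s "re" basis for the random intercepts:

```r
# Random-intercept model for hit probabilities on simulated batters
library(mgcv)

set.seed(1)
n_bat <- 20; n_bip <- 300
# simulate logit-scale hit abilities around a common mean near 0.300
ability <- rnorm(n_bat, mean = qlogis(0.300), sd = 0.15)
d <- data.frame(batter = factor(rep(seq_len(n_bat), each = n_bip)))
d$hit <- rbinom(nrow(d), 1, plogis(ability[as.integer(d$batter)]))

# logit(p_i) = m + b_i, with b_i ~ N(0, s)
fit <- gam(hit ~ s(batter, bs = "re"), family = binomial, data = d)

# shrunken hit-probability estimates, one per batter
est <- plogis(predict(fit,
              newdata = data.frame(batter = factor(seq_len(n_bat)))))
raw <- as.numeric(tapply(d$hit, d$batter, mean))
```

Plotting est against raw shows the shrinkage: the model estimates are less spread out than the raw rates, pulled toward the common mean.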

Understanding Differences in BABIP

Why do batters differ so much in their BABIP (batting average on balls in play) probabilities? Getting a hit on a ball in play depends on a number of variables, such as launch angle, launch speed, spray angle, and batter speed, and batters certainly differ on all of these variables. Unfortunately, all of these variables are confounded when we compare two players’ BABIP probabilities. We don’t know if one player has a higher BABIP probability than another because he hits the ball harder, hits the ball at a better launch angle, or is faster. Wouldn’t it be nice if we could somehow adjust these BABIP probabilities for some of these variables? That would help us better understand why players differ in their BABIP values.

An Adjustment Model for BABIP Data

The good news is that it is straightforward to fit a model that includes both launch condition variables and random effects. This model has the general form

logit(p_ij) = s(LA_ij, LS_ij) + b_i

where p_ij is the probability that the ith player gets a hit on his jth ball in play, LA_ij and LS_ij are the values of launch angle and launch speed for this ball in play, and b_i is a random effect, where the collection of all random effects is assumed to be normally distributed with mean 0 and standard deviation s.
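In mgcv this combined model is just the sum of the two pieces above. A hedged sketch on simulated data (the variable names and simulated effects are again my own placeholders):

```r
# Combined model: launch-condition smooth plus batter random effects
library(mgcv)

set.seed(42)
n_bat <- 15; n_bip <- 200
ability <- rnorm(n_bat, 0, 0.3)        # the random effects b_i
d <- data.frame(
  batter = factor(rep(seq_len(n_bat), each = n_bip)),
  launch_angle = runif(n_bat * n_bip, -40, 50),
  launch_speed = runif(n_bat * n_bip, 60, 110)
)
# made-up launch-condition effect plus batter effect on the logit scale
lp <- -3 + 0.025 * d$launch_speed -
      0.001 * (d$launch_angle - 12)^2 + ability[as.integer(d$batter)]
d$hit <- rbinom(nrow(d), 1, plogis(lp))

# logit(p_ij) = s(LA_ij, LS_ij) + b_i
fit <- gam(hit ~ s(launch_angle, launch_speed) + s(batter, bs = "re"),
           family = binomial, data = d)

# the estimated b_i are the coefficients of the "re" smooth
b_hat <- coef(fit)[grep("batter", names(coef(fit)), fixed = TRUE)]
```

The fitted b_hat values are the launch-condition-adjusted batter effects discussed next.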

How do we compare these two random effects models? In the basic model, the term b_i represents the ability of the player to get a hit on a ball in play. In the new model, b_i represents the ability of the player to get a hit after adjusting for launch speed and launch angle. So if we compare two players using the new random effect estimates, these estimates already control for launch angle and launch speed, and any differences are due solely to other variables such as speed or spray angle.

Comparing the Two Random Effects Models

For each of the 100 players in my sample, I have two random effect estimates: one for the Constant model, which has only random effects, and one for the Regression model, which adds the term for the launch conditions. I’ve constructed a scatterplot of the two sets of estimates below. As one might expect, there is a positive association in the graph, but there is a good amount of scatter, indicating that the rankings of players differ between the two measures. For example, DJ LeMahieu has a high random effect for the Constant model and an average random effect for the Regression model. That indicates that LeMahieu’s good BABIP performance is attributable to his good launch speed and launch angle measurements. In contrast, Mallex Smith has an average random effect for the Constant model and a high random effect for the Regression model. This suggests that Smith’s strength in BABIP is due to other variables like speed and spray angle (placement of balls in play).
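The comparison can be reproduced in miniature by fitting both models to the same data and plotting the two sets of estimated effects. This is a sketch on simulated batters; in the post, the points would be the 100 sampled players:

```r
# Compare random effects from the Constant and Regression models
library(mgcv)

set.seed(7)
n_bat <- 15; n_bip <- 200
ability <- rnorm(n_bat, 0, 0.3)
d <- data.frame(
  batter = factor(rep(seq_len(n_bat), each = n_bip)),
  launch_angle = runif(n_bat * n_bip, -40, 50),
  launch_speed = runif(n_bat * n_bip, 60, 110)
)
lp <- -3.5 + 0.03 * d$launch_speed + ability[as.integer(d$batter)]
d$hit <- rbinom(nrow(d), 1, plogis(lp))

# Constant model: random effects only
fit_c <- gam(hit ~ s(batter, bs = "re"), family = binomial, data = d)
# Regression model: launch-condition smooth plus random effects
fit_r <- gam(hit ~ s(launch_angle, launch_speed) + s(batter, bs = "re"),
             family = binomial, data = d)

b_c <- coef(fit_c)[grep("batter", names(coef(fit_c)), fixed = TRUE)]
b_r <- coef(fit_r)[grep("batter", names(coef(fit_r)), fixed = TRUE)]
plot(b_c, b_r, xlab = "Constant model random effect",
     ylab = "Regression model random effect")
abline(0, 1, lty = 2)
```

Points far from the 45-degree line are the interesting cases: players whose raw BABIP effect is mostly explained by (or not explained by) their launch conditions.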

Summing Up

  • Here is a link to the paper on hierarchical generalized additive models that got me interested in this particular analysis, specifically the ease of fitting these models using the mgcv package.
  • On my Github Gist site, you’ll find all of the R code that I used for this analysis.
  • This is preliminary work, but hopefully it will encourage others to try these models.