
Predicting AVG using xBA and a Multilevel Model


I’ve been exploring batting data for the 2017 season, trying to better understand the advantages that the Astros hitters may have had by stealing signs. In the process, I thought it might be better to work with expected batting average (xBA) rather than batting average (BA), since xBA is more stable and more reflective of a player’s hitting ability. That brought up an interesting side question: how much better is xBA than BA in prediction? And how does xBA compare with my favorite statistical method, multilevel modeling, in predicting future BA? This post is devoted to exploring this side question.

The Prediction Problem

Let’s pose the prediction problem. Suppose we divide all of the ball-in-play events in a particular season randomly into two equal parts; we’ll call one part the Training data and the other part the Test data. We are interested in using players’ in-play batting averages in the Training dataset to predict their in-play batting averages in the Test dataset. (We’ll focus on non-pitchers who have a reasonable number of balls in play.) We describe several ways of making these predictions.

Using the In-Play BA

The obvious thing to do is to use a player’s BA on balls in play in the Training data to predict his BA in the future Test data. The problem is that this in-play BA is a type of batting measure that has a large “luck” component and it really does not do a good job in predicting future performance. Okay, the current BA is crummy in prediction — can we do better?

Using the Expected BA

Recently there has been a lot of attention to using launch variables such as launch velocity and launch angle to develop a better measure of batting ability. Suppose we use our Training data to fit a model of the form

Prob(Hit) = s(LS, LA)

where s() is a smooth function of the launch velocity and launch angle. If we do this, we obtain an estimate of a batter’s hit probability for any values of the launch variables. If one averages these hit probabilities over all batted balls for a given player, one gets the Expected BA (xBA). Since the launch variable measurements are believed to be more reflective of ability than chance, one believes that these xBA values will be better than the raw BA values at predicting future BA performance in the Test data.
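My actual fit uses a GAM in R (the mgcv package, described later). As a rough stand-in, here is a Python sketch of the same idea, where the smooth s(LS, LA) is replaced by a crude binned lookup of empirical hit rates; the bin size and the fallback to the league-wide rate for empty cells are my own assumptions, not the GAM's behavior:

```python
from collections import defaultdict

def fit_hit_prob(train, bin_size=5):
    """Crude stand-in for the smooth s(LS, LA): bin launch speed and
    launch angle into cells and record the empirical hit rate per cell.
    train: list of (launch_speed, launch_angle, hit) with hit in {0, 1}."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [hits, balls in play]
    for ls, la, hit in train:
        cell = (int(ls // bin_size), int(la // bin_size))
        counts[cell][0] += hit
        counts[cell][1] += 1
    league = (sum(h for h, _ in counts.values()) /
              sum(n for _, n in counts.values()))

    def prob(ls, la):
        h, n = counts.get((int(ls // bin_size), int(la // bin_size)), (0, 0))
        return h / n if n else league  # fall back to league rate off-grid
    return prob

def expected_ba(prob, batted_balls):
    """xBA: average the hit probabilities over a player's batted balls."""
    return sum(prob(ls, la) for ls, la in batted_balls) / len(batted_balls)
```

A real GAM borrows strength across neighboring (LS, LA) cells, which is exactly what this lookup table does not do; the sketch only shows the structure of the computation.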

On a side note, I think that Expected BA statistics are helpful in many contexts. For example, here is a graph showing how the xBA measure depends on the count (2017 season data).

Using a Multilevel Model

The use of Expected BA is attractive since it uses additional information, namely the launch velocity and launch angle measurements. But there is a simpler method that focuses on the simultaneous estimation of a set of batting abilities. We ignore the launch variables and just fit an exchangeable model to the set of hitting probabilities. The number of hits in play y is assumed to follow a binomial distribution with probability of hit p. The set of hitting probabilities p1, …, pN is assumed to follow a common beta curve with shape parameters a and b, and the beta parameters a and b are assigned weakly informative prior distributions. Given hitting data for a group of N players, this model is quick to fit, and it provides estimates of the batting probabilities that can be used to predict BA in the future Test data.
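My fit is done with an R package (mentioned at the end of the post). To show the structure of the model, here is a minimal Python sketch that replaces the full Bayesian fit with a crude integer grid search for the maximum marginal likelihood of the beta-binomial model; the grid ranges are my assumptions:

```python
import math

def betabinom_logml(a, b, data):
    """Marginal log-likelihood (up to a constant) of the exchangeable model:
    y_i ~ Binomial(n_i, p_i), p_i ~ Beta(a, b).  data: list of (y, n)."""
    lg = math.lgamma
    total = (lg(a + b) - lg(a) - lg(b)) * len(data)
    for y, n in data:
        total += lg(y + a) + lg(n - y + b) - lg(n + a + b)
    return total

def fit_multilevel(data):
    """Crude grid search over integer (a, b) -- a stand-in for a proper fit
    with weakly informative priors on a and b."""
    a, b = max(((a, b) for a in range(1, 151) for b in range(1, 301)),
               key=lambda ab: betabinom_logml(ab[0], ab[1], data))
    # Posterior-mean estimates: raw rates y/n shrunk toward a / (a + b)
    return [(y + a) / (n + a + b) for y, n in data]
```

The returned estimates are the multilevel predictions: each player's raw in-play rate pulled toward the group average, with the amount of pull determined by the fitted a + b.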

A Prediction Contest — Which Method Does Best?

Okay, here is a simple experiment to show how these three prediction methods perform. I’m using 2017 data, although this procedure can be applied for any season data.

  1. In 2017, we have 126,625 batted ball plays excluding sacrifices. I randomly divide these plays into a Training dataset of 63,312 batted ball plays and a Test dataset of 63,313 plays. By randomly dividing the balls in play (BIP), I am avoiding any effect due to the month of the season.
  2. I use a fitted GAM model (binomial sampling, logit link) to compute the estimated probability of a hit for all values of the launch variables for the Training dataset.
  3. I focus only on the players who have at least 100 balls in play in both the Training and Test datasets — there are 276 players in this group.
  4. For each player using the Training data, I have three different predictions of his BA performance in the Test dataset: (1) the Observed BA, (2) the expected BA (xBA), and (3) the estimate of his hitting probability from the multilevel model.
  5. I evaluate the prediction performance by computing the sum of squared prediction errors — the lowest value is best.
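The scoring in step 5 can be sketched as follows. The per-player numbers below are made up purely to illustrate the computation; they are not the actual 2017 results:

```python
def sum_sq_error(pred, actual):
    """Sum of squared prediction errors over players (step 5); lower is better."""
    return sum((p, a) == () or (p - a) ** 2 for p, a in zip(pred, actual))

# Hypothetical per-player test BAs and the three competing predictions
test_ba = [0.310, 0.295, 0.340]
methods = {
    "BA":         [0.350, 0.250, 0.390],   # raw training BA (noisy)
    "xBA":        [0.325, 0.280, 0.360],   # from the GAM hit probabilities
    "multilevel": [0.315, 0.290, 0.345],   # shrunken beta-binomial estimates
}
ranked = sorted(methods, key=lambda m: sum_sq_error(methods[m], test_ba))
```

With real data, each method contributes one prediction per qualifying player, and the ranking is read off the three error sums.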

Here are the results of my prediction contest. We see that the multilevel predictions are best, followed by xBA and BA.

What Have We Learned?

If you try to replicate my experiment you will get slightly different results since I am randomly dividing the 2017 data into Training and Test parts. (Actually I provide the random number seed so you can replicate this work.) Even if you don’t use my random number seed, you’ll discover the following main results:

  • The poorest estimate of future BA is the observed BA, which is not surprising. One really can’t take a batting average very seriously when the goal is predicting future performance.
  • Both the Expected BA and Multilevel Model estimates give better predictions than the naive BA estimate, but the multilevel estimates are clearly the best.
  • This is a bit surprising since no information about launch conditions is used in the multilevel model estimates. Basically, the multilevel model estimates shrink the raw BAs towards an overall average, where the degree of shrinkage depends on the luck/skill characteristics of the batting measure. (For a batting average, you’ll get a relatively large degree of shrinkage since BA has a large luck component.)
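The shrinkage point can be made concrete with toy numbers. Under the beta-binomial model, the estimate (y + a)/(n + a + b) puts weight n/(n + a + b) on the raw rate y/n, so a larger fitted a + b (a high-luck measure such as BA) means heavier shrinkage toward the group mean:

```python
def shrink(y, n, a, b):
    """Beta-binomial posterior mean: pulls the raw rate y/n toward the
    group mean a/(a+b); the weight on the raw rate is n/(n + a + b)."""
    return (y + a) / (n + a + b)

# Toy numbers: raw rate 0.400 against a group mean of 1/3
raw   = 40 / 100
heavy = shrink(40, 100, 100, 200)  # a + b = 300 >> n: strong shrinkage
light = shrink(40, 100, 10, 20)    # a + b = 30  << n: mild shrinkage
```

Here `heavy` lands much closer to 1/3 than `light` does, even though both start from the same raw rate and the same group mean.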

Some Comments

  1. Value of Expected BA. I don’t want to sound too critical of expected BA. Researchers have understood that BAs are pretty unstable, and using additional information in the launch conditions will lead to better predictions of BA.
  2. Multilevel Modeling. I hope this simple exercise will help to convince the reader of the value of multilevel modeling. This is an intelligent way of combining information from different sources and is especially relevant for baseball data. It is difficult to understand patterns when one has a limited amount of data for individual players and one has to combine data for many players to understand the underlying behavior of baseball measures. Here is one of my posts where I describe multilevel modeling to estimate batting abilities for Efron and Morris’ famous example.
  3. Can We Do Better? Actually, we can. I think a better method (than the ones proposed above) would be to first model the probability of a hit as a function of the launch variables (a regression model), and then use multilevel modeling to reflect the belief in similarity of the individual regression models. Using this, I think we could even get better predictions of future BA.
  4. Got Code? On my Github Gist site, I show R Markdown code for implementing this prediction exercise. I use Statcast data for the 2017 season, the mgcv package is used for doing the GAM fitting, and the multilevel model fit is done using my BApredict package.