There is much interest in batter-pitcher matchups. The media likes to report batters who have unusually good or poor batting averages against specific pitchers. That raises the obvious questions:
- Is there any information in these batting averages? Many of these batting averages are based on small numbers of at-bats, so maybe we are just reporting interesting noise? This post says to be careful about interpreting this type of data.
- If there is some information in these batter/pitcher splits, how can we report these splits in some meaningful way?
In this post we use R and Retrosheet data to find the batting averages of all hitters who faced a particular pitcher of interest. I will focus on the hitters who faced Jamie Moyer, who had a long pitching career. We will see that these breakdowns are hard to interpret because of the different sample sizes, so we apply some model smoothing to get reasonable estimates of the true batting averages.
Collecting the Retrosheet Data
I first collect Retrosheet play-by-play data for all seasons from 1960 through 2013. After downloading this data, I store it in three R data frames: the “pbp.60.79” data frame contains play-by-play for the 1960 through 1979 seasons, a second, “pbp.80.99”, contains data for the 1980 through 1999 seasons, and a third, “pbp.00.13”, contains data for the 2000 through 2013 seasons. I save these data frames in three workspaces and load them into R before using the function below. (By the way, 54 years of Retrosheet data represent about 8.9 million lines of plays and 99 variables, but R seems to handle that amount of data fine.) Currently, these R workspaces can be downloaded from my web server.
load(url("http://bayes.bgsu.edu/baseball/pbp.1960.1979.Rdata"))
load(url("http://bayes.bgsu.edu/baseball/pbp.1980.1999.Rdata"))
load(url("http://bayes.bgsu.edu/baseball/pbp.2000.2013.Rdata"))
The Raw Batting Averages
Suppose we collect (H, AB) for all batters who faced a given pitcher and plot the batting averages H/AB against AB. A typical example of this graph is given below: we plot the batting averages for all 1436 batters who faced Jamie Moyer, using the number of hits as the plotting symbol.
Clearly, this graph is hard to interpret. There is much variability in the batting averages for players with small numbers of at-bats. There were players who had 1.000 and .000 batting averages against Moyer, but certainly these weren’t the best or worst hitters against Jamie. We suspect that the best hitters are the ones who faced Moyer a number of times, but this cloud of chance variability makes it difficult to identify them.
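As a rough illustration of how the (H, AB) counts in this section could be tallied, here is a minimal sketch on a tiny made-up data frame. The column names PIT_ID, AB_FL, and H_FL are assumptions about how the play-by-play variables are named (only BAT_ID appears in the output later in this post), the IDs and values are invented, and matchup_counts is a hypothetical helper, not the function from the gist.

```r
# Toy stand-in for the play-by-play data. The column names PIT_ID, AB_FL and
# H_FL are assumed, and every value below is made up for illustration.
pbp <- data.frame(
  PIT_ID = c("pitA", "pitA", "pitA", "pitA", "pitB", "pitA", "pitA"),
  BAT_ID = c("bat1", "bat1", "bat2", "bat1", "bat1", "bat2", "bat2"),
  AB_FL  = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),  # official at-bat?
  H_FL   = c(1, 0, 0, 0, 2, 4, 0),                        # 0 = no hit, 1-4 = hit type
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per batter with AB, H and the raw average
matchup_counts <- function(pbp, pitcher_id) {
  # Keep only official at-bats against the chosen pitcher
  d <- pbp[pbp$PIT_ID == pitcher_id & pbp$AB_FL, ]
  d$AB <- 1L                          # each remaining row is one at-bat
  d$H  <- as.integer(d$H_FL > 0)      # 1 if the at-bat ended in a hit
  out <- aggregate(cbind(AB, H) ~ BAT_ID, data = d, FUN = sum)
  out$AVG <- out$H / out$AB
  out
}

res <- matchup_counts(pbp, "pitA")
res
```

With real data, the same idea would be applied to the three play-by-play data frames in turn and the counts summed by batter.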
Smoothing the Batting Averages
Fortunately, there is a straightforward way to make sense of this data. For the 1436 batters, we imagine that there are underlying hitting probabilities $p_1, \ldots, p_{1436}$, and we’d like to estimate all of these probabilities. A reasonable model (a so-called random effects model) says that these probabilities come from a beta curve, and we estimate the parameters of this curve from the data.
When we fit this model, we get estimates that we call $\hat\eta$ and $\hat K$. The estimate $\hat\eta$ is a typical batting average, and $\hat K$ is an indication of the spread of the probabilities, with smaller values indicating more spread.
One good feature of this approach is that it gives us smoothed estimates of the hitting probabilities. A good estimate of the hitting probability for a particular player with $H$ hits in $AB$ at-bats is given by
$$\hat p = \frac{H + \hat K \hat\eta}{AB + \hat K}.$$
For example, Chris Gomez was 3 for 30 against Jamie Moyer in his career for a 3/30 = .100 batting average. From the Moyer data, we estimate $\hat\eta = 0.266$ and $\hat K = 194.5$. So I estimate Gomez’s true hitting probability against Moyer to be
$$\frac{3 + 194.5 \times 0.266}{30 + 194.5} = 0.244.$$
This estimate smooths the raw average towards the overall Moyer batting average of 0.266.
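The two steps described above, smoothing a raw average and estimating the two parameters, can be sketched in a few lines of R. The smoothing formula (H + K η) / (AB + K) reproduces the Gomez estimate in the output table further down. The fitting part is only one plausible approach: it assumes a beta curve with shape parameters Kη and K(1 − η), maximizes the beta-binomial likelihood with optim, and runs on simulated data rather than Retrosheet data. It is not necessarily how batter.matchup.ggplot fits the model, and smooth_avg and neg_loglik are hypothetical names.

```r
# Smoothed estimate (H + K * eta) / (AB + K); smooth_avg is a hypothetical name
smooth_avg <- function(H, AB, eta, K) (H + K * eta) / (AB + K)

# Chris Gomez (3 for 30), using the eta and K values printed for Moyer
gomez <- smooth_avg(H = 3, AB = 30, eta = 0.265939, K = 194.454480)
round(gomez, 4)

# --- Estimating eta and K: simulated data, NOT Retrosheet data ---
set.seed(1)
n_bat <- 500
AB <- sample(5:100, n_bat, replace = TRUE)
p  <- rbeta(n_bat, 200 * 0.27, 200 * (1 - 0.27))  # true eta = 0.27, true K = 200
H  <- rbinom(n_bat, size = AB, prob = p)

# Negative beta-binomial log-likelihood in theta = (logit(eta), log(K)).
# Binomial coefficients are dropped because they do not involve eta or K.
neg_loglik <- function(theta) {
  eta <- plogis(theta[1])
  K   <- exp(theta[2])
  -sum(lbeta(K * eta + H, K * (1 - eta) + AB - H) -
       lbeta(K * eta, K * (1 - eta)))
}

fit     <- optim(c(qlogis(0.25), log(100)), neg_loglik)
eta_hat <- plogis(fit$par[1])
K_hat   <- exp(fit$par[2])

# Smoothed estimates for every simulated batter
est <- smooth_avg(H, AB, eta_hat, K_hat)
c(eta = eta_hat, K = K_hat)
```

Working on the logit and log scales keeps η between 0 and 1 and K positive without needing a constrained optimizer.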
Assuming you have all of the Retrosheet data available in your R workspace, the function batter.matchup.ggplot will (1) find the (H, AB) data for all batters who faced a specific pitcher, (2) fit the random effects model and give you the estimates of $\eta$ and $K$, and (3) compute and graph the smoothed estimates. The function can be found at my gist site here. I illustrate this function for Jamie Moyer. (Actually, when you source this gist, you will load all of the Retrosheet data into the workspace and then read in the function. After the function is sourced, you can apply batter.matchup.ggplot to different pitchers who played in the seasons 1960 through 2013.)
library(devtools)
source_gist("2b70b09a447a3d06188b")
B1 <- batter.matchup.ggplot("Jamie Moyer")
Note that these smoothed estimates adjust the batting averages for the hitters with small values of AB heavily towards the average. In contrast, the smoothing effect is smaller for the players with a large number of AB.
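One way to see this is to rewrite the smoothed estimate as a weighted average of the raw batting average and the overall average. This is a short algebraic rearrangement of the estimate (H + K η) / (AB + K), written with the fitted values $\hat\eta$ and $\hat K$:

```latex
\[
\frac{H + \hat K \hat\eta}{AB + \hat K}
  = \frac{AB}{AB + \hat K}\cdot\frac{H}{AB}
  + \frac{\hat K}{AB + \hat K}\cdot\hat\eta .
\]
```

With $\hat K \approx 194.5$ for Moyer, a hitter with 30 AB gets a weight of only 30/224.5 ≈ 0.13 on his raw average, while a hitter with 87 AB gets a weight of 87/281.5 ≈ 0.31.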
The output gives the estimates of $\eta$ and $K$ and ranks the hitters who had the largest (and smallest) probability estimates against Moyer. The data frame B1 contains the H, AB, and improved estimates for all hitters. The top and bottom 2 hitters are labeled in the graph.
###       eta          K
###  0.265939 194.454480
###
###    BAT_ID AB  H      Batter.Name  Estimate Num
###1 kotsm001 36 21      Mark Kotsay 0.3155202   1
###2 willb002 87 34  Bernie Williams 0.3045360   2
###3 delgc001 82 32   Carlos Delgado 0.3028095   3
###4 wrigd002 55 23     David Wright 0.2995057   4
###5 snowj001 21 12       J. T. Snow 0.2957146   5
###6 galaa001 41 17 Andres Galarraga 0.2918315   6
###
###    BAT_ID AB H     Batter.Name  Estimate  Num
###1 velar001 49 8   Randy Velarde 0.2452739 1431
###2 hernr002 41 6 Ramon Hernandez 0.2451133 1432
###3 nixoo001 21 1      Otis Nixon 0.2446597 1433
###4 bross001 46 7   Scott Brosius 0.2441753 1434
###5 gomec001 30 3     Chris Gomez 0.2437600 1435
###6 randj002 48 7       Joe Randa 0.2421611 1436
According to this table, Mark Kotsay had the highest probability estimate against Moyer, and Joe Randa had the lowest. We get some interesting reversals in this ranking. For example, who was better: David Wright with a 23/55 = 0.418 batting average or Carlos Delgado with an average of 32/82 = 0.390? Wright had a higher average, but Delgado’s performance was based on a larger number of AB, so we have more information about Delgado’s true batting average. Our method makes an appropriate adjustment: Delgado’s estimate of 0.303 is actually higher than Wright’s estimate of 0.300.
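The reversals are easy to check by hand. The sketch below takes the H and AB values for the top six hitters from the output table, together with the printed values of eta and K, and compares the ranking by raw average with the ranking by smoothed estimate. The data frame and variable names are only for illustration.

```r
# Top six hitters against Moyer, copied from the output table
top <- data.frame(
  Batter = c("Mark Kotsay", "Bernie Williams", "Carlos Delgado",
             "David Wright", "J. T. Snow", "Andres Galarraga"),
  AB = c(36, 87, 82, 55, 21, 41),
  H  = c(21, 34, 32, 23, 12, 17),
  stringsAsFactors = FALSE
)

# Estimates printed in the Moyer output
eta <- 0.265939
K   <- 194.454480

top$Raw      <- top$H / top$AB
top$Estimate <- (top$H + K * eta) / (top$AB + K)

# Names ordered from best to worst under each measure
by_raw      <- top$Batter[order(-top$Raw)]
by_estimate <- top$Batter[order(-top$Estimate)]

by_raw
by_estimate
```

By raw average, J. T. Snow (12 for 21) sits just behind Kotsay and Wright is ahead of Delgado. By smoothed estimate, Snow drops to fifth and Delgado moves ahead of Wright, matching the order in the table.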
This function is fun to apply and you are welcome to try it out for any pitcher of interest. One thing you learn is that for some pitchers the variability between the hitting probabilities is small (I am thinking of Nolan Ryan), while for other pitchers (think of Dennis Eckersley) the variability is high. Pitchers who are especially tough against batters on the same side will have small estimates of $K$.
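To get a feel for what a given value of K means, it can be translated into a standard deviation of the true batting averages. The sketch below assumes the beta curve has shape parameters Kη and K(1 − η), which is consistent with the smoothing formula but is not stated explicitly in this post. The K values other than Moyer's are made up for comparison and are not estimates for any real pitcher, and beta_sd is a hypothetical helper.

```r
# Standard deviation of a beta curve with mean eta and shape parameters
# K * eta and K * (1 - eta) (assumed parameterization)
beta_sd <- function(eta, K) sqrt(eta * (1 - eta) / (K + 1))

# Moyer: eta and K as printed in the output above
moyer_sd <- beta_sd(0.265939, 194.454480)
round(moyer_sd, 3)

# Hypothetical K values, for comparison only
K_grid  <- c(50, 200, 800)
sd_grid <- sapply(K_grid, function(K) beta_sd(0.266, K))
data.frame(K = K_grid, SD = round(sd_grid, 3))
```

Under this assumption the true averages against Moyer have a standard deviation of roughly 0.032, and the spread shrinks as K grows.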
Once you have the Retrosheet data and the function, you can look at Eckersley and Ryan’s splits by typing:
B2 <- batter.matchup.ggplot("Dennis Eckersley")
B3 <- batter.matchup.ggplot("Nolan Ryan")
The lesson here is that batter-pitcher matchup data are not just noise. There is variation between the true batting averages against a specific pitcher, and one can measure this variation with a random-effects model. This allows us to smooth the raw batting averages in a reasonable way to learn about the good and bad matchups.