Chase Utley’s Hitting Slump

One of my favorite players is Chase Utley; I have a small statue of Chase holding the 2008 Phillies championship crown in my basement. As you may know, Chase is currently in a significant hitting slump. His current 2015 batting average is .103 and his BABIP (batting average on balls in play) is a remarkable 6 for 73, or .082. Of course, these statistics are not informative unless one knows what range of BABIP values is typical.

In particular, this post gives some insight into the following questions:

  • In Chase Utley’s career, is it unusual for him to have a BABIP that low in 73 opportunities?
  • How likely is a BABIP this low if Chase is a truly consistent hitter who gets a hit with a constant probability on balls put in play?

Streaky Patterns in Utley’s Career

To answer the first question, we need Retrosheet play-by-play data for all of Utley’s seasons. I have a data frame pbp.00.14 containing the Retrosheet data for the 2000 through 2014 seasons. I use the Master data frame from the Lahman package to find the Retrosheet id for Utley and create a data frame Chase containing the rows where Chase was the batter.

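library(Lahman)
# Look up Utley's Retrosheet id (newer Lahman releases name this table People)
chase <- subset(Master, nameFirst=="Chase" & 
                        nameLast =="Utley")$retroID
# Keep the plays where Utley batted and sort them chronologically
Chase <- subset(pbp.00.14, BAT_ID == chase)
Chase$DATE <- substr(Chase$GAME_ID, 4, 12)
Chase <- Chase[order(Chase$DATE), ]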

I want to limit my attention to PA’s where Chase has an in-play out, error, fielder’s choice, single, double, or triple. I also want to include SF’s since those are in the denominator of the BABIP statistic. A new variable Outcome is created that is 1 if Chase gets a hit and 0 otherwise.

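# Retrosheet event codes: 2 = out, 18 = error, 19 = fielder's choice,
# 20, 21, 22 = single, double, triple
Chase.IP <- subset(Chase, SF_FL==TRUE | 
                     EVENT_CD %in% c(2, 18, 19, 20, 21, 22))
Chase.IP$Outcome <- ifelse(Chase.IP$H_FL > 0, 1, 0)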
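
As a quick check on the definition, here is a minimal sketch of the usual BABIP formula, (H - HR) / (AB - SO - HR + SF), written as a small R function. The function name babip and the season line passed to it are made up for illustration; they are not Utley's numbers. The last two lines simply redo the 6 for 73 arithmetic quoted at the top of the post.

```r
# Hypothetical helper (not part of the original analysis):
# BABIP = (H - HR) / (AB - SO - HR + SF)
babip <- function(h, hr, ab, so, sf) {
  (h - hr) / (ab - so - hr + sf)
}

# Made-up season line, for illustration only
example.babip <- babip(h = 150, hr = 20, ab = 550, so = 100, sf = 5)
round(example.babip, 3)

# The slump quoted earlier: 6 hits in 73 in-play opportunities
slump.babip <- 6 / 73
round(slump.babip, 3)
```

With the 0/1 coding used here, the mean of the Outcome column over any set of rows plays the role of this ratio for those rows.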

For all sequences of 73 consecutive in-play opportunities (1 through 73, 2 through 74, etc.) in Utley’s career, I use the filter function to compute the “moving” BABIPs and I plot them using ggplot2.

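library(ggplot2)
# Moving BABIP: centered moving average of the 0/1 outcomes, window of 73
F <- data.frame(AB=1:dim(Chase.IP)[1],
                BABIP=as.numeric(stats::filter(Chase.IP$Outcome,
                            rep(1/73, 73))))
ggplot(F, aes(AB, BABIP)) + geom_line() + geom_smooth() +
  labs(title="Utley's Moving BABIP with Window of 73 Opps") +
  ylim(0.1, 0.5)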
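
To see what the filter function is doing, here is a small sketch on a toy vector of 0/1 outcomes with a window of 3 instead of 73. The data are made up, and the names outcome, window, moving and by.hand are only for this illustration. With equal weights and the default two-sided setting, each filtered value is the plain average of the window centered at that position, and positions too close to either end come back as NA.

```r
# Toy 0/1 outcomes (made-up data, not Utley's)
outcome <- c(1, 0, 0, 1, 1, 0, 1, 0)
window <- 3

# Centered moving average with equal weights 1/window
moving <- as.numeric(stats::filter(outcome, rep(1 / window, window)))

# The same averages computed by hand for the interior positions
by.hand <- sapply(2:(length(outcome) - 1),
                  function(i) mean(outcome[(i - 1):(i + 1)]))

# Side-by-side comparison; the two ends of `moving` are NA
data.frame(position = 2:7, filter = moving[2:7], by.hand = by.hand)
```

With a window of 73 the first and last 36 positions are NA for the same reason, so ggplot2 will warn that it dropped some rows.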


It is interesting that Utley has experienced many highs and lows in his moving 73-opportunity BABIP over his career. There appears to be more volatility in BABIP early in his career. Although the general level of BABIP dropped later in his career, his 73-opportunity BABIP values seem more stable. Although Utley experienced one stretch of in-play hitting where his BABIP was close to 0.5, he has never experienced a BABIP anywhere close to .082. The current slump appears to be unusual for Utley.

Streaky Patterns in Random Data

Let’s address the second question. Suppose that Utley is a truly consistent hitter, in that the chance of him getting a hit on a ball in play is always 0.300 (this is approximately Utley’s career average). By chance alone, would we see him go through a sequence of 73 balls in play where his BABIP is as low as 0.100?

This type of random experiment is easy to program in R. I program one simulation using the function one.simulation, where the input prob is the constant probability of a hit on a ball in play. I generate random outcomes (using the built-in runif function) and output a data frame with the moving averages.

one.simulation <- function(prob){
  N <- dim(Chase.IP)[1]
  Outcome <- as.numeric(runif(N) < prob)
  data.frame(AB=1:N, BABIP=as.numeric(stats::filter(Outcome,
                        rep(1/73, 73))))
}

I repeat this 9 times below with the constant probability 0.300, storing the results in a single data frame. I graph the individual moving average plots with a single ggplot2 expression that uses facets.

SIMS <- NULL
for (j in 1:9){
  SIMS <- rbind(SIMS,
                data.frame(Run=paste("Simulation", j),
                           one.simulation(0.300)))
}
ggplot(SIMS, aes(AB, BABIP)) + geom_line() +
    geom_smooth() + facet_wrap(~ Run, ncol=3) + ylim(0.1, 0.5)


The main point to observe is that there is much variation in the moving BABIP values even assuming that Utley is a constant 0.300 hitter. We see several values of BABIP close to 0.500 and there is a single BABIP value close to 0.100.
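
Nine plots give only a visual impression, so here is a rough sketch of how one might put numbers on the same question. It simulates many careers of a constant 0.300 hitter and records the lowest moving 73-opportunity BABIP in each. Everything here is an assumption for illustration: the career length N = 5000 is a stand-in for the number of rows in Chase.IP, the helper min.moving.babip is hypothetical, and 1000 replications is an arbitrary choice. For comparison, the last line gives the binomial chance that one fixed window of 73 contains 7 or fewer hits (7/73 is the largest value not above 0.100).

```r
set.seed(2015)
N <- 5000       # assumed career length in balls in play (not Utley's actual count)
window <- 73
prob <- 0.300

# Lowest moving BABIP in one simulated career of a constant-probability hitter
min.moving.babip <- function(prob, N, window) {
  outcome <- as.numeric(runif(N) < prob)
  moving <- stats::filter(outcome, rep(1 / window, window))
  min(moving, na.rm = TRUE)
}

mins <- replicate(1000, min.moving.babip(prob, N, window))

# Share of simulated careers with at least one window at or below 0.100
share.low <- mean(mins <= 0.100)
share.low

# Chance that a single, fixed window of 73 has 7 or fewer hits
p.single <- pbinom(7, size = window, prob = prob)
p.single
```

The two numbers answer different questions: the first looks for the worst stretch anywhere in a long career, while the second looks at one window chosen in advance, so the first should be noticeably larger.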

I think it is premature at this point in the season to conclude that something is wrong with Utley’s hitting. Hot streaks and slumps are pretty common in “random” data, although the media is quick to give an explanation for each of these occurrences. I know there are some possible explanations for this performance, such as not hitting many hard-hit balls, but bad luck may be part of the story.

For more information about using R to explore moving averages and streaks in hitting data, look at Chapter 10 of Analyzing Baseball Data with R.