As we know, baseball is no longer America’s game — I believe baseball has been surpassed in popularity by American football. One problem is that MLB games are long — games in the 2012 season were significantly longer than the games twenty years ago. In this post, we’re going to try to learn more about the variables that control the length of a baseball game. In doing so, we’ll illustrate merging Retrosheet play-by-play data with Retrosheet game logs.
A baseball game is essentially a sequence of pitches, so I would think the time of a game would be strongly related to the number of pitches.
The number of pitches in each game during the 2012 season can be obtained using the Retrosheet play-by-play files. I assume that a single text file “all2012.csv” has been saved in a “data” folder and that a “fields.csv” file in this same folder contains the names of all of the variables. (We explain this downloading process in Appendix A of our book.) We read these files into R and save the play-by-plays in a data frame called plays.
    season <- 2012
    file.name <- paste("data/all", season, ".csv", sep="")
    plays <- read.csv(file.name, header=FALSE)
    fields <- read.csv("data/fields.csv")
    names(plays) <- fields[, "Header"]
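If you want to check the naming step without the full season file, here is a minimal sketch on a tiny inline example. The three rows and the three-column layout are made up for illustration (the real files have many more fields); only the pattern of reading a headerless file and then attaching names from a "Header" column mirrors the code above.

```r
# Toy stand-in for the headerless play-by-play file (made-up rows).
plays.toy <- read.csv(text = "ANA201204060,1,CBFX\nANA201204060,1,BX\nANA201204070,2,CSS",
                      header = FALSE, stringsAsFactors = FALSE)

# Toy stand-in for fields.csv: one column, "Header", holding the variable names.
fields.toy <- read.csv(text = "Header\nGAME_ID\nINN_CT\nPITCH_SEQ_TX",
                       stringsAsFactors = FALSE)

# Attach the names, exactly as done for the full data frame.
names(plays.toy) <- fields.toy[, "Header"]
print(names(plays.toy))
```

The number of names has to equal the number of columns, so a quick comparison of length(fields.toy$Header) with ncol(plays.toy) is a cheap sanity check before assigning.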
The variable PITCH_SEQ_TX gives the pitch-by-pitch sequence, along with other events such as pickoff attempts and steals, for each play. We remove all non-pitches from this variable using the function gsub, creating a new variable pseq. The string function nchar computes the length of each string (the number of pitches in each plate appearance), and this is stored in the variable n.pitches.
    plays$pseq <- gsub("[.>123N+*]", "", plays$PITCH_SEQ_TX)
    plays$n.pitches <- nchar(plays$pseq)
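To see what the gsub and nchar steps do, here is a small sketch applied to three made-up pitch-sequence strings (they are illustrations, not rows from the 2012 file). Inside the square brackets every character is taken literally, so the pattern deletes the markers ".", ">", "1", "2", "3", "N", "+" and "*" and leaves only the pitch codes.

```r
# Hypothetical pitch sequences; the non-pitch markers are mixed in on purpose.
seqs <- c("CBFX", "1B>S.X", "BB*B1CFFX")

# Remove every non-pitch character listed in the bracket expression.
pitches.only <- gsub("[.>123N+*]", "", seqs)

# Each remaining character is one pitch, so string length = pitch count.
n <- nchar(pitches.only)

print(data.frame(seqs, pitches.only, n))
```

Here the three strings reduce to "CBFX", "BSX" and "BBBCFFX", giving 4, 3 and 7 pitches.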
The Retrosheet game logs for the 2012 season are stored in the file “gl2012.txt” and the corresponding header file is stored in “game_log_header.csv”. We read both data files into R, creating the data frame games.
    file.name <- paste("data/gl", season, ".txt", sep="")
    games <- read.csv(file.name, header=FALSE)
    headers <- read.csv("data/game_log_header.csv")
    names(games) <- names(headers)
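The play-by-play data frame already identifies each game by the variable GAME_ID, so in the game log data frame we create a matching variable GAME_ID (this facilitates merging the two data frames). The ID is built by pasting together the home team, the date, and a trailing "0", so it matches single games only; Retrosheet IDs for doubleheader games end in 1 or 2. Then we use the function ddply (in the plyr package) to count the number of pitches in each game, creating the new variable Pitches in the data frame game.pitches.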
    games$GAME_ID <- with(games, paste(HomeTeam, Date, "0", sep=""))
    library(plyr)
    game.pitches <- ddply(plays, .(GAME_ID), summarize, Pitches = sum(n.pitches))
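The per-game count is just a grouped sum. As a sketch of the same idea without the plyr dependency, base R's aggregate can be used instead; this is an alternative I am swapping in for illustration, not the approach used in the post, and the toy rows below are made up.

```r
# Toy plate-appearance data: two games, pitch counts per plate appearance.
toy <- data.frame(
  GAME_ID   = c("ANA201204060", "ANA201204060", "ANA201204070"),
  n.pitches = c(4, 6, 3),
  stringsAsFactors = FALSE
)

# Sum n.pitches within each GAME_ID (base R stand-in for the ddply call).
toy.pitches <- aggregate(n.pitches ~ GAME_ID, data = toy, FUN = sum)

# Rename the summed column so it matches the name used in the post.
names(toy.pitches)[2] <- "Pitches"
print(toy.pitches)
```

The first game totals 10 pitches and the second 3, one row per game, which is the same shape that game.pitches has.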
Now we use the merge function to merge the game.pitches and games data frames, using the matching variable GAME_ID. The merged data frame is called DATA. We use the head function to display a few rows of this data frame.
    DATA <- merge(game.pitches, games[, c("GAME_ID", "Duration")], by="GAME_ID")
    head(DATA)

           GAME_ID Pitches Duration
    1 ANA201204060     242      142
    2 ANA201204070     271      166
    3 ANA201204080     331      195
    4 ANA201204160     282      168
    5 ANA201204170     275      166
    6 ANA201204180     289      164
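One thing worth keeping in mind is that merge, by default, keeps only the rows whose GAME_ID appears in both data frames. The sketch below uses two toy data frames to show how a game silently drops out when the IDs differ in the last digit; the first row is taken from the output above, while the "BOS" rows and their values are invented for illustration.

```r
# Pitch totals keyed by play-by-play style IDs (second ID ends in 1).
pitches.toy <- data.frame(
  GAME_ID = c("ANA201204060", "BOS201204071"),
  Pitches = c(242, 300),
  stringsAsFactors = FALSE
)

# Game-log IDs built with a trailing "0" for every game.
games.toy <- data.frame(
  GAME_ID  = c("ANA201204060", "BOS201204070"),
  Duration = c(142, 180),
  stringsAsFactors = FALSE
)

# Default merge is an inner join: only IDs present in both survive.
inner <- merge(pitches.toy, games.toy, by = "GAME_ID")

# IDs that had pitch counts but found no matching game log row.
unmatched <- setdiff(pitches.toy$GAME_ID, games.toy$GAME_ID)
print(inner)
print(unmatched)
```

Comparing nrow(DATA) with nrow(game.pitches) after the real merge is a quick way to see how many games were lost this way.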
Now for the fun stuff. To see the relationship between the number of game pitches and time (in minutes), we construct a smoothed scatterplot (this is nice for displaying a large number of points) and overlay a best-fitting line.
    with(DATA, smoothScatter(Pitches, Duration))
    fit <- lm(Duration ~ Pitches, data=DATA)
    abline(fit, lwd=3, col="red")
By displaying the variable fit, we see the best line is
DURATION = 0.2725 + 0.6069 PITCHES
So each pitch in a baseball game adds (on average) about 0.6 of a minute (roughly 36 seconds) to the length of a game. But we see significant spread in the times for a given number of pitches, so obviously there are other important factors that affect the length of a game. In a future post, we’ll discuss these other factors.
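To make the fitted line concrete, here is a short sketch that plugs numbers into it. The intercept and slope are the values reported above, and the 242-pitch, 142-minute game is the first row of the head(DATA) output; the helper name pred.duration is my own.

```r
# Coefficients of the fitted line reported in the post.
intercept <- 0.2725
slope     <- 0.6069

# Predicted game length (minutes) for a given number of pitches.
pred.duration <- function(pitches) intercept + slope * pitches

# Slope converted from minutes to seconds per pitch (about 36.4).
seconds.per.pitch <- slope * 60

# First game in head(DATA): 242 pitches, 142 minutes observed.
pred  <- pred.duration(242)   # about 147.1 minutes
resid <- 142 - pred           # about -5.1 minutes
print(c(seconds.per.pitch = seconds.per.pitch, predicted = pred, residual = resid))
```

So that game ran about five minutes shorter than the line predicts for its pitch count, a small instance of the spread around the line.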