Length of baseball games

As we know, baseball is no longer America’s game: I believe it has been surpassed in popularity by American football. One problem is that MLB games are long, and games in the 2012 season were significantly longer than games twenty years earlier. In this post, we’ll try to learn more about the variables that affect the length of a baseball game. In doing so, we’ll illustrate merging Retrosheet play-by-play data with Retrosheet game logs.

A baseball game is essentially a sequence of pitches, so I would think the time of a game would be strongly related to the number of pitches.

The number of pitches in each game during the 2012 season can be obtained using the Retrosheet play-by-play files. I assume that a single text file “all2012.csv” has been saved in a “data” folder and a “fields.csv” file in this same folder contains the names of all of the variables. (We explain this downloading process in Appendix A of our book.) We read these files into R and save the play-by-plays in a data frame called plays.

season <- 2012
file.name <- paste("data/all", season, ".csv", sep="")
plays <- read.csv(file.name, header=FALSE)
fields <- read.csv("data/fields.csv")
names(plays) <- fields[, "Header"]
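As a small illustration of the header step, here is a sketch that uses inline text in place of the two files. The toy rows and the three column names are stand-ins chosen for illustration, not the real Retrosheet layout; the only thing taken from the code above is that the field names live in a column called Header.

```r
# Toy stand-ins for all2012.csv and fields.csv (illustrative contents only)
plays.toy <- read.csv(text = "ANA201204060,1,CBFBX\nANA201204060,1,BX",
                      header = FALSE)
fields.toy <- read.csv(text = "Header\nGAME_ID\nINN_CT\nPITCH_SEQ_TX")

# The header file should supply exactly one name per data column
stopifnot(ncol(plays.toy) == nrow(fields.toy))

# as.character() guards against the names being read as a factor
names(plays.toy) <- as.character(fields.toy[, "Header"])
names(plays.toy)
```

Checking the column count before assigning names is a cheap way to catch a header file that does not match the data file.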

The variable PITCH_SEQ_TX gives the pitch-by-pitch sequence and other events like pickoff attempts and steals for each play. We remove all non-pitches from this variable using the function gsub, creating a new variable pseq. The string function nchar computes the length of each string (the number of pitches in each plate appearance) and this is stored in the variable n.pitches.

plays$pseq <- gsub("[.>123N+*]", "", plays$PITCH_SEQ_TX)
plays$n.pitches <- nchar(plays$pseq)
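To see what the pattern does, here is a quick sketch on three made-up pitch sequences (hypothetical strings, not taken from the 2012 data). Every character inside the square brackets is deleted, and what remains is counted.

```r
# Hypothetical pitch sequences for three plays
seqs <- c("CBFBX", "B1>S.FX", "*BN+2CX")

# Drop the non-pitch symbols, exactly as in the code above
pseq <- gsub("[.>123N+*]", "", seqs)
pseq

# Count the characters that are left
n.pitches <- nchar(pseq)
n.pitches
```

The second string loses "1", ">" and ".", and the third loses "*", "N", "+" and "2", so the counts are 5, 4 and 3.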

The Retrosheet game logs for the 2012 season are stored in the file “gl2012.txt” and the corresponding header file is stored in “game_log_header.csv”. We read both data files into R, creating the data frame games.

file.name <- paste("data/gl", season, ".txt", sep="")
games <- read.csv(file.name, header=FALSE)
headers <- read.csv("data/game_log_header.csv")
names(games) <- names(headers)

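The play-by-play data frame already contains a GAME_ID variable, so we create a matching GAME_ID in the game log data frame (this facilitates the merging of the two). The ID is the home team, followed by the date, followed by a game number; we append "0", which identifies a single game, so games that were part of a doubleheader (numbered 1 or 2) will not find a match. Then we use the function ddply (in the plyr package) to count the number of pitches in each game of the play-by-play data, creating the new variable Pitches.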

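# Build a GAME_ID for each game log row: home team + date + game number.
# The hard-coded "0" identifies single games (not part of a doubleheader).
games$GAME_ID <- with(games, paste(HomeTeam, Date, "0", sep=""))
library(plyr)
# Total the per-play pitch counts within each game.
game.pitches <- ddply(plays, .(GAME_ID), summarize,
                      Pitches = sum(n.pitches))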
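Here is a small sketch of these two steps on toy data. The data frames games.toy and plays.toy are invented for illustration, and the per-game totals are computed with base R's aggregate rather than ddply so that the sketch runs without plyr; it is an alternative way to get the same kind of summary, not the code used in this post.

```r
# Toy game log rows (HomeTeam and Date as in the game log; values illustrative)
games.toy <- data.frame(HomeTeam = c("ANA", "ANA"),
                        Date = c(20120406, 20120407),
                        stringsAsFactors = FALSE)
games.toy$GAME_ID <- with(games.toy, paste(HomeTeam, Date, "0", sep = ""))
games.toy$GAME_ID

# Toy play-by-play rows with made-up pitch counts
plays.toy <- data.frame(GAME_ID = c("ANA201204060", "ANA201204060",
                                    "ANA201204070"),
                        n.pitches = c(5, 3, 4),
                        stringsAsFactors = FALSE)

# Sum the pitch counts within each game
game.pitches.toy <- aggregate(n.pitches ~ GAME_ID, data = plays.toy, FUN = sum)
names(game.pitches.toy)[2] <- "Pitches"
game.pitches.toy
```

The constructed IDs have the same form as the ones in the play-by-play data, which is what makes the later merge possible.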

Now we use the merge function to merge the game.pitches and games data frames, using the matching variable GAME_ID. The merged data frame is called DATA. We use the head function to display a few rows of this data frame.

DATA <- merge(game.pitches, 
              games[, c("GAME_ID", "Duration")], 
              by="GAME_ID")
head(DATA)
       GAME_ID Pitches Duration
1 ANA201204060     242      142
2 ANA201204070     271      166
3 ANA201204080     331      195
4 ANA201204160     282      168
5 ANA201204170     275      166
6 ANA201204180     289      164
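Since merge keeps only the IDs that appear in both data frames by default, it is worth checking how many games survive. The sketch below uses invented IDs and values (pitches.toy and log.toy are hypothetical) to show one way to list the games that fail to match; on the real data, comparing nrow(DATA) with nrow(game.pitches) would give a similar check.

```r
# Toy per-game pitch totals; the last two IDs end in 1 and 2 (a doubleheader)
pitches.toy <- data.frame(GAME_ID = c("AAA201201010", "AAA201201021",
                                      "AAA201201022"),
                          Pitches = c(280, 300, 265),
                          stringsAsFactors = FALSE)

# Toy game log IDs built with a trailing "0" for every game
log.toy <- data.frame(GAME_ID = c("AAA201201010", "AAA201201020",
                                  "AAA201201020"),
                      Duration = c(170, 185, 160),
                      stringsAsFactors = FALSE)

# Default merge is an inner join on GAME_ID
merged.toy <- merge(pitches.toy, log.toy, by = "GAME_ID")
nrow(merged.toy)

# Games with pitch totals but no matching game log row
unmatched <- setdiff(pitches.toy$GAME_ID, log.toy$GAME_ID)
unmatched
```

In this toy case only the single game merges, and both doubleheader games show up in unmatched.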

Now for the fun stuff. To see the relationship between the number of game pitches and time (in minutes), we construct a smoothed scatterplot (this is nice for displaying a large number of points) and overlay a best-fitting line.

with(DATA, smoothScatter(Pitches, Duration))
fit <- lm(Duration ~ Pitches, data=DATA)
abline(fit, lwd=3, col="red")

[Figure: smoothed scatterplot of Duration (minutes) against Pitches, with the fitted line in red]

By displaying the variable fit, we see the best line is

DURATION = 0.2725 + 0.6069 PITCHES

So each pitch in a baseball game adds, on average, about 0.6 of a minute (roughly 36 seconds) to the length of a game. But there is considerable spread in the durations for a given number of pitches, so other important factors clearly affect the length of a game. In a future post, we’ll discuss these other factors.
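As a quick check of the arithmetic, the sketch below plugs the reported coefficients into the fitted line for two of the games shown in the head(DATA) output above. The variable names are my own; only the coefficients and the two (Pitches, Duration) pairs come from the post.

```r
# Coefficients of the fitted line as reported above
intercept <- 0.2725
slope <- 0.6069

# Minutes per pitch converted to seconds per pitch
seconds.per.pitch <- slope * 60
seconds.per.pitch

# First and third games from the head(DATA) output
pitches <- c(242, 331)
observed <- c(142, 195)

# Predicted duration and the gap between observed and predicted
predicted <- intercept + slope * pitches
residual <- observed - predicted
round(data.frame(pitches, observed, predicted, residual), 1)
```

Both of these games ran about 5 to 6 minutes shorter than the line predicts, a small taste of the spread around the fit.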

2 responses

  1. I would like to look at this with some aspect of balls, strikes, called strikes etc. However, Retrosheet doesn’t explain the use of “>”. Do you know what “>” means in the context of a pitch sequence?

  2. We describe all of these pitch symbols on page 166 of our book. Specifically, “>” means a runner is going on the pitch.
