2018 Retrosheet Data and Length of a PA

Happy new year. Pitchers and catchers report around February 12-13, just a little over a month away.

The 2018 Retrosheet Play-by-Play Files

The Retrosheet play-by-play files for the 2018 season are now available from https://www.retrosheet.org/game.htm. Since I know some readers have struggled with accessing these files, I thought it would be good to review the process of accessing the data and then illustrate a brief analysis with it. In particular, I’ll explore the value, from a batter’s perspective, of extending a plate appearance by fouling the ball off. Can we quantify that value using some reasonable measure of performance?

Obtaining the Retrosheet Files

Almost five years ago (where does the time go?) I described the process of downloading Retrosheet play-by-play files by obtaining the Chadwick files and downloading several R functions. I also have a webpage that explains the process for downloading the 2018 season files. I recently tried this out and it worked fine on my Macintosh. Folks often struggle when the working directory is not set correctly, because the R functions then can’t find the Retrosheet files.

Length of a Plate Appearance

After reading in the Retrosheet data, I focus on the variable “PITCH_SEQ_TX” which gives the sequence of balls and strikes (and other events like throws to first base) for a plate appearance. I create a new variable “pseq” that contains only the pitch results and also a variable “pseq_length” that computes the number of pitches.
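As a rough illustration of this step, here is a minimal Python sketch (the code for this post is in R, so this is a translation under assumptions, not the code actually used). It assumes the non-pitch characters in PITCH_SEQ_TX are ".", ">", "+", "*", the pickoff digits "1", "2", "3", and "N" (no pitch). The function name clean_pitch_seq and the sample string are made up for the example.

```python
import re

# Characters assumed NOT to be pitches in a Retrosheet pitch sequence:
# "." play not involving the batter, ">" runner going, "+" and "*" modifiers,
# "1" "2" "3" pickoff throws, "N" no pitch.
NON_PITCH = re.compile(r"[.>+*123N]")

def clean_pitch_seq(pitch_seq_tx):
    """Return (pseq, pseq_length) for one plate appearance."""
    pseq = NON_PITCH.sub("", pitch_seq_tx)
    return pseq, len(pseq)

# Made-up sequence: ball, pickoff throw to first, called strike,
# runner going on a foul, then a ball put in play.
pseq, pseq_length = clean_pitch_seq("B1C>FX")
print(pseq, pseq_length)
```

Only the characters that survive the filter are counted, so pickoff throws and similar events do not inflate the pitch count.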

In passing, we should recognize the longest plate appearance in these data, by Brandon Belt on April 22 against Anaheim. This 21-pitch PA can be seen on YouTube. Looking at the pseq variable, we note that Brandon fouled off 11 consecutive pitches. On the 21st pitch, the EVENT_CD variable is 2 (an out), which in this PA was a flyout. So in this case, Brandon didn’t benefit from seeing 21 pitches.

How Does the Value of a PA Depend on the Number of Pitches?

A basic measure of the value of a PA is the run value: the run expectancy after the PA minus the run expectancy before it, plus the number of runs scored on the play. Below I graph the mean run value as a function of the number of pitches.
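To make the definition concrete, here is a small Python sketch of the run value calculation. The run expectancy numbers are hypothetical placeholders chosen for illustration, not estimates from the 2018 data, and the function name run_value is my own.

```python
def run_value(re_before, re_after, runs_scored):
    """Run value of a PA: change in run expectancy plus runs scored on the play."""
    return re_after - re_before + runs_scored

# Hypothetical run expectancies, for illustration only.
# A PA that raises the run expectancy from 0.5 to 0.9 with no runs scoring:
print(round(run_value(0.5, 0.9, 0), 3))

# A solo home run leaves the base-out state unchanged but scores one run:
print(round(run_value(0.5, 0.5, 1), 3))
```

In the second case the run expectancy terms cancel, so the run value is just the one run that scored.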

  • PAs that end in 1 or 2 pitches tend to be favorable to the hitter.
  • PAs that end in 3 to 5 pitches tend to favor the pitcher.
  • It is interesting that the mean run value is approximately a linear function of the number of pitches from 4 to 9.

The Count

The reader should quickly realize that the number of pitches is not the key variable in this study. Rather, the current pitch count tells a fan about the advantage to the batter or the pitcher and the value of the PA really depends on the count. So I enhance this plot by displaying the pitch count that ends the plate appearance. What do we see?

  • There is a slight advantage to the hitter in PAs that end on 0-0, 1-0, 0-1, 2-0, 1-1, 2-1, and 3-2 counts.
  • There are advantages to the pitcher on the 2-strike counts (except 3-2). Of course many of these 2-strike counts lead to strikeouts.
  • There are big advantages to the hitter on 3-ball counts (think walk).
  • Remember our original question was whether extending the PA benefits the hitter. Extending the PA does not appear to help the batter on a 3-2 count. But there appears to be some benefit for, say, 2-2 counts. For a 2-2 count, the mean run value steadily increases as a function of the number of pitches.
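The count that ends a PA can be recovered from the cleaned pitch sequence. The Python sketch below is one way to do it, under assumptions: the groupings of pitch codes into balls, strikes, and fouls reflect my reading of the Retrosheet pitch codes, the function name is hypothetical, and the sample sequences are made up.

```python
BALLS = set("BIPV")         # assumed ball codes
STRIKES = set("CSKMQTLO")   # assumed codes that always add a strike
FOULS = set("FR")           # fouls add a strike only with fewer than two strikes

def count_before_last_pitch(pseq):
    """Return the ball-strike count, e.g. "3-2", at the time of the final pitch."""
    balls = strikes = 0
    for pitch in pseq[:-1]:          # every pitch except the one that ends the PA
        if pitch in BALLS:
            balls += 1
        elif pitch in STRIKES:
            strikes += 1
        elif pitch in FOULS and strikes < 2:
            strikes += 1
    return f"{balls}-{strikes}"

# Made-up nine-pitch PA with several two-strike fouls, ending on a ball in play
print(count_before_last_pitch("CBFFBFBFX"), len("CBFFBFBFX"))
```

Because a foul with two strikes leaves the count unchanged, two PAs can end on the same count after very different numbers of pitches, which is what lets us separate the effect of the count from the effect of the PA length.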

Other Measures of Performance

The basic patterns above will follow for alternative measures of hitting performance. For example, below I explore how the home run rate (HR divided by PA) depends on the final pitch count and the number of pitches. We see …

  • Home runs are most likely to be hit on a 2-0 count.
  • Home runs are uncommon on two-strike counts and 3-0 counts.
  • This graph shows the value of extending the PA. On two-strike counts, the home run rate steadily increases as the number of pitches goes from 4 to 9.
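The home run rate for each combination of final count and number of pitches is just a grouped proportion. Here is a hedged Python sketch of that tabulation. It assumes EVENT_CD equal to 23 marks a home run, the function name hr_rate_by_group is hypothetical, and the plate appearances listed are made up rather than taken from the 2018 files.

```python
from collections import defaultdict

HOME_RUN = 23  # assumed Retrosheet EVENT_CD for a home run

def hr_rate_by_group(pas):
    """pas: iterable of (count, n_pitches, event_cd); returns HR / PA per group."""
    totals = defaultdict(lambda: [0, 0])   # group -> [home runs, plate appearances]
    for count, n_pitches, event_cd in pas:
        cell = totals[(count, n_pitches)]
        cell[0] += event_cd == HOME_RUN    # True counts as 1
        cell[1] += 1
    return {group: hr / pa for group, (hr, pa) in totals.items()}

# Made-up plate appearances, for illustration only
pas = [("2-0", 3, 23), ("2-0", 3, 2), ("2-2", 6, 3),
       ("2-2", 6, 23), ("2-2", 6, 2), ("2-2", 6, 3)]
print(hr_rate_by_group(pas))
```

With real data, groups with few plate appearances will give noisy rates, so it is worth checking the PA totals before reading much into any one cell.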


All of the R code for this exercise can be found on my GitHub Gist site. If you are just starting to work with R or you are having problems accessing the Retrosheet files, I’d suggest that you download some of my complete Retrosheet files from http://www-math.bgsu.edu/~albert/retrosheet/ and try this out for some previous seasons.
