retrosheet Package and Comparing Count Rates

Introduction

In previous posts and here, I described how to download Retrosheet play-by-play data and read it into R. That process requires installing the Chadwick software. In particular, the Chadwick cwevent program is run from the terminal to parse the raw Retrosheet files. For some users, this approach is problematic because the Chadwick programs can be difficult to get running. The retrosheet package provides an alternative way of obtaining these Retrosheet files. In this post, I describe how to use the retrosheet package to download the play-by-play files. To illustrate the use of these data, I compare the rates of different counts for the 1995 and 2023 seasons.

Using the retrosheet Package

Suppose you are interested in collecting all plays in a particular baseball season, but you don’t have access to the Chadwick functions.

Specifically, suppose you wish to collect the plays for all games played by the Phillies (team abbreviation “PHI”) in the recent 2023 season. You load the retrosheet package and use the function get_retrosheet() with arguments “play”, 2023 and “PHI”:

library(retrosheet)
d <- get_retrosheet("play", 2023, "PHI")

The output d is a list of 81 elements, each element corresponding to one of the 81 home games played by the Phillies in this season. Here are the components of the first element.

names(d[[1]])
[1] "id"      "version" "info"    "start"   "play"    "com"    
[7] "sub"     "data"

The variable id contains the game id, info gives basic information about the game, start gives the starting players at each position for each team, and data gives the number of earned runs allowed by each pitcher in the game. The play component gives the outcome of each plate appearance (see below). Within play, team identifies the batting team (0 for the visitors, 1 for the home team), retroID gives the batter id, count is the ending count, pitches gives the sequence of pitches, and play is the outcome of the PA.

head(d[[1]]$play)
  inning team  retroID count pitches     play
1      1    0 indij001     2    CCFX     9/F9
2      1    0 friet001     1      CX 2/P2F/FL
3      1    0 fralj001    32  CTBBBS        K
4      1    1 turnt001    12    CFBS        K
5      1    1 schwk001    12    SSBS        K
6      1    1 realj001    12    CCBS        K
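
Judging from the printed output, the count column appears to drop a leading zero: the first PA has pitches CCFX and a count of 2, which is an 0-2 count, and 32 is a 3-2 count. Here is a minimal sketch of a helper that restores the balls-strikes form. The name format_count() is hypothetical and is not part of the retrosheet package.

```r
# Hypothetical helper: turn the count column into "balls-strikes" labels.
# Assumes count holds values such as 2, 1, 32 (as in the output above).
format_count <- function(count) {
  x <- sprintf("%02d", as.integer(count))      # pad back to two digits
  paste0(substr(x, 1, 1), "-", substr(x, 2, 2))
}

format_count(c(2, 1, 32, 12))
# [1] "0-2" "0-1" "3-2" "1-2"
```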

Since we are interested in collecting all plays this season, we need to run this get_retrosheet() function for each of the 30 teams. I wrote a short function get_my_retrosheet() that (1) runs the get_retrosheet() function for all teams, (2) adds the game id variable, and (3) row-merges the play files into a single data frame.
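
The three steps can be sketched roughly as follows. This is not the author's get_my_retrosheet() function; the names combine_games() and get_my_retrosheet_sketch() are hypothetical, the vector of team abbreviations is assumed to be supplied by the user, and each game element is assumed to have the id and play components shown above.

```r
# Hypothetical sketch: stack the play tables of a list of games,
# tagging every row with its game id.
combine_games <- function(games) {
  per_game <- lapply(games, function(g) {
    plays <- as.data.frame(g$play, stringsAsFactors = FALSE)
    plays$game_id <- as.character(g$id)[1]
    plays
  })
  do.call(rbind, per_game)
}

# Hypothetical sketch: repeat the download for every team and row-merge.
# `teams` is a character vector of team abbreviations (e.g. "PHI");
# `fetch` defaults to the package function used earlier in this post.
get_my_retrosheet_sketch <- function(season, teams,
                                     fetch = retrosheet::get_retrosheet) {
  per_team <- lapply(teams, function(tm) {
    combine_games(fetch("play", season, tm))
  })
  out <- do.call(rbind, per_team)
  rownames(out) <- NULL
  out
}
```

Passing the download function as an argument keeps the row-merging logic separate from the network call, so the merging step can be checked on a small made-up list before running it on a full season.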

Note that get_retrosheet() returns only a few variables. In contrast, the Chadwick cwevent function produces many more, such as the pitcher id, runners on base, number of outs, and batter and pitcher sides. With additional programming, one could derive the runners on base and the number of outs from the play variable.

Comparing Count Rates for Two Seasons

One pitcher I have always admired is Greg Maddux. He was remarkably efficient: he rarely walked batters and threw shutouts with fewer than 100 pitches. Here is an article about Maddux’s complete game with only 77 pitches. He rarely fell behind in the count and usually had two-strike counts on the hitter. He had a streak of 72 1/3 innings without issuing a walk; I believe I attended the Braves game in August 2001 where that streak ended.

But one of Maddux’s best seasons was 1995, and perhaps the “count environment” in 1995 was very different from the count environment in the recent 2023 season. Maybe in 1995 hitters were less patient and more likely to end the plate appearance within a few pitches. That raises the question: How did the rates of the different plate appearance counts (1-0, 0-1, 2-0, 1-1, 0-2, 3-0, 2-1, 1-2, 3-1, 2-2, 3-2) compare between 1995 and 2023?

Here’s an outline of my R work to address this question.

  1. Using the retrosheet package and my get_my_retrosheet() function, I collect all plays for the 1995 and 2023 seasons, which gives one data frame of plays for each season.
  2. I write a function add_count_variables() that (1) adds indicator variables c10, c01, … for the possible counts and (2) computes the percentage of PAs that have each possible count. This function uses retrosheet_add_counts() from the abdwr3edata package that accompanies the 3rd edition of Analyzing Baseball Data with R.
  3. Last, I create a data frame that gives the percentage for each possible count in the 1995 and 2023 seasons. To compare the percentages, I convert each percentage p to a logit, log(p) – log(1 – p), and then look at the difference
    D = logit(2023 rate) – logit(1995 rate)
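
Step 2 can be approximated in base R. The sketch below is not the retrosheet_add_counts() implementation from abdwr3edata. It is a stand-in that walks through each pitches string and records the counts the PA reaches. It assumes the usual Retrosheet pitch codes (B, I, P, V are balls; C, F, K, L, M, O, Q, R, S, T are strikes) and ignores every other character. The names counts_visited() and count_rates() are hypothetical.

```r
ball_codes   <- c("B", "I", "P", "V")
strike_codes <- strsplit("CFKLMOQRST", "")[[1]]

# Counts reached during one PA, e.g. "CTBBBS" -> 0-1, 0-2, 1-2, 2-2, 3-2
counts_visited <- function(pitches) {
  balls <- 0
  strikes <- 0
  seen <- character(0)
  for (ch in strsplit(as.character(pitches), "")[[1]]) {
    if (ch %in% ball_codes) {
      balls <- balls + 1
      if (balls == 4) break               # ball four ends the PA
    } else if (ch %in% strike_codes) {
      strikes <- min(strikes + 1, 2)      # fouls with two strikes stay at two
    } else {
      next                                # in-play, pickoff and other codes
    }
    seen <- c(seen, paste0(balls, "-", strikes))
  }
  unique(seen)
}

# Fraction of PAs that reach each of the eleven counts
count_rates <- function(pitch_seqs) {
  all_counts <- c("1-0", "0-1", "2-0", "1-1", "0-2", "3-0",
                  "2-1", "1-2", "3-1", "2-2", "3-2")
  visited <- lapply(pitch_seqs, counts_visited)
  sapply(all_counts, function(k) {
    mean(vapply(visited, function(v) k %in% v, logical(1)))
  })
}

# The first three pitch sequences from the head() output shown earlier
count_rates(c("CCFX", "CX", "CTBBBS"))
```

Applied to a season's data frame, the input would be the pitches column, and the result is a named vector of rates that can be converted to logits.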

Here is a dotplot of the difference in logit rates comparing the 1995 and 2023 seasons. The red vertical line corresponds to no change in rates between the two seasons.
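
For readers who want to compute the plotted quantity themselves, here is a minimal sketch of the logit difference D. The rates below are made-up placeholders for illustration only, not the actual 1995 or 2023 values, and logit_diff() is a hypothetical name.

```r
# logit(p) = log(p) - log(1 - p); base R's qlogis() gives the same value
logit <- function(p) log(p) - log(1 - p)

# D = logit(new rate) - logit(old rate), a log odds ratio
logit_diff <- function(p_new, p_old) logit(p_new) - logit(p_old)

# Placeholder rates for illustration only (NOT the real season values)
p_old <- c("0-1" = 0.40, "1-0" = 0.45)
p_new <- c("0-1" = 0.45, "1-0" = 0.40)

D <- logit_diff(p_new, p_old)
D         # positive: count more likely in the newer season
exp(D)    # the same comparison expressed as an odds ratio
```

A value of D equal to zero corresponds to the red vertical line in the plot, and exponentiating D gives an odds ratio.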

We see dramatic differences in the rates of counts for the two seasons. All of the two-strike counts and the 0-1 count were more likely to occur in the 2023 season compared to the 1995 season. Hitter counts such as 3-0, 2-0, and 1-0 were more likely in the 1995 season compared to 2023. The counts 3-1 and 2-1 were equally likely in the two seasons. Yes, Greg Maddux was remarkable in his pitch efficiency in 1995, but short PAs with fewer two-strike counts were more likely in 1995 than in 2023.

Got Code?

A single Quarto file containing all the R work for this exercise can be found on my GitHub Gist site. One can get HTML output of the R results by opening and rendering this Quarto file in RStudio. The script uses the retrosheet package to collect all of the play-by-play data for the two seasons (without using any Chadwick functions), and all of the new functions are contained in this file.

Related Posts

Back in 2015, Brian Mills and I had some posts (see here and here) on applications of an earlier release of the retrosheet package.

2 responses

  1. Great article! Can you explain why you transformed the percentages using the logit function instead of just taking the difference in percentages?

    I like the final result where you take the difference of logits and get a log odds-ratio of sorts! Very useful for comparing the counts between the two years.

    1. There is an issue with percentages — ones near 50% have more variation than those near 0 or 100. Taking a logit removes this problem so that logits of 50% have similar variability to logits of percentages close to 0 or 100.
