Downloading Retrosheet data and runs expectancy

In our book, Max and I describe the process of downloading Retrosheet play-by-play data (Appendix A.1) and computing run values of all events (Chapter 5). Here we illustrate some updated functions for downloading the data and computing the run values.

Downloading the Data
Our book assumes a Windows environment, but I now use a Mac laptop, so I was motivated to adapt our downloading instructions for a Mac.

1.  If you have a Mac, you need to install the Chadwick files.  Here is an excellent description of how to install the Chadwick Software on a Mac.

2.  In the current working directory, create a folder “download.folder” with two subfolders, “unzipped” and “zipped”, inside it.  From the book’s script and data web site, download the file fields.csv and put it in the “unzipped” folder.  (On a Windows computer, you also need the Chadwick program cwevent.exe inside the “unzipped” folder.)

3.  I updated a function parse.retrosheet2.pbp (a slight modification of the one provided in our book) that downloads the Retrosheet play-by-play and roster files for a particular season, uses the Chadwick program to extract the data, and creates two data files. What’s new is that the function works on both Mac and Windows. You can see the function here; you can read it into R by typing:

library(devtools)
source_gist(8892981)  # reads in parse.retrosheet2.pbp 
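
The folder layout from step 2 can also be created from R rather than by hand. The following is a minimal sketch, assuming the folder names used above; the helper name setup_download_folders is hypothetical and is not part of the book’s scripts or the gist.

```r
# Hypothetical helper (not from the book or the gist): create the folder
# layout described in step 2 under a chosen base directory.
setup_download_folders <- function(base = getwd()) {
  top <- file.path(base, "download.folder")
  subfolders <- file.path(top, c("unzipped", "zipped"))
  for (folder in subfolders) {
    # recursive = TRUE also creates download.folder itself when needed;
    # showWarnings = FALSE keeps reruns quiet if the folders already exist
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  # fields.csv still has to be downloaded from the book's web site by hand
  fields_file <- file.path(top, "unzipped", "fields.csv")
  if (!file.exists(fields_file)) {
    message("Remember to copy fields.csv into ", dirname(fields_file))
  }
  invisible(subfolders)
}
```

Calling setup_download_folders() with no arguments works in the current working directory, which is where parse.retrosheet2.pbp expects to find “download.folder”.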

Computing Run Values
1. In Chapter 5, I describe how to compute the run values for all plays. I have put all of the R code into a new function compute.runs.expectancy which is found here and can be sourced into R:

source_gist(8892999)  # reads in compute.runs.expectancy function

2. Now we’re ready to download, say, all of the 2013 season play-by-play data and compute the run values.

# reads in 2013 retrosheet files, creating two new csv files
parse.retrosheet2.pbp(2013)
# move to folder containing all2013.csv, roster2013.csv files 
setwd("download.folder/unzipped")
# computes runs values and other variables for all states
d2013 <- compute.runs.expectancy(2013)

3. The data frame d2013 contains all of the 2013 play data. I wrote a short function runs.expectancy to compute the expected runs in the remainder of the inning for all 24 bases/outs situations.

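runs.expectancy <- function(data){
  # average runs scored in the remainder of the inning for each state
  RUNS <- with(data, aggregate(RUNS.ROI, list(STATE), mean))
  # the fifth character of STATE (e.g. "000 0") is the number of outs
  RUNS$Outs <- substr(RUNS$Group.1, 5, 5)
  RUNS <- RUNS[order(RUNS$Outs), ]  
  # arrange the 24 means as 8 base situations (rows) by 3 outs (columns)
  RUNS.out <- matrix(round(RUNS$x, 2), 8, 3)
  dimnames(RUNS.out)[[2]] <- c("0 outs", "1 out", "2 outs")
  dimnames(RUNS.out)[[1]] <- c("000", "001", "010", 
                               "011", "100", "101", "110", "111")
  RUNS.out
}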
# illustrate this for 2013 play-by-play data
runs.expectancy(d2013)
    0 outs 1 out 2 outs
000   0.46  0.24   0.09
001   1.31  0.92   0.35
010   1.09  0.62   0.30
011   2.00  1.37   0.55
100   0.82  0.50   0.21
101   1.80  1.11   0.49
110   1.39  0.84   0.40
111   2.17  1.56   0.72
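
Chapter 5 of the book defines the run value of a play as the runs expectancy of the new state, minus the runs expectancy of the old state, plus the runs scored on the play. As a small sketch of that calculation, the block below retypes the 2013 matrix printed above and applies the formula to a made-up play. The function name run.value and the example play are illustrative assumptions; they are not part of compute.runs.expectancy.

```r
# 2013 runs expectancy matrix, retyped from the output above
# (rows: runners on 1st/2nd/3rd as 0/1 flags; columns: outs)
RE2013 <- matrix(c(0.46, 0.24, 0.09,
                   1.31, 0.92, 0.35,
                   1.09, 0.62, 0.30,
                   2.00, 1.37, 0.55,
                   0.82, 0.50, 0.21,
                   1.80, 1.11, 0.49,
                   1.39, 0.84, 0.40,
                   2.17, 1.56, 0.72),
                 nrow = 8, ncol = 3, byrow = TRUE,
                 dimnames = list(c("000", "001", "010", "011",
                                   "100", "101", "110", "111"),
                                 c("0 outs", "1 out", "2 outs")))

# Illustrative helper: run value = RE(new state) - RE(old state) + runs scored.
# Outs are given as 0, 1, or 2; a play that ends the inning (new.outs = 3)
# leaves a runs expectancy of zero.
run.value <- function(RE, bases, outs, new.bases, new.outs, runs.scored = 0) {
  old <- RE[bases, outs + 1]
  new <- if (new.outs >= 3) 0 else RE[new.bases, new.outs + 1]
  unname(new - old + runs.scored)
}

# Made-up play: runner on first, no outs; a double puts runners on
# second and third with no outs and no run scoring.
run.value(RE2013, "100", 0, "011", 0)
```

For this play the value is 2.00 - 0.82 = 1.18 runs, since the team moves from the “100” row to the “011” row in the “0 outs” column without scoring.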

The 2013 expected runs are similar to those found from 2011 season data in Chapter 5 of the book. In the next post, I’ll use this play data frame to see which players last season were best at performing in the clutch.

29 responses

  1. Jim, I appreciate all your efforts to bring the layperson up to speed in creating his own Retrosheet database, and I’m definitely close but I got stuck installing the Chadwick files. I was following the pitch by pitch article you link to in step 1, but in step 6 of that article, when typing in ./configure to Terminal, it gave me:

    hnt:~ paulsingman$ cd Downloads/chadwick-0.6.3
    hnt:chadwick-0.6.3 paulsingman$ ./configure
    checking for a BSD-compatible install... /usr/bin/install -c
    checking whether build environment is sane... yes
    /Users/paulsingman/Downloads/chadwick-0.6.3/missing: Unknown `--is-lightweight' option
    Try `/Users/paulsingman/Downloads/chadwick-0.6.3/missing --help' for more information
    configure: WARNING: 'missing' script is too old or missing
    checking for a thread-safe mkdir -p... ./install-sh -c -d
    checking for gawk... no
    checking for mawk... no
    checking for nawk... no
    checking for awk... awk
    checking whether make sets $(MAKE)... no
    checking whether make supports nested variables... no
    checking for gcc... no
    checking for cc... no
    checking for cl.exe... no
    configure: error: in `/Users/paulsingman/Downloads/chadwick-0.6.3':
    configure: error: no acceptable C compiler found in $PATH
    See `config.log' for more details
    hnt:chadwick-0.6.3 paulsingman$

    Any help you can offer on this?

  2. I figured out my problem: I had to download Xcode.

  3. I also have a problem when I get to step 6. Changing the directory works fine, but then I get “-bash: ./configure: No such file or directory”.
    I should also say that I have very little idea what any of these steps mean, so please don’t assume any knowledge on my part.

    1. Any solution to the problem with ./configure? I am getting the same error.

      1. The problems people have tend to be with installing the chadwick files on a Mac. I just followed the instructions that were given in that particular web page and things worked fine. Once you have the chadwick files, my R functions should work.

      2. I think I figured out that my issue was that I had downloaded the wrong file. If I remember correctly, I downloaded chadwick-0.6.3.zip, rather than chadwick-0.6.3.tar.gz

  4. Andrew, why don’t you send your R code to me at albertcb1@gmail.com and I can probably figure out your problem. Jim

    1. Great, thank you so much. Just sent the email.

  5. I’m having trouble with a “fields.csv” error. The file is not anywhere to be found… PS: total newbie but having a blast learning!!!
    > d2013 <- compute.runs.expectancy(2013)
    Error in file(file, "rt") : cannot open the connection
    In addition: Warning message:
    In file(file, "rt") :
    cannot open file 'all2013.csv': No such file or directory
    Called from: read.csv(data.file, header = FALSE)

    1. In order for that compute.runs.expectancy function to work, you need to have the all2013.csv
      and fields.csv files in the current working directory. The parse.retrosheet2.pbp function
      downloads the data from Retrosheet and creates the all2013.csv file. Maybe you have the
      right files, but R is looking in the wrong place.

      1. Thanks, I have the all2013.csv and roster2013.csv. No fields.csv… I used the parse.retrosheet2.pbp function to get the data and set the wd to the right place. The only thing I’m missing is the fields.csv.

    1. Thanks! That worked, I promptly noticed that the link is in the article after your last response… PS, I love your book “Teaching Statistics using Baseball” just got a copy a few days ago, good stuff.

  6. I downloaded Chadwick and have the Chadwick file in my working directory. I also use source to read in parse.retrosheet2.pbp. However, I got this error:
    Error in download.file(url = paste("http://www.retrosheet.org/events", :
    cannot open destfile 'download.folder/zipped/2013eve.zip', reason 'No such file or directory'

    I think the url is wrong. Can you please give me some advice to overcome this error?

    1. Garret, it seems that this is a directory issue. Check that in your current working directory, you have a download.folder, and inside that folder you have a zipped folder. Typically, I’m not in the right directory when I get an error message like that. Otherwise (assuming you have your chadwick program installed), it should work fine. In fact, I just downloaded the 2014 season data today.

      1. I don’t know if you will see this, as it is almost 4 years later, but I’m getting the same error. I changed my directories with the download.folder (and others) multiple times just to be sure. It’s driving me insane that I don’t know why it won’t work.

  7. What happens if I want to download more than one season at a time?

  8. Eliseo Avramides

    Hello Jim, I am sorry to be asking about this so many years after your post, but I am having the following issue whenever I try to parse any season.

    > parse.retrosheet2.pbp(2018)
    trying URL 'http://www.retrosheet.org/events/2018eve.zip'
    Content type 'application/zip' length 2490677 bytes (2.4 MB)
    downloaded 2.4 MB

    Chadwick expanded event descriptor, version 0.7.0
    Type 'cwevent -h' for help.
    Copyright (c) 2002-2017
    Dr T L Turocy, Chadwick Baseball Bureau (ted.turocy@gmail.com)
    This is free software, subject to the terms of the GNU GPL license.

    [Processing file 2018*.EV*.]
    Warning: could not open file '2018*.EV*'
    >

    After this, the ‘roster.csv’ is created without any issue, but ‘all.csv’ comes up empty.

    Thanks in advance!

    1. Eliseo:

      Sorry, I’m not sure what the problem is. It appears that you have downloaded the Chadwick files. I would check that you have actually downloaded the Retrosheet event files; when they are unzipped you should have 30 files in the unzipped folder. At the command line, you can check whether you are able to process a single event file with the cwevent program.

      If you want to see a sample Retrosheet season file, I have several in the folder http://www-math.bgsu.edu/~albert/retrosheet/ (these are in Rdata format).

      Good luck.

      Jim

      1. An update: I found that downgrading from Chadwick v0.7 to v0.6 did the trick. Apparently, v0.6 supports wildcard expansion in command arguments, but v0.7 does not?

    2. Hey Eliseo: I found that by downgrading my version of Chadwick from v0.7 to v0.6, I fixed my issue. Hope this helps!

      1. Eliseo Avramides

        Hello guys, I’ve just tried what John suggested and it also did the trick for me. Thanks a lot!

  9. Am wanting to apply the Sosa/McGwire exercise to look at Rickey Henderson’s 1982 season and his thievery. When working with the function I get the all1982.csv file generated ok but then it stalls and gives me the below error message. Any thoughts?

    Error in write_csv(., path = file.path("retrosheet", "unzipped", paste0("roster", :
    could not find function "write_csv"

    1. Rob, it appears that you don’t have the readr package loaded that has the write_csv() function. Jim

      1. That was it. You are a saint, Jim. Thanks.

  10. I’m having issues downloading the Chadwick files. It’s being flagged as harmful to my computer. Is this still active? I’m trying to get 2022 pbp data. Thank you.

    1. Here is the link for the Chadwick files. https://github.com/chadwickbureau/chadwick
      I think they should be okay to download on your computer.

      1. Thank you. I don’t see “cwevent.exe” anywhere in the Chadwick zipped folder I downloaded from GitHub. There is “cwevent.rst” but not .exe. I’m still a novice with R, but I downloaded the 2022 pbp zip file and the book isn’t clear about where to place it. Should I put it in the zipped folder in my working directory?

        I’ll keep playing with it and asking ChatGPT for troubleshooting but I’m just not following the book or instructions above very well. I won’t take up too much of your time. I’m loving the book though. This is the only issue I’ve had so far.

      2. I haven’t installed these files recently, but here are some suggestions. I have a Mac, so I downloaded the tar.gz version of the files from https://github.com/chadwickbureau/chadwick/releases. If you are on Windows, then I would download the zip version. After you have unzipped the files, you should find instructions for installation. After I installed them, files like cwevent.exe were available.
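
Several of the problems in this thread come down to R not finding the Chadwick program or the expected folders. A quick check from R might look like the sketch below. The function name check_retrosheet_setup is hypothetical, and the sketch assumes the folder layout from the post and that cwevent is either on the system PATH or, on Windows, sits in the unzipped folder as cwevent.exe.

```r
# Hypothetical helper: report which pieces of the setup R can see.
check_retrosheet_setup <- function(base = getwd()) {
  unzipped <- file.path(base, "download.folder", "unzipped")
  zipped <- file.path(base, "download.folder", "zipped")
  c(
    zipped_folder   = dir.exists(zipped),
    unzipped_folder = dir.exists(unzipped),
    fields_csv      = file.exists(file.path(unzipped, "fields.csv")),
    # Sys.which() returns "" when the program is not found on the PATH
    cwevent         = nzchar(Sys.which("cwevent")[[1]]) ||
                      file.exists(file.path(unzipped, "cwevent.exe"))
  )
}
```

Running check_retrosheet_setup() from the working directory returns a named logical vector; any FALSE entry points to the piece that still needs attention before calling parse.retrosheet2.pbp.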
