Downloading Retrosheet data and runs expectancy

In our book, Max and I describe the process of downloading Retrosheet play-by-play data (Appendix A.1) and computing run values of all events (Chapter 5). Here we illustrate some updated functions for downloading the data and computing the run values.

Downloading the Data
Our book assumes a Windows environment, but I now use a Mac laptop, so I adapted our downloading instructions to work on a Mac.

1.  If you have a Mac, you need to install the Chadwick files.  Here is an excellent description of how to install the Chadwick Software on a Mac.

2.  In the current working directory, create a “download.folder”, and two subfolders “unzipped” and “zipped” inside “download.folder”.  From the book script and data web site, download the file fields.csv and put this file in the “unzipped” folder.  (For a Windows computer, you need to have the Chadwick cwevent.exe inside the “unzipped” folder.)

3.  I updated a function, parse.retrosheet2.pbp (a slight modification of the one provided in our book), that downloads the Retrosheet play-by-play and roster files for a particular season, uses the Chadwick program to extract the data, and creates two data files. What’s new is that the function works on both Mac and Windows. You can see the function here, and you can read it into R by typing:

library(devtools)
source_gist(8892981)  # reads in parse.retrosheet2.pbp 
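
The folder layout from step 2 can also be created from R rather than by hand. What follows is only a minimal sketch in base R: the helper name setup_download_folders and its base argument are hypothetical (they are not part of the book's scripts), and the helper does not download fields.csv; it only reports whether that file is already in place.

```r
# Hypothetical helper (not part of the book's scripts): builds the folder
# layout described in step 2 under `base` and checks for fields.csv.
setup_download_folders <- function(base = ".") {
  root <- file.path(base, "download.folder")
  for (sub in c("unzipped", "zipped")) {
    # recursive = TRUE also creates download.folder itself if needed
    dir.create(file.path(root, sub), recursive = TRUE, showWarnings = FALSE)
  }
  fields <- file.path(root, "unzipped", "fields.csv")
  if (!file.exists(fields)) {
    message("fields.csv not found; download it and place it at: ", fields)
  }
  # TRUE if fields.csv is already in the unzipped folder, FALSE otherwise
  invisible(file.exists(fields))
}

# Try it in a temporary directory so the real working directory is untouched
demo <- file.path(tempdir(), "retrosheet-demo")
dir.create(demo, showWarnings = FALSE)
has_fields <- setup_download_folders(demo)
list.files(file.path(demo, "download.folder"))
```

For real use, call setup_download_folders() from the working directory in which you plan to run parse.retrosheet2.pbp, then copy fields.csv (and, on Windows, cwevent.exe) into the “unzipped” folder.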

Computing Run Values
1. In Chapter 5, I describe how to compute the run values for all plays. I have put all of the R code into a new function compute.runs.expectancy which is found here and can be sourced into R:

source_gist(8892999)  # reads in compute.runs.expectancy function

2. Now we’re ready to download, say, all of the 2013 season play-by-play data and compute the run values.

# reads in 2013 retrosheet files, creating two new csv files
parse.retrosheet2.pbp(2013)
# move to folder containing all2013.csv, roster2013.csv files 
setwd("download.folder/unzipped")
# computes runs values and other variables for all states
d2013 <- compute.runs.expectancy(2013)

3. The data frame d2013 contains all of the 2013 play data. I wrote a short function runs.expectancy to compute the expected runs in the remainder of the inning for each of the 24 bases/outs situations.

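runs.expectancy <- function(data){
  # mean runs scored in the remainder of the inning, by bases/outs state
  RUNS <- with(data, aggregate(RUNS.ROI, list(STATE), mean))
  # the fifth character of STATE is the number of outs
  RUNS$Outs <- substr(RUNS$Group.1, 5, 5)
  # sort by outs so the 24 means fill the matrix one outs column at a time
  RUNS <- RUNS[order(RUNS$Outs), ]
  RUNS.out <- matrix(round(RUNS$x, 2), 8, 3)
  dimnames(RUNS.out)[[2]] <- c("0 outs", "1 out", "2 outs")
  dimnames(RUNS.out)[[1]] <- c("000", "001", "010", 
                               "011", "100", "101", "110", "111")
  RUNS.out
}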
# illustrate this for 2013 play-by-play data
runs.expectancy(d2013)
    0 outs 1 out 2 outs
000   0.46  0.24   0.09
001   1.31  0.92   0.35
010   1.09  0.62   0.30
011   2.00  1.37   0.55
100   0.82  0.50   0.21
101   1.80  1.11   0.49
110   1.39  0.84   0.40
111   2.17  1.56   0.72
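
To connect this matrix back to the run values of Chapter 5, here is a small worked sketch. It assumes the usual definition (run value = runs expectancy after the play, minus runs expectancy before the play, plus runs scored on the play) and reads the row labels as first, second, third base, so "100" means a runner on first. The helper run_value and its arguments are hypothetical names used for illustration; they are not part of compute.runs.expectancy.

```r
# Runs expectancy matrix copied from the 2013 output above
re2013 <- matrix(
  c(0.46, 0.24, 0.09,
    1.31, 0.92, 0.35,
    1.09, 0.62, 0.30,
    2.00, 1.37, 0.55,
    0.82, 0.50, 0.21,
    1.80, 1.11, 0.49,
    1.39, 0.84, 0.40,
    2.17, 1.56, 0.72),
  nrow = 8, byrow = TRUE,
  dimnames = list(c("000", "001", "010", "011", "100", "101", "110", "111"),
                  c("0 outs", "1 out", "2 outs"))
)

# Hypothetical helper: run value of a single play, given the states
# before and after it and the runs that scored on the play
run_value <- function(re, bases_before, outs_before,
                      bases_after, outs_after, runs_scored = 0) {
  before <- re[bases_before, outs_before + 1]
  # after the third out the inning is over, so no further runs are expected
  after <- if (outs_after >= 3) 0 else re[bases_after, outs_after + 1]
  unname(after - before + runs_scored)
}

# Leadoff single: bases empty, 0 outs -> runner on first, 0 outs
round(run_value(re2013, "000", 0, "100", 0), 2)                    # 0.36
# Solo home run with bases empty: same state afterwards, one run scores
round(run_value(re2013, "000", 0, "000", 0, runs_scored = 1), 2)   # 1
# Inning-ending out with the bases loaded and 2 outs
round(run_value(re2013, "111", 2, "000", 3), 2)                    # -0.72
```

Because the matrix entries are already rounded to two decimals, these values are only approximate; the function described above works from the play-by-play data directly.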

The 2013 expected runs are similar to those found from 2011 season data in Chapter 5 of the book. In the next post, I’ll use this play data frame to see which players were best at performing in the clutch last season.

15 responses

  1. Jim, I appreciate all your efforts to bring the layperson up to speed in creating his own Retrosheet database, and I’m definitely close but I got stuck installing the Chadwick files. I was following the pitch by pitch article you link to in step 1, but in step 6 of that article, when typing in ./configure to Terminal, it gave me:

    hnt:~ paulsingman$ cd Downloads/chadwick-0.6.3
    hnt:chadwick-0.6.3 paulsingman$ ./configure
    checking for a BSD-compatible install... /usr/bin/install -c
    checking whether build environment is sane... yes
    /Users/paulsingman/Downloads/chadwick-0.6.3/missing: Unknown `--is-lightweight' option
    Try `/Users/paulsingman/Downloads/chadwick-0.6.3/missing --help' for more information
    configure: WARNING: 'missing' script is too old or missing
    checking for a thread-safe mkdir -p... ./install-sh -c -d
    checking for gawk... no
    checking for mawk... no
    checking for nawk... no
    checking for awk... awk
    checking whether make sets $(MAKE)... no
    checking whether make supports nested variables... no
    checking for gcc... no
    checking for cc... no
    checking for cl.exe... no
    configure: error: in `/Users/paulsingman/Downloads/chadwick-0.6.3':
    configure: error: no acceptable C compiler found in $PATH
    See `config.log' for more details
    hnt:chadwick-0.6.3 paulsingman$

    Any help you can offer on this?

  2. I figured out my problem, had to download Xcode.

  3. I also have a problem when I get to step 6. Changing the directory works fine, but then I get “-bash: ./configure: No such file or directory”
    I should also say that I have very little idea what any of these steps mean, so please don’t assume any knowledge on my part.

    1. Any solution on the problem with ./configure? I am getting the same error?

      1. The problems people have tend to be with installing the chadwick files on a Mac. I just followed the instructions that were given in that particular web page and things worked fine. Once you have the chadwick files, my R functions should work.

      2. I think I figured out that my issue was that I had downloaded the wrong file. If I remember correctly, I downloaded chadwick-0.6.3.zip, rather than chadwick-0.6.3.tar.gz

  4. Andrew, why don’t you send your R code to me at albertcb1@gmail.com and I can probably figure out your problem. Jim

    1. Great, thank you so much. Just sent the email.

  5. I’m having trouble with a “fields.csv” error. The file is not anywhere to be found… ps total newbie but having a blast learning!!!
    > d2013 <- compute.runs.expectancy(2013)
    Error in file(file, "rt") : cannot open the connection
    In addition: Warning message:
    In file(file, "rt") :
    cannot open file 'all2013.csv': No such file or directory
    Called from: read.csv(data.file, header = FALSE)

    1. In order for that compute.runs.expectancy function to work, you need to have the all2013.csv
      and fields.csv files in the current working directory. The parse.retrosheet2.pbp function
      downloads the data from Retrosheet and creates the all2013.csv file. Maybe you have the
      right files, but R is looking in the wrong place.

      1. Thanks, I have the all2013.csv and roster2013.csv. No fields.csv… I used the parse.retrosheet2.pbp function to get the data and set the wd to the right place. The only thing I’m missing is the fields.csv.

    1. Thanks! That worked, I promptly noticed that the link is in the article after your last response… PS, I love your book “Teaching Statistics using Baseball” just got a copy a few days ago, good stuff.

  6. I download chadwick and have the chadwick file in my working directory. I use source to direct to my parse.retrosheet2.pbp as well. However, I got this error:
    Error in download.file(url = paste("http://www.retrosheet.org/events", :
    cannot open destfile ‘download.folder/zipped/2013eve.zip’, reason ‘No such file or directory’

    I think the url is wrong. Can you please give me some advice to overcome this error?

    1. Garret, it seems that this is a directory issue. Check that in your current working directory, you have a download.folder, and inside that folder you have a zipped folder. Typically, I’m not in the right directory when I get an error message like that. Otherwise (assuming you have your chadwick program installed), it should work fine. In fact, I just downloaded the 2014 season data today.
