In our book, Max and I describe the process of downloading Retrosheet play-by-play data (Appendix A.1) and computing run values of all events (Chapter 5). Here we illustrate some updated functions for downloading the data and computing the run values.
Downloading the Data
Our book assumes that you have a Windows environment, but now I use a Mac laptop and so I was motivated to adapt our downloading instructions for a Mac.
1. If you have a Mac, you need to install the Chadwick files. Here is an excellent description of how to install the Chadwick Software on a Mac.
2. In the current working directory, create a folder “download.folder” containing two subfolders, “unzipped” and “zipped”. From the book’s script and data web site, download the file fields.csv and put it in the “unzipped” folder. (On a Windows computer, you also need to have the Chadwick cwevent.exe inside the “unzipped” folder.)
3. I updated a function parse.retrosheet2.pbp (a slight modification of the one provided in our book) that downloads the Retrosheet play-by-play and roster files for a particular season, uses the Chadwick program to extract the data, and creates two data files. What’s new is that the function works on both Mac and Windows. You can see the function here; to read it into R, type:
library(devtools)
source_gist(8892981) # reads in parse.retrosheet2.pbp
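The folder layout from step 2 can also be created from within R. This is only a minimal sketch in base R: the helper name make_download_folders is hypothetical (it is not part of the book's scripts), and the folder names are the ones given in step 2. The fields.csv file (and cwevent.exe on Windows) still has to be copied into the “unzipped” folder by hand.

```r
# Hypothetical helper (not from the book): create the folder layout
# described in step 2 under a chosen base directory.
make_download_folders <- function(base = getwd()) {
  root <- file.path(base, "download.folder")
  for (sub in c("unzipped", "zipped")) {
    # recursive = TRUE also creates download.folder itself if it is missing
    dir.create(file.path(root, sub), recursive = TRUE, showWarnings = FALSE)
  }
  # TRUE for each of the two subfolders that now exists
  dir.exists(file.path(root, c("unzipped", "zipped")))
}

# To use it in the current working directory:
# make_download_folders()
```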
Computing Run Values
1. In Chapter 5, I describe how to compute the run values for all plays. I have put all of the R code into a new function compute.runs.expectancy, which is found here and can be sourced into R:
source_gist(8892999) # reads in compute.runs.expectancy function
2. Now we’re ready to download, say, all of the 2013 season play-by-play data and compute the run values.
# reads in 2013 retrosheet files, creating two new csv files
parse.retrosheet2.pbp(2013)
# move to folder containing all2013.csv, roster2013.csv files
setwd("download.folder/unzipped")
# computes runs values and other variables for all states
d2013 <- compute.runs.expectancy(2013)
3. The data frame d2013 contains all of the 2013 play data. I wrote a short function runs.expectancy to compute the expected runs in the remainder of the inning for each of the 24 bases/outs situations.
runs.expectancy <- function(data){
  # mean runs scored in the remainder of the inning, by bases/outs state
  RUNS <- with(data, aggregate(RUNS.ROI, list(STATE), mean))
  # aggregate() names the grouping column Group.1; the 5th character of STATE is the number of outs
  RUNS$Outs <- substr(RUNS$Group.1, 5, 5)
  RUNS <- RUNS[order(RUNS$Outs), ]
  # arrange as 8 runner configurations (rows) by 3 out counts (columns)
  RUNS.out <- matrix(round(RUNS$x, 2), 8, 3)
  dimnames(RUNS.out)[[2]] <- c("0 outs", "1 out", "2 outs")
  dimnames(RUNS.out)[[1]] <- c("000", "001", "010", "011", "100", "101", "110", "111")
  RUNS.out
}
# illustrate this for 2013 play-by-play data
runs.expectancy(d2013)
    0 outs 1 out 2 outs
000   0.46  0.24   0.09
001   1.31  0.92   0.35
010   1.09  0.62   0.30
011   2.00  1.37   0.55
100   0.82  0.50   0.21
101   1.80  1.11   0.49
110   1.39  0.84   0.40
111   2.17  1.56   0.72
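To show how this matrix feeds into a run value, here is a small sketch that is not part of the original functions. It assumes the Chapter 5 definition (run value = runs expectancy after the play, minus runs expectancy before the play, plus runs scored on the play) and assumes that the three digits of a row label flag runners on first, second, and third. The matrix is retyped from the 2013 values above; the helper run_value is hypothetical and only covers plays that do not end the inning.

```r
# 2013 run expectancy values retyped from the matrix printed above
# (rows: runner flags for 1st/2nd/3rd base; columns: number of outs)
RE2013 <- matrix(
  c(0.46, 0.24, 0.09,
    1.31, 0.92, 0.35,
    1.09, 0.62, 0.30,
    2.00, 1.37, 0.55,
    0.82, 0.50, 0.21,
    1.80, 1.11, 0.49,
    1.39, 0.84, 0.40,
    2.17, 1.56, 0.72),
  nrow = 8, byrow = TRUE,
  dimnames = list(c("000", "001", "010", "011", "100", "101", "110", "111"),
                  c("0 outs", "1 out", "2 outs")))

# Hypothetical helper: run value of a play that does not end the inning.
# Outs are 0, 1, or 2, so the matrix column is outs + 1.
run_value <- function(re, before, outs_before, after, outs_after, runs = 0) {
  unname(re[after, outs_after + 1] - re[before, outs_before + 1] + runs)
}

# Leadoff single: bases empty, 0 outs -> runner on first, 0 outs
run_value(RE2013, "000", 0, "100", 0)

# Solo home run with the bases empty and 1 out: state unchanged, 1 run scores
run_value(RE2013, "000", 1, "000", 1, runs = 1)
```

With these 2013 values, the leadoff single is worth about 0.82 - 0.46 = 0.36 runs, and the solo home run is worth exactly one run because the bases/outs state does not change.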
The 2013 expected runs are similar to those found from 2011 season data in Chapter 5 of the book. In the next post, I’ll use this play data frame to see which players last season were best at performing in the clutch.
Jim, I appreciate all your efforts to bring the layperson up to speed in creating his own Retrosheet database, and I’m definitely close but I got stuck installing the Chadwick files. I was following the pitch by pitch article you link to in step 1, but in step 6 of that article, when typing in ./configure to Terminal, it gave me:
hnt:~ paulsingman$ cd Downloads/chadwick-0.6.3
hnt:chadwick-0.6.3 paulsingman$ ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
/Users/paulsingman/Downloads/chadwick-0.6.3/missing: Unknown `--is-lightweight' option
Try `/Users/paulsingman/Downloads/chadwick-0.6.3/missing --help' for more information
configure: WARNING: 'missing' script is too old or missing
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... no
checking whether make supports nested variables... no
checking for gcc... no
checking for cc... no
checking for cl.exe... no
configure: error: in `/Users/paulsingman/Downloads/chadwick-0.6.3':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details
hnt:chadwick-0.6.3 paulsingman$
Any help you can offer on this?
I figured out my problem: I had to download Xcode.
I also have a problem when I get to step 6. Changing the directory works fine, but then I get “-bash: ./configure: No such file or directory”
I should also say that I have very little idea what any of these steps mean, so please don’t assume any knowledge on my part.
Any solution on the problem with ./configure? I am getting the same error?
The problems people have tend to be with installing the chadwick files on a Mac. I just followed the instructions that were given in that particular web page and things worked fine. Once you have the chadwick files, my R functions should work.
I think I figured out that my issue was that I had downloaded the wrong file. If I remember correctly, I downloaded chadwick-0.6.3.zip, rather than chadwick-0.6.3.tar.gz
Andrew, why don’t you send your R code to me at albertcb1@gmail.com and I can probably figure out your problem. Jim
Great, thank you so much. Just sent the email.
I’m having trouble with an “fields.csv” error. The file is not anywhere to be found… ps total newbie but having a blast learning!!!
> d2013 <- compute.runs.expectancy(2013)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'all2013.csv': No such file or directory
Called from: read.csv(data.file, header = FALSE)
In order for that compute.runs.expectancy function to work, you need to have the all2013.csv and fields.csv files in the current working directory. The parse.retrosheet2.pbp function downloads the data from Retrosheet and creates the all2013.csv file. Maybe you have the right files, but R is looking in the wrong place.
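One quick way to check this from R is sketched below. The helper missing_files is hypothetical (it is not one of the functions from the post); it only assumes the two file names mentioned above, all<season>.csv and fields.csv.

```r
# Hypothetical helper: report which of the required files are missing
# from a folder (by default, the current working directory).
missing_files <- function(season, dir = getwd()) {
  needed <- c(paste0("all", season, ".csv"), "fields.csv")
  needed[!file.exists(file.path(dir, needed))]
}

getwd()              # the folder where R is currently looking
missing_files(2013)  # character(0) means both files were found
```

If a file name comes back, either move that file into the folder shown by getwd() or use setwd() to point R at the folder that holds it.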
Thanks, I have the all2013.csv and roster2013.csv. No fields.csv… I used the parse.retrosheet2.pbp function to get the data and set the wd to the right place. The only thing I’m missing is the fields.csv.
You can get the fields.csv file from https://github.com/maxtoki/baseball_R/tree/master/data
Thanks! That worked, I promptly noticed that the link is in the article after your last response… PS, I love your book “Teaching Statistics using Baseball” just got a copy a few days ago, good stuff.
I download chadwick and have the chadwick file in my working directory. I use source to direct to my parse.retrosheet2.pbp as well. However, I got this error:
Error in download.file(url = paste("http://www.retrosheet.org/events", :
cannot open destfile 'download.folder/zipped/2013eve.zip', reason 'No such file or directory'
I think the url is wrong. Can you please give me some advice to overcome this error?
Garret, it seems that this is a directory issue. Check that in your current working directory, you have a download.folder, and inside that folder you have a zipped folder. Typically, I’m not in the right directory when I get an error message like that. Otherwise (assuming you have your chadwick program installed), it should work fine. In fact, I just downloaded the 2014 season data today.
I don’t know if you will see this, as it is almost 4 years later, but I’m getting the same error. I changed my directories with the download.folder (and others) multiple times just to be sure. It’s driving me insane that I don’t know why it won’t work.
What happens if I want to download more than one season at a time?
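One possible approach, sketched here under assumptions rather than taken from the post, is to call the one-season function in a loop. The wrapper parse_seasons is hypothetical; its parser argument is assumed to be parse.retrosheet2.pbp, already sourced as described in the post. Since that function names its output files by season (for example all2013.csv and roster2013.csv), the seasons should not overwrite one another.

```r
# Hypothetical wrapper: call a one-season parser once for each season.
# `parser` is assumed to be parse.retrosheet2.pbp, already sourced.
parse_seasons <- function(seasons, parser) {
  for (season in seasons) {
    message("Parsing season ", season)
    parser(season)
  }
  invisible(seasons)
}

# Example call, run from the folder that contains download.folder:
# parse_seasons(2010:2013, parse.retrosheet2.pbp)
```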
Hello Jim, I am sorry to be asking about this so many years after your post, but I am having the following issue whenever I try to parse any season.
> parse.retrosheet2.pbp(2018)
trying URL 'http://www.retrosheet.org/events/2018eve.zip'
Content type 'application/zip' length 2490677 bytes (2.4 MB)
downloaded 2.4 MB
Chadwick expanded event descriptor, version 0.7.0
Type 'cwevent -h' for help.
Copyright (c) 2002-2017
Dr T L Turocy, Chadwick Baseball Bureau (ted.turocy@gmail.com)
This is free software, subject to the terms of the GNU GPL license.
[Processing file 2018*.EV*.]
Warning: could not open file '2018*.EV*'
>
After this, the ‘roster.csv’ is created without any issue, but ‘all.csv’ comes up empty.
Thanks in advance!
Eliso:
Sorry, I’m not sure what the problem is. It appears that you have downloaded the Chadwick files. I would check that you have actually downloaded the Retrosheet event files; when they are unzipped, you should have 30 files in the unzipped folder. At the command line, you can check whether you are able to process a single event file with the cwevent program.
If you want to see a sample Retrosheet season file, I have several in the folder http://www-math.bgsu.edu/~albert/retrosheet/ (these are in Rdata format).
Good luck.
Jim
An update: I found that downgrading from Chadwick v0.7 to v0.6 did the trick. Apparently, v0.6 supports wildcard expansion in command arguments, but v0.7 does not?
Hey Eliseo: I found that by downgrading my version of Chadwick from v0.7 to v0.6, I fixed my issue. Hope this helps!
Hello guys, I’ve just tried what John suggested and it also did the trick for me. Thanks a lot!
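If the cause really is wildcard expansion, as guessed above, an untested alternative to downgrading might be to expand the file names in R and pass them to cwevent explicitly. The sketch below only builds the command string. The helper build_cwevent_command is hypothetical, the -y and -f 0-96 options are assumed to match what the book's parsing function passes to cwevent, and the file pattern assumes Retrosheet's usual names such as 2018ANA.EVA and 2018ATL.EVN.

```r
# Hypothetical helper: build a cwevent command that lists every event file
# by name instead of passing the pattern 2018*.EV* to the program.
build_cwevent_command <- function(season, dir = ".") {
  # event files are assumed to be named <season><TEAM>.EVA or .EVN
  files <- list.files(dir, pattern = paste0("^", season, ".*\\.EV[AN]$"))
  if (length(files) == 0) stop("no event files found for season ", season)
  paste("cwevent -y", season, "-f 0-96",
        paste(files, collapse = " "),
        ">", paste0("all", season, ".csv"))
}

# Example, run from inside the unzipped folder:
# cmd <- build_cwevent_command(2018)
# system(cmd)   # on Windows, shell(cmd) may be needed for the redirection
```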
Am wanting to apply the Sosa/McGwire exercise to look at Rickey Henderson’s 1982 season and his thievery. When working with the function I get the all1982.csv file generated ok but then it stalls and gives me the below error message. Any thoughts?
Error in write_csv(., path = file.path("retrosheet", "unzipped", paste0("roster", :
could not find function "write_csv"
Rob, it appears that you don’t have the readr package loaded that has the write_csv() function. Jim
That was it. You are a saint, Jim. Thanks.