One of the most popular posts in this blog has been the one giving instructions how to download the Retrosheet play-by-play files. Unfortunately, some people have struggled using the R functions that we describe in our text and this particular post. So I thought it would be helpful to describe a simple method of getting these Retrosheet files into R. Also I wrote a short package that facilitates computing the run values and associated win probabilities for all plays in a Retrosheet dataset. Once you have the Retrosheet data with the run values and win probabilities, you can do a lot of interesting explorations. Below I illustrate exploring the win probabilities of home runs during the 2018 season.
Downloading the Retrosheet Play-by-Play Data:
Here’s a simpler method. First, one double-clicks on the Retrosheet page to download a zip file containing all of the files for a particular season. (For the 2018 season, there will be 30 files in this compressed file, one corresponding to the home games for each team.) After you unzip the archive, then you run a Chadwick program at the Terminal level (type a single line) to put all of the data into a single csv file. This csv file can then be directly read into R by say the read_csv() function in the readr package.
I describe this process at the page below:
Computing Run Expectancies and Win Probabilities
After one has downloaded the Retrosheet data, the next useful step is to compute the run expectancies and win probabilities for all plays. I made some small revisions to my R functions and put them in a new R package WinProbability. (Some description of my methods for computing the win probabilities can be found on the Part I and Part II posts.) Assuming you have downloaded Retrosheet data for a single season, one function in this package will add a header with variable names and compute the run expectancies. Another function will add the win probabilities to the Retrosheet dataset, and a third function will graph the win probabilities for a specific game of interest.
Here’s a description of installing the WinProbability package and doing these calculations:
Value of Home Runs?
To illustrate some work with the 2018 Retrosheet data, let’s explore the value of home runs hit in the 2018 season. In the runs expectancy chapter of our book, we looked at the run value of home runs — one takeaway is that the average run value of a home run is only about 1.4. Perhaps a more relevant measure of the value of a home run is the WPA or win probability added. The WPA, or more precisely the absolute value of the WPA tells us the benefit of the home run (that is, the increase in the team’s win probability) towards the ultimate goal of winning the game for the player’s team. Let’s explore the distribution of WPA across all home run hitters for the 2018 season
A reasonable graph is a scatterplot of the home run count (horizontal axis) against the average values of abs(WPA) (vertical axis) for all 2018 players. I’ve labelled some interesting points in this graph.
- Brandon Phillips and Raimel Tapia each had only one home run in 2018, but these specific home runs really had an impact. Red Sox fan readers might recall Phillips’ dramatic 9th inning two-run home run in the Sox’ 9-8 win over the Braves on September. Likewise, Ramiel Tapia’s had a grand slam for the Rockies that led to their victory over the D-Backs. Each of these home run increased their team’s win probability by more than 40%.
- D.J. LeMahieu had only 15 home runs in 2018, but he seems to stand out with respect to the average abs WPA — his home runs increased his team’s win probability by over 20% on average. This means his home runs seemed to occur at important moments during his team’s games.
- We know Khris Davis bested J.D. Martinez with respect to the home run total (48 compared to 43). It is interesting to compare the values of these home runs. Below I display the mean run value and the mean abs(WPA) for both hitters.
What is interesting is that both players tended to average 1.5 run value per home run — a little above average. But Davis’ mean abs WPA is 0.033 higher than Martinez’s value. Let’s look at this more carefully by displaying parallel dotplots of the abs WPA values for the two players. Davis has more home runs than Martinez that increase his team’s probability of winning by 0.2 or higher.
- All of the R code for this exercise can be found on my GithubGist site.
- Can we conclude that Khris Davis is a “clutch” home run hitter, in the sense that he tends to hit his home runs during clutch situations? Actually, no. All I have demonstrated is that for the 2018 season, Davis’ home runs contributed more, on average, towards his team’s victories than other players such as J.D. Martinez. It might be better said that Davis was lucky in that he was given the opportunity to hit home runs in important situations.
- Now if one could show that Davis’s home runs consistently contributed more towards team wins than other home run hitters, that would be more interesting.
- To follow up this comments, it is easy to check if this same pattern held in the previous 2017 season. Okay, Davis also had a higher mean abs(WPA) value than Martinez in 2017, so this getting more interesting. (Actually Davis was also had a higher mean abs(WPA) value in the 2016 season.)
- But even if you could show Khris Davis’ is consistently clutch in this sense, I still wouldn’t be that excited by it. This reminds me of a Bill James’ statement that a situational effect is only meaningful if we understand the process that could cause this situational effect. In this setting, it would be hard to think of some reasoning or rationale that would cause Davis to more likely hit home runs in important situations.