Retrosheet Package, Part 2
Brian introduced the retrosheet package last week. This week, I’ll explore further, looking at the capability of this package to download play-by-play Retrosheet data.
First I load the relevant packages I will be using.
library(retrosheet)
library(Lahman)
library(dplyr)
Downloading the Retrosheet Play-by-Play Data
The getRetrosheet function in the retrosheet package will download all of the play-by-play data for a particular home team for a particular season using the “play” argument. For example, if I want all of the plays for the games played in Philadelphia during 1960, I would type
getRetrosheet("play", 1960, "PHI")
In my work, I typically want the Retrosheet data for all plays in a season. So I illustrate the use of a wrapper function below to accomplish this.
Towards this goal, I need a listing of the abbreviations of all 30 teams. I get this list by downloading the 2013 schedule (using the getRetrosheet function with the “schedule” argument) and saving the unique values of VisTeam.
S <- getRetrosheet("schedule", 2013)
Teams <- unique(S$VisTeam)
I write a wrapper function that downloads the play-by-play data for a specific team’s home games. The output of getRetrosheet is a list of (typically) 81 elements, one per home game, where each element holds the Retrosheet data for that game. Here we are primarily interested in the play component, and the function returns the play information for all of these games, adding the game id to the data frame. By the way, the do.call function with rbind is helpful for combining the play-by-play data across games.
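get_team_plays <- function(team){
  # all 2013 home games for this team, one list element per game
  P <- getRetrosheet("play", 2013, team)
  # attach the game id to the plays of game j
  get_plays <- function(j) data.frame(Game=P[[j]]$id[1], P[[j]]$play)
  # stack the per-game data frames into one
  do.call("rbind", lapply(seq_along(P), get_plays))
}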
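The combining step can be tried without downloading anything. The following is a minimal sketch in base R, assuming each game is a list with an id and a play component as described above; the game ids, batter ids, and the names games and all_toy are made up for illustration.

```r
# Toy stand-in for the list returned by getRetrosheet("play", ...):
# one element per game, each with an id and a table of plays (made-up values)
games <- list(
  list(id = "AAA201301010",
       play = data.frame(inning = c(1, 1),
                         retroID = c("bat001", "bat002"),
                         play = c("K", "S8/G"),
                         stringsAsFactors = FALSE)),
  list(id = "AAA201301020",
       play = data.frame(inning = 1,
                         retroID = "bat003",
                         play = "HR/78/F",
                         stringsAsFactors = FALSE))
)

# Attach the game id to every play of one game (the id is recycled per row)
get_plays <- function(g) {
  data.frame(Game = g$id[1], g$play, stringsAsFactors = FALSE)
}

# Stack the per-game data frames into a single data frame
all_toy <- do.call(rbind, lapply(games, get_plays))
all_toy
```

The result has one row per play, with the Game column identifying which game each play came from.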
Last, I use another application of do.call to run the get_team_plays function for all thirty teams. The variable all_plays will contain the baseball plays for the entire 2013 season.
all_plays <- do.call("rbind", lapply(Teams[1:30], get_team_plays))
head(all_plays)
          Game inning team  retroID count  pitches   play
1 LAN201304010      1    0 pagaa001    01       CX    4/P
2 LAN201304010      1    0 scutm001    32   BCBBCX    5/P
3 LAN201304010      1    0 sandp001    10       BX  S18/G
4 LAN201304010      1    0 poseb001    22    BBFSB WP.1-2
5 LAN201304010      1    0 poseb001    32 BBFSB.>C      K
6 LAN201304010      1    1 crawc002    31    CBBBX  S34/G
Looking at the display of all_plays, we see that for each play, we have the game id, the inning, the team (home or visitor), the batter id, the final pitch count of the plate appearance, the sequence of pitches (and other events like a stolen base or a wild pitch), and the code for the play. This is more concise than the large number of variables that one can get using the Chadwick programs. For example, we don’t have the number of outs, the running score, or the identity of any runners on base here. But I suppose one could write a function that creates these additional variables from the specific plays.
Exploring Mike Trout
Let’s use this data to explore Mike Trout’s plate appearances. I find Trout’s Retrosheet id by using the Master data frame in the Lahman package. I create a new data frame trout_pa containing all plays when Trout was the batter. I also create two new variables: pitch_seq is the pitch sequence (with non-pitches removed), and num_pitches is the number of pitches in the plate appearance.
trout_id <- as.character(select(filter(Master, nameFirst=="Mike", nameLast=="Trout"), retroID))
trout_pa <- filter(all_plays, retroID==trout_id)
trout_pa <- mutate(trout_pa,
                   pitch_seq=gsub("[.>123N+*]", "", pitches),
                   num_pitches=nchar(pitch_seq))
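As a quick check of the cleaning step, the same gsub pattern can be applied to the pitch strings that appear in the head(all_plays) display earlier. This is only a sketch in base R; the vector name sample_pitches is made up for illustration.

```r
# Pitch strings copied from the first rows of all_plays shown earlier
sample_pitches <- c("CX", "BCBBCX", "BX", "BBFSB", "BBFSB.>C")

# Drop the non-pitch characters matched by the pattern used in mutate()
pitch_seq <- gsub("[.>123N+*]", "", sample_pitches)

# Count the pitches that remain in each sequence
num_pitches <- nchar(pitch_seq)

data.frame(sample_pitches, pitch_seq, num_pitches)
```

Only the last string changes: the "." and ">" markers are removed, leaving six pitches.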
By the way, this Trout data frame has 736 rows, and we know (from Baseball Reference) that Trout had only 716 PA in the 2013 season. Our data frame contains extra events like stolen bases, wild pitches, etc. and we’d need to remove these (by use of some text manipulation functions) if we wanted a new data frame with only Trout’s PA’s.
Trout’s 2013 Home Runs
Let’s look at Trout’s 2013 home runs. We select the rows where “HR” is contained in the play description.
trout_hr <- filter(trout_pa, grepl('HR', play) == TRUE)
Suppose we are interested in the location and type (line drive or flyball) of his home runs. We use the strsplit function to split the play information by the “/” symbol and put this information about the home runs in the matrix HR_info.
HR_info <- matrix(unlist(strsplit(as.character(trout_hr$play), "/")), dim(trout_hr)[1], 3, byrow=TRUE)
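The matrix construction assumes that every home run play splits into exactly three pieces. A slightly more defensive sketch, in base R, picks the pieces by position instead. The play strings and the names plays, location, and type below are made up for illustration and are not Trout's actual data.

```r
# Made-up home run play codes in the event/location/type layout
plays <- c("HR/78/F", "HR/8/F", "HR/7/L", "HR/9/F.1-H")

# Split each play on the literal "/" character
parts <- strsplit(plays, "/", fixed = TRUE)

# Second piece is the location; first character of the third piece is the type
location <- vapply(parts, function(p) p[2], character(1))
type <- substr(vapply(parts, function(p) p[3], character(1)), 1, 1)

table(location)
table(type)
```

With this approach a play that has fewer than three pieces yields NA rather than shifting the remaining values into the wrong column.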
Here is a tally of the locations of his home runs.
table(HR_info[, 2])
 7 78  8 89  9
 4  6 12  3  2
We see that Trout hit the largest share of his home runs towards center field (8) and was a little more likely to hit to left field (7) than to right field (9).
Here is a tally of the type of home runs.
table(substr(HR_info[, 3], 1, 1))
 F  L
23  4
We see that most of his home runs were flyballs (F) — only a few were line drives (L).
Next, how frequently did Trout swing and miss in the 2013 season? First I remove the SB, CS, WP events from the data frame. I create a new data frame trout_batting which removes these events and where the number of pitches he sees is at least one. I use the strsplit function to split the pitch sequence into separate characters.
event <- grepl('SB|CS|WP', trout_pa$play) & (! grepl('B-1', trout_pa$play))
trout_batting <- filter(trout_pa, event==FALSE, num_pitches > 0)
pitches <- unlist(strsplit(as.character(trout_batting$pitch_seq), ""))
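To see what the event filter keeps and drops, here is a small sketch in base R on a handful of made-up play codes. The names plays and kept are hypothetical. A code containing B-1 (the batter reaching first) is kept even when it also mentions a wild pitch.

```r
# Made-up play codes: batting outcomes mixed with running events
plays <- c("S8/G", "SB2", "WP.1-2", "K+WP.B-1", "CS2(26)", "K")

# Flag stolen bases, caught stealing, and wild pitches,
# unless the batter also reached first base on the play
event <- grepl("SB|CS|WP", plays) & !grepl("B-1", plays)

# Keep only the rows that are not flagged
kept <- plays[!event]
kept
```

Here the single, the strikeout with the batter reaching on a wild pitch, and the plain strikeout survive the filter.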
Last, I find a frequency table of the pitches and divide the number of swinging strikes (S) by the total number of swings: swinging strikes (S), foul balls (F or T), and balls put in play (X).
TP <- table(pitches)
TP
pitches
   B    C    F    H    I    S    T    X
1238  599  432    9   39  196   19  461
TP["S"] / sum(TP["S"] + TP["F"] + TP["T"] + TP["X"])
        S
0.1768953
We see that Trout missed about 17.7% of his swings.
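The arithmetic can be reproduced directly from the counts in the frequency table. This is a small sketch in base R; the names swings and whiff_rate are introduced here for illustration.

```r
# Pitch counts copied from the frequency table above
TP <- c(B = 1238, C = 599, F = 432, H = 9, I = 39, S = 196, T = 19, X = 461)

# Total swings: swinging strikes, fouls (F and T), and balls in play
swings <- sum(TP[c("S", "F", "T", "X")])

# Share of swings that were misses
whiff_rate <- unname(TP["S"] / swings)
round(whiff_rate, 3)
```

That is 196 misses out of 1108 swings.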
Summing Up
I don’t have much experience with the retrosheet package yet, but it certainly simplifies the process of accessing Retrosheet data. I suspect that there might be some issues in its use (for example, I got an error message when I tried downloading 2014 play-by-play data for particular teams), and the variables that are provided might be a bit brief for a given application. But easier data access (through packages like Lahman and retrosheet) hopefully will encourage more people to do their own baseball studies.