Monthly Archives: November, 2015

Retrosheet Package, Part 2

Brian introduced the retrosheet package last week. This week, I’ll explore further, looking at the capability of this package to download play-by-play Retrosheet data.

First I load the relevant packages I will be using.

library(retrosheet)
library(Lahman)
library(dplyr)

Downloading the Retrosheet Play-by-Play Data

The getRetrosheet function in the retrosheet package will download all of the play-by-play data for a particular home team for a particular season using the “play” argument. For example, if I want all of the plays for the games played in Philadelphia during 1960, I would type

getRetrosheet("play", 1960, "PHI")

In my work, I typically want the Retrosheet data for all plays in a season. So I illustrate the use of a wrapper function below to accomplish this.

Towards this goal, I need a listing of the abbreviations of all 30 teams. I get this list by downloading the 2013 schedule (by use of the getRetrosheet function with the “schedule” argument) and saving the unique values of VisTeam.

S <- getRetrosheet("schedule", 2013)
Teams <- unique(S$VisTeam)

I write a wrapper function that will download the play-by-play data for a specific team’s home games. The output of getRetrosheet is a list of (typically) 81 elements, where each element corresponds to the Retrosheet data for a particular game. (Each team typically has 81 home games during the season.) Here we are primarily interested in the play component and this function will output the play information for all 81 games, adding the game id information to the data frame. By the way, the do.call function with rbind is helpful for combining the play-by-play data for all 81 games.

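get_team_plays <- function(team){
  # Download every 2013 home game for this team (one list element per game)
  P <- getRetrosheet("play", 2013, team)
  # Attach the game id to each game's play-by-play records
  get_plays <- function(j) data.frame(Game=P[[j]]$id[1], P[[j]]$play)
  # Stack the per-game data frames into one
  do.call("rbind", lapply(seq_along(P), get_plays))
}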
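
To see what the do.call and rbind pattern is doing without downloading anything, here is a minimal sketch on a made-up list that mimics the shape described above (each element has an id and a play component). The values and the names toy_games and stack_plays are invented for illustration; the real getRetrosheet output has more components than this.

```r
# Toy stand-in for the list returned by getRetrosheet("play", ...):
# one element per game. All values below are made up for illustration.
toy_games <- list(
  list(id = "AAA201304010",
       play = data.frame(inning = c(1, 1),
                         retroID = c("bat001", "bat002"),
                         play = c("4/P", "S8/G"),
                         stringsAsFactors = FALSE)),
  list(id = "AAA201304020",
       play = data.frame(inning = 1,
                         retroID = "bat003",
                         play = "K",
                         stringsAsFactors = FALSE))
)

# Tag each game's plays with its id, then stack the per-game data frames.
stack_plays <- function(games) {
  per_game <- lapply(games, function(g) {
    data.frame(Game = g$id[1], g$play, stringsAsFactors = FALSE)
  })
  do.call("rbind", per_game)
}

toy_plays <- stack_plays(toy_games)
toy_plays
```

The result has one row per play (three here), with the Game column repeated for every play from the same game.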

Last, I use another application of do.call to run the get_team_plays function for all thirty teams — the variable all_plays will contain the baseball plays for the entire 2013 season.

all_plays <- do.call("rbind", lapply(Teams[1:30], get_team_plays))
head(all_plays)
          Game inning team  retroID count  pitches   play
1 LAN201304010      1    0 pagaa001    01       CX    4/P
2 LAN201304010      1    0 scutm001    32   BCBBCX    5/P
3 LAN201304010      1    0 sandp001    10       BX  S18/G
4 LAN201304010      1    0 poseb001    22    BBFSB WP.1-2
5 LAN201304010      1    0 poseb001    32 BBFSB.>C      K
6 LAN201304010      1    1 crawc002    31    CBBBX  S34/G

Looking at the display of all_plays, we see that for each play, we have the game id, the inning, the team (home or visitor), the batter id, the final pitch count of the plate appearance, the sequence of pitches (and other events like a stolen base or a wild pitch), and the code for the play. This is more concise than the large number of variables that one can get using the Chadwick programs. For example, we don’t have the number of outs, the running score, or the identity of any runners on base here. But I suppose one could write a function that creates these additional variables from the specific plays.

Exploring Mike Trout

Let’s use this data to explore Mike Trout’s plate appearances. I find Trout’s retrosheet id by use of the Master data frame in the Lahman package. I create a new data frame trout_pa containing all plays when Trout was the batter. Also I create two new variables — pitch_seq is the pitch sequence (with non-pitches removed), and num_pitches is the number of pitches in the plate appearance.

trout_id <- as.character(select(filter(Master, 
              nameFirst=="Mike", nameLast=="Trout"), retroID))
trout_pa <- filter(all_plays, retroID==trout_id)
trout_pa <- mutate(trout_pa,   
              pitch_seq=gsub("[.>123N+*]", "", pitches),
              num_pitches=nchar(pitch_seq))
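
As a quick check of what the gsub step does, here is a small sketch applied to three pitch strings in the style of the head(all_plays) display. The helper name strip_non_pitches is mine, introduced only for this illustration; the character class is the same one used above.

```r
# Remove non-pitch characters, using the same character class as above.
strip_non_pitches <- function(pitches) gsub("[.>123N+*]", "", pitches)

# Pitch strings in the style of the head(all_plays) display.
example_pitches <- c("CX", "BCBBCX", "BBFSB.>C")
cleaned <- strip_non_pitches(example_pitches)

data.frame(pitches = example_pitches,
           pitch_seq = cleaned,
           num_pitches = nchar(cleaned))
```

The third string loses its "." and ">" markers, so it becomes BBFSBC and counts as six pitches.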

By the way, this Trout data frame has 736 rows, and we know (from Baseball Reference) that Trout had only 716 PA in the 2013 season. Our data frame contains extra events like stolen bases, wild pitches, etc. and we’d need to remove these (by use of some text manipulation functions) if we wanted a new data frame with only Trout’s PA’s.
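
The cleanup described above could be sketched along the following lines. The list of event codes is my assumption about which Retrosheet plays do not end a plate appearance, and is_plate_appearance is a hypothetical helper, not part of the retrosheet package. I have not checked that it reduces the 736 rows to exactly 716.

```r
# Play codes assumed NOT to end a plate appearance (an assumption based on
# the Retrosheet event-file conventions): stolen base, caught stealing,
# wild pitch, passed ball, balk, defensive indifference, other advance,
# pickoff (PO also covers POCS), error on a foul fly, and no play.
non_pa_pattern <- "^(SB|CS|WP|PB|BK|DI|OA|PO|FLE|NP)"

# TRUE when the play code does not start with one of those events.
is_plate_appearance <- function(play) !grepl(non_pa_pattern, play)

# Made-up play codes for illustration.
toy <- data.frame(play = c("S8/G", "WP.1-2", "K", "SB2",
                           "HR/8/F", "K+SB2", "POCS2(24)"),
                  stringsAsFactors = FALSE)

toy[is_plate_appearance(toy$play), , drop = FALSE]
```

Anchoring the pattern at the start of the code keeps a play such as K+SB2, where a strikeout and a stolen base happen on the same pitch, as a plate appearance.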

Trout’s 2013 Home Runs

Let’s look at Trout’s 2013 home runs. We select the rows where “HR” is contained in the play description.

trout_hr <- filter(trout_pa, grepl('HR', play) == TRUE)

Suppose we are interested in the location and type (line drive or flyball) of his home runs. We use the strsplit function to split the play information by the “/” symbol and put this information about the home runs in the matrix HR_info.

HR_info <- matrix(unlist(strsplit(as.character(trout_hr$play), "/")),
       dim(trout_hr)[1], 3, byrow=TRUE)

Here is a tally of the locations of his home runs.

table(HR_info[, 2])
 7 78  8 89  9
 4  6 12  3  2

We see that Trout hit more home runs to center field (8) than to any other location (12 of 27), and he was somewhat more likely to hit to the left side (7 and 78) than to the right side (89 and 9).

Here is a tally of the type of home runs.

table(substr(HR_info[, 3], 1, 1))
 F  L
23  4

We see that most of his home runs were flyballs (F) — only a few were line drives (L).
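
The matrix step above relies on every home run play splitting into exactly three pieces. A slightly more defensive variant, sketched here on made-up play codes, picks the pieces out by position and fills in NA when one is missing. The function parse_hr and the toy codes are my own illustration, not something from the retrosheet package.

```r
# Split each play on "/" and pull out pieces by position, padding with NA
# when a piece is absent, so plays with fewer "/" separators still parse.
parse_hr <- function(play) {
  parts <- strsplit(as.character(play), "/", fixed = TRUE)
  piece <- function(p, i) if (length(p) >= i) p[i] else NA_character_
  data.frame(
    location = vapply(parts, piece, character(1), i = 2),
    type     = substr(vapply(parts, piece, character(1), i = 3), 1, 1),
    stringsAsFactors = FALSE
  )
}

# Made-up play codes for illustration.
toy_hr <- c("HR/8/F", "HR/78/L.1-H", "HR/9")
hr_parsed <- parse_hr(toy_hr)
hr_parsed
```

Calling table(hr_parsed$location) and table(hr_parsed$type) then gives the same kind of tallies as above.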

Next, how frequently did Trout swing and miss in the 2013 season? First I flag the SB, CS, and WP events, leaving alone any play where the batter advances to first (B-1). I create a new data frame trout_batting that drops the flagged events and keeps only the rows where the number of pitches he sees is at least one. I use the strsplit function to split the pitch sequence into separate characters.

event <- grepl('SB|CS|WP', trout_pa$play) & 
         (! grepl('B-1', trout_pa$play))
trout_batting <- filter(trout_pa, 
                    event==FALSE,
                    num_pitches > 0)
pitches <- unlist(strsplit(as.character(trout_batting$pitch_seq), ""))

Last, I find a frequency table of the pitches and divide the number of swinging strikes (S) by the total number of swings: swinging strikes (S), foul balls (F or T), and balls put in play (X).

TP <- table(pitches)
TP
pitches
   B    C    F    H    I    S    T    X
1238  599  432    9   39  196   19  461
TP["S"] / sum(TP[c("S", "F", "T", "X")])
S
0.1768953

We see that Trout missed about 17.7% of his swings.
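
The same arithmetic can be wrapped in a small function so it is easy to reuse for another batter. This is only a sketch: whiff_rate is a name I made up, and it assumes the pitch codes mean what they were taken to mean above (S a swinging strike, F and T fouls, X a ball in play).

```r
# Swing-and-miss rate: swinging strikes divided by all swings.
# Assumes S = swinging strike, F = foul, T = foul tip, X = ball in play.
whiff_rate <- function(pitch_counts, swing_codes = c("S", "F", "T", "X")) {
  unname(pitch_counts["S"] / sum(pitch_counts[swing_codes]))
}

# Counts copied from the frequency table above.
counts <- c(B = 1238, C = 599, F = 432, H = 9, I = 39,
            S = 196, T = 19, X = 461)

whiff_rate(counts)   # 196 / 1108
```

With these counts there are 196 + 432 + 19 + 461 = 1108 swings, and 196 / 1108 gives the 17.7% figure.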

Summing Up

I don’t have much experience with the retrosheet package yet, but it certainly simplifies the process of accessing Retrosheet data. I suspect that there might be some issues in its use (for example, I got an error message when I tried downloading 2014 play-by-play data for particular teams), and the variables that are provided might be a bit brief for a given application. But easier data access (by use of packages like Lahman and retrosheet) hopefully will encourage more people to do their own baseball studies.
