Monthly Archives: December, 2015

Introduction to openWAR Package, Part 1

Ben Baumer and Greg Matthews have a relatively new R package, openWAR, that facilitates the calculation of WAR based on the methodology from their recently published paper in the Journal of Quantitative Analysis of Sports. See their GitHub site for instructions on installing the package.

One attractive feature of the openWAR package is that it allows for downloading of MLBAM GameDay play-by-play data. Recently, I illustrated the use of the retrosheet package to download similar data from Retrosheet. We’ll see that the data frames that we get from MLBAM GameDay contain many more variables, which makes it easier to develop interesting analyses.

In this post, I download play-by-play data, illustrate run expectancy calculations, and use the MLBAM GameDay data to show the locations of in-play events for specific players.

Downloading MLBAM data

Once you have loaded the openWAR package, downloading data is remarkably easy. For example, to download all GameDay data for the games in May 2015, I use the getData function with start and end dates.

library(openWAR)
may <- getData(start="2015-05-01", end="2015-05-31")

The output is a data frame with 62 variables where each row corresponds to a plate appearance.

To get data for the entire season, I called getData once for each month and then combined the monthly data frames into a single data frame d2015 for the whole season. (There were some problems downloading data for specific games; when I was done I had data for 2414 games, which represented over 99 percent of all games.)
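The month-by-month download can be sketched as follows. This is only a sketch under stated assumptions: month_ranges and download_season are hypothetical helpers (they are not part of openWAR), the fetch argument stands in for getData, and a month that fails to download is simply skipped with a message rather than retried game by game.

```r
# Hypothetical helper: first and last day of each requested month.
month_ranges <- function(year, months) {
  starts <- as.Date(sprintf("%d-%02d-01", year, months))
  # The last day of a month is the first day of the next month minus one day.
  next_month <- ifelse(months == 12, 1, months + 1)
  next_year <- ifelse(months == 12, year + 1, year)
  ends <- as.Date(sprintf("%d-%02d-01", next_year, next_month)) - 1
  data.frame(start = starts, end = ends)
}

# Hypothetical helper: call fetch(start, end) once per row of ranges and
# stack the results. fetch is assumed to return a data frame with the same
# columns on every call, as getData(start=, end=) does in the post.
download_season <- function(ranges, fetch) {
  pieces <- lapply(seq_len(nrow(ranges)), function(i) {
    tryCatch(
      fetch(start = format(ranges$start[i]), end = format(ranges$end[i])),
      error = function(e) {
        message("Skipping ", format(ranges$start[i]), ": ", conditionMessage(e))
        NULL
      })
  })
  # Drop the months that failed, then bind the rest row-wise.
  pieces <- Filter(Negate(is.null), pieces)
  do.call(rbind, pieces)
}

# With openWAR loaded, the call might look like this (the choice of
# months is an assumption, not taken from the post):
# d2015 <- download_season(month_ranges(2015, 4:10), fetch = getData)
```

Passing the download function as an argument keeps the date logic testable without a network connection.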

Run expectancy calculations

In Chapter 5 of Analyzing Baseball with R, we spend some time explaining how to compute run values for individual plays from the Retrosheet play-by-play files. Using the GameDay variables together with the dplyr package functions, it is easy to replicate these calculations.

First I create new variables R1B, R2B, R3B, E1B, E2B, E3B that indicate the presence of a runner on each base before and after the play.

library(dplyr)
d2015 <- mutate(d2015,
                R1B=!is.na(start1B),
                R2B=!is.na(start2B),
                R3B=!is.na(start3B),
                E1B=!is.na(end1B),
                E2B=!is.na(end2B),
                E3B=!is.na(end3B))

Using the summarize function, I find the mean number of runs scored in the remainder of the inning for each combination of outs and runners on base (the data frame already contains a runsFuture variable).

(runs_expectancy <- summarize(group_by(d2015, 
                   startOuts, R1B, R2B, R3B),
                   Runs=mean(runsFuture)))
Source: local data frame [25 x 5]
Groups: startOuts, R1B, R2B [?]

   startOuts   R1B   R2B   R3B      Runs
       (dbl) (lgl) (lgl) (lgl)     (dbl)
1          0 FALSE FALSE FALSE 0.4735825
2          0 FALSE FALSE  TRUE 1.3833780
3          0 FALSE  TRUE FALSE 1.1054657
4          0 FALSE  TRUE  TRUE 2.0896130
5          0  TRUE FALSE FALSE 0.8602090
6          0  TRUE FALSE  TRUE 1.7195652
7          0  TRUE  TRUE FALSE 1.4765471
8          0  TRUE  TRUE  TRUE 2.2954876
9          1 FALSE FALSE FALSE 0.2504006
10         1 FALSE FALSE  TRUE 0.9715395
..       ...   ...   ...   ...       ...

By the way, the openWAR package has special functions that implement these calculations. To illustrate, I use the getRunEx function to create a function fit.rem, which takes base state and outs arguments and returns the corresponding run expectancy. Below I compute the expected number of runs starting with the bases loaded and 0 outs; the result, about 2.33, is close to the 2.30 from my calculation above.

fit.rem <- getRunEx(d2015)
fit.rem(7, 0)
[1] 2.329819
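The first argument of fit.rem is a single number for the base state, with 7 meaning bases loaded. Here is a minimal sketch of how the logical indicators R1B, R2B and R3B could be mapped to such a code. It assumes the binary convention first base = 1, second base = 2, third base = 4; that convention is an assumption that is consistent with 7 for bases loaded, but it should be checked against the openWAR documentation. The helper base_code is hypothetical.

```r
# Hypothetical helper: combine three logical runner indicators into one
# integer base-state code (assumed convention: 1B = 1, 2B = 2, 3B = 4).
base_code <- function(r1, r2, r3) {
  as.integer(r1) + 2L * as.integer(r2) + 4L * as.integer(r3)
}

# Bases loaded
loaded <- base_code(TRUE, TRUE, TRUE)

# Vectorized over plays: bases empty, then runners on first and third
codes <- base_code(c(FALSE, TRUE), c(FALSE, FALSE), c(FALSE, TRUE))
```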

With two applications of inner_join, I merge these run expectancies with d2015, so I now have the expected runs before and after each play. Then, using mutate and the runsOnPlay variable, I create a new variable Runs, which is the run value of the play: the change in run expectancy plus the runs scored on the play.

d2015 <- inner_join(d2015, runs_expectancy,
                    by=c("startOuts", "R1B", "R2B", "R3B"))
d2015 <- inner_join(d2015, runs_expectancy,
                    by=c("endOuts"="startOuts",
                         "E1B"="R1B",
                         "E2B"="R2B",
                         "E3B"="R3B"))
d2015 <- mutate(d2015,
                Runs=Runs.y - Runs.x + runsOnPlay)
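A small worked example may make the run value calculation concrete. The sketch below uses three made-up plays together with run expectancies copied from the table above. It also illustrates one thing worth checking: inner_join keeps only rows with a match, so a play that ends the inning (endOuts equal to 3) is retained only if runs_expectancy has rows for three outs. Whether it does depends on the data. The alternative shown here, a left_join that sets the end-of-inning expectancy to zero, is an assumption of this sketch and not necessarily what was done for the results in this post.

```r
library(dplyr)

# Run expectancies copied from the table above (three of the states).
re <- data.frame(
  startOuts = c(0, 0, 1),
  R1B = c(FALSE, TRUE, FALSE),
  R2B = c(FALSE, FALSE, FALSE),
  R3B = c(FALSE, FALSE, TRUE),
  Runs = c(0.4735825, 0.8602090, 0.9715395)
)

# Three made-up plays for illustration.
plays <- data.frame(
  play = c("leadoff single", "solo home run", "inning-ending double play"),
  startOuts = c(0, 0, 1),
  R1B = c(FALSE, FALSE, FALSE),
  R2B = c(FALSE, FALSE, FALSE),
  R3B = c(FALSE, FALSE, TRUE),
  endOuts = c(0, 0, 3),
  E1B = c(TRUE, FALSE, FALSE),
  E2B = c(FALSE, FALSE, FALSE),
  E3B = c(FALSE, FALSE, FALSE),
  runsOnPlay = c(0, 1, 0),
  stringsAsFactors = FALSE
)

start_keys <- c("startOuts", "R1B", "R2B", "R3B")
end_keys <- c("endOuts" = "startOuts", "E1B" = "R1B",
              "E2B" = "R2B", "E3B" = "R3B")

# inner_join: the inning-ending play has no matching end state and is dropped.
strict <- inner_join(plays, re, by = start_keys)
strict <- inner_join(strict, re, by = end_keys)

# left_join: keep every play and treat three outs as zero expected runs.
loose <- left_join(plays, re, by = start_keys)
loose <- left_join(loose, re, by = end_keys)
loose <- mutate(loose,
                Runs.y = ifelse(endOuts == 3, 0, Runs.y),
                Runs = Runs.y - Runs.x + runsOnPlay)
loose[, c("play", "Runs.x", "Runs.y", "runsOnPlay", "Runs")]
```

The leadoff single is worth 0.8602090 - 0.4735825 = 0.3866265 runs, the solo home run is worth exactly 1 run because the base and out state is unchanged, and the double play costs the full 0.9715395 runs that were expected before it.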

Values of different events

One topic we considered in Chapter 5 of Analyzing Baseball with R was the average run value of different types of hits (single, home run, etc.).
One can find the average run value for each type of event with the summarize function.

S <- summarize(group_by(d2015, event), M=mean(Runs))
S
Source: local data frame [31 x 2]

         event          M
        (fctr)      (dbl)
1       Double  0.7523109
2       Flyout -0.2426926
3     Forceout -0.3357790
4    Groundout -0.2077029
5  Intent Walk  0.1963707
6      Lineout -0.2448118
7      Pop Out -0.2634456
8   Runner Out -0.2621236
9       Single  0.4436674
10   Strikeout -0.2548317
..         ...        ...

These values seem to agree closely with the mean run values reported in Chapter 5 using data from a different season.
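To see which events are most and least valuable, the summary can be sorted by mean run value. Here is a small sketch that uses only the ten rows printed above (the remaining 21 events are not reproduced here); the name S10 is just for illustration.

```r
# The ten events shown in the printed output above.
S10 <- data.frame(
  event = c("Double", "Flyout", "Forceout", "Groundout", "Intent Walk",
            "Lineout", "Pop Out", "Runner Out", "Single", "Strikeout"),
  M = c(0.7523109, -0.2426926, -0.3357790, -0.2077029, 0.1963707,
        -0.2448118, -0.2634456, -0.2621236, 0.4436674, -0.2548317),
  stringsAsFactors = FALSE
)

# Sort from most to least valuable.
sorted <- S10[order(S10$M, decreasing = TRUE), ]
sorted
```

With dplyr loaded and the full summary S available, arrange(S, desc(M)) gives the same ordering for all events.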

Plotting location data

I noticed that the GameDay data includes the (x, y) locations of all batted balls. (Actually, the variables I used were our.x and our.y.) This motivated me to write a short function plot_locations. One inputs the name of a player, and the function produces a graph of the locations of that player's batted balls for the most common in-play events (singles, doubles, groundouts, flyouts, lineouts, and pop outs).

To use this function you also need to input the GameDay data frame d2015. (If you have not downloaded any GameDay data yet, you can try plot_locations using the May 2013 GameDay data frame May that is included with the openWAR package.) Since the GameDay data has only numerical player codes, I found a spreadsheet from Baseball Prospectus that matches the codes with the player names, so you have to download this file first.

playerids <- read.csv("http://www.baseballprospectus.com/sortable/playerids/playerid_list.csv")
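plot_locations <- function(player, d2015){
  # player: name in the form "First Last"
  # d2015: GameDay data frame obtained from getData
  # assumes the data frame playerids (columns LASTNAME, FIRSTNAME,
  # MLBCODE) has already been read into the workspace
  require(ggplot2)
  require(dplyr)
  # outline of the infield: the bases are 90 feet apart, so first and
  # third base sit 90 / sqrt(2) feet to either side of home plate
  c1 <- 90 / sqrt(2)
  diamond <- data.frame(x=c(0, c1, 0, -c1, 0),
                        y=c(0, c1, 2 * c1, c1, 0))
  # look up the player's MLBAM code from the first and last names
  pnames <- unlist(strsplit(player, split=" "))
  id <- select(filter(playerids,
      LASTNAME==pnames[2], FIRSTNAME==pnames[1]), MLBCODE)
  # %in% returns no rows (instead of failing) when no code is found
  d <- filter(d2015, batterId %in% id$MLBCODE)
  # event types to display
  in_play <- c("Single", "Double", "Lineout",
               "Groundout", "Pop Out", "Flyout")
  ggplot(data=filter(d, event %in% in_play),
         aes(our.x, our.y, color=event)) +
    geom_point() + xlab("") + ylab("") +
    geom_path(data=diamond, aes(x, y), color="red", size=1.5) +
    coord_fixed(ratio=1) +
    ggtitle(paste(player, "'s 2015 Batted Balls In-Play", sep=""))
}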
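The name lookup inside plot_locations splits the name on a space and matches the first and last names, so it can match no player (a misspelling, or a name with more than two words) or more than one player (two players sharing a name). Here is a hedged sketch of a stricter lookup that stops with an error in those cases. The helper lookup_mlb_id and the toy table are hypothetical, and the codes in the toy table are made up; only the column names LASTNAME, FIRSTNAME and MLBCODE come from the function above.

```r
# Hypothetical helper: return the MLBAM code for "First Last", or stop
# if the name matches zero or several rows of the id table.
lookup_mlb_id <- function(player, ids) {
  pnames <- unlist(strsplit(player, split = " "))
  # which() drops NA comparisons (e.g. when only one word was supplied)
  rows <- which(ids$FIRSTNAME == pnames[1] & ids$LASTNAME == pnames[2])
  if (length(rows) != 1) {
    stop("Expected exactly one match for '", player,
         "', found ", length(rows))
  }
  ids$MLBCODE[rows]
}

# Toy id table with made-up codes, for illustration only.
toy_ids <- data.frame(
  LASTNAME = c("Harper", "Trout", "Smith", "Smith"),
  FIRSTNAME = c("Bryce", "Mike", "Will", "Will"),
  MLBCODE = c(1001, 1002, 1003, 1004),
  stringsAsFactors = FALSE
)

harper_id <- lookup_mlb_id("Bryce Harper", toy_ids)
# lookup_mlb_id("Will Smith", toy_ids) stops: two rows share that name
```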

Here is a graph showing the locations of all batted balls of Bryce Harper during his 2015 MVP season.

plot_locations("Bryce Harper", d2015)

[Figure: bryceharper2015, locations of Bryce Harper's 2015 batted balls in play, colored by event type]

Concluding comments

I really have not said much about the capabilities of the openWAR package yet, but the package is certainly useful for downloading the MLBAM GameDay data. I’ll talk more about it in future blog posts.