Ben Baumer and Greg Matthews have a relatively new R package, `openWAR`, that facilitates calculation of WAR based on the methodology from their recently published paper in the Journal of Quantitative Analysis in Sports. See their GitHub site for instructions on installing the package.

One attractive feature of the `openWAR` package is that it can download MLBAM GameDay play-by-play data. Recently, I illustrated the use of the `retrosheet` package to download similar data from Retrosheet. We’ll see that the data frames from MLBAM GameDay contain many more variables, which makes it easier to develop interesting analyses.

In this post, I download play-by-play data, work through run expectancy calculations, and use the MLBAM GameDay data to show the locations of all in-play events for specific players.

### Downloading MLBAM data

Once you have loaded the `openWAR` package, downloading data is remarkably easy. For example, to download all GameDay data for the games in May 2015, I use the `getData` function with start and end dates.

    library(openWAR)
    may <- getData(start="2015-05-01", end="2015-05-31")

The output is a data frame with 62 variables where each row corresponds to a plate appearance.

To get data for the entire season, I called `getData` repeatedly, once for each month, and then combined the results into a single data frame `d2015` for the whole season. (There were some problems in downloading data for specific games; when I was done I had data for 2414 games, which represented over 99 percent of all games.)
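
The month-by-month loop can be sketched roughly as follows. This is only an illustration, not the code I actually ran: `month_ranges`, `download_season`, and `fake_fetch` are hypothetical helper names, the April-to-October range is an assumption, and the real download step (a call to `getData`) is replaced by a stand-in so that the sketch runs without `openWAR` or a network connection.

```r
# Build the first and last date of each month (base R only).
month_ranges <- function(year, first_month, last_month) {
  n <- last_month - first_month + 1
  # One extra month start, so each month's end is "next start minus one day".
  starts <- seq(as.Date(sprintf("%d-%02d-01", year, first_month)),
                by = "month", length.out = n + 1)
  data.frame(start = format(starts[seq_len(n)]),
             end   = format(starts[-1] - 1),
             stringsAsFactors = FALSE)
}

# Call `fetch` once per month and stack the resulting data frames.
# With openWAR loaded, `fetch` could be (an assumption, not tested here):
#   function(start, end) getData(start = start, end = end)
download_season <- function(ranges, fetch) {
  pieces <- Map(fetch, ranges$start, ranges$end)
  do.call(rbind, unname(pieces))
}

# Stand-in for the download step: returns one row per request.
fake_fetch <- function(start, end) {
  data.frame(start = start, end = end, games = 1L, stringsAsFactors = FALSE)
}

ranges <- month_ranges(2015, 4, 10)   # assumed April through October
season <- download_season(ranges, fake_fetch)
print(ranges)
```

Since some individual games failed to download for me, it is worth comparing the number of distinct games in the combined data frame against the schedule before going further.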

### Run expectancy calculations

In Chapter 5 of Analyzing Baseball with R, we spend some time explaining how to compute run values for individual plays from the Retrosheet play-by-play files. Using the GameDay variables together with functions from the `dplyr` package, it is easy to replicate these calculations.

First I create new variables `R1B`, `R2B`, `R3B`, `E1B`, `E2B`, `E3B` that indicate the presence of a runner on each base before (`R`) and after (`E`) the play.

    library(dplyr)
    d2015 <- mutate(d2015,
                    R1B=!is.na(start1B), R2B=!is.na(start2B), R3B=!is.na(start3B),
                    E1B=!is.na(end1B), E2B=!is.na(end2B), E3B=!is.na(end3B))

Using the `summarize` function, I find the mean runs scored in the remainder of the inning for each combination of outs and runners on base (a `runsFuture` variable is already available in the data frame).

    (runs_expectancy <- summarize(group_by(d2015, startOuts, R1B, R2B, R3B),
                                  Runs=mean(runsFuture)))

    Source: local data frame [25 x 5]
    Groups: startOuts, R1B, R2B [?]

       startOuts   R1B   R2B   R3B      Runs
           (dbl) (lgl) (lgl) (lgl)     (dbl)
    1          0 FALSE FALSE FALSE 0.4735825
    2          0 FALSE FALSE  TRUE 1.3833780
    3          0 FALSE  TRUE FALSE 1.1054657
    4          0 FALSE  TRUE  TRUE 2.0896130
    5          0  TRUE FALSE FALSE 0.8602090
    6          0  TRUE FALSE  TRUE 1.7195652
    7          0  TRUE  TRUE FALSE 1.4765471
    8          0  TRUE  TRUE  TRUE 2.2954876
    9          1 FALSE FALSE FALSE 0.2504006
    10         1 FALSE FALSE  TRUE 0.9715395
    ..       ...   ...   ...   ...       ...
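
To make the grouping step concrete, here is a minimal sketch on a made-up data frame of six plate appearances. It uses base R's `aggregate` as a stand-in for `group_by` and `summarize` so that it runs without `dplyr`; the toy values are invented for illustration and are not GameDay data.

```r
# Toy plate appearances. runsFuture is the number of runs scored from this
# plate appearance to the end of the inning (invented values).
pa <- data.frame(
  startOuts  = c(0, 0, 0, 1, 1, 1),
  R1B        = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  R2B        = FALSE,
  R3B        = FALSE,
  runsFuture = c(0, 1, 2, 0, 0, 1)
)

# Run expectancy = mean of runsFuture within each (outs, bases) state.
re <- aggregate(runsFuture ~ startOuts + R1B + R2B + R3B, data = pa, FUN = mean)
names(re)[names(re) == "runsFuture"] <- "Runs"
re <- re[order(re$startOuts, re$R1B), ]
print(re)
```

With real data each state contains thousands of plate appearances, so the means are far more stable than in this toy table.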

By the way, the `openWAR` package has special functions that implement these calculations. To illustrate, I use the `getRunEx` function to create a function `fit.rem`, which takes base state and outs arguments and returns the corresponding run expectancy. Below I compute the expected number of runs starting with bases loaded and 0 outs, which essentially agrees with the value of about 2.30 in my table above.

    fit.rem <- getRunEx(d2015)
    fit.rem(7, 0)
    [1] 2.329819
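
The first argument, 7, stands for bases loaded. A natural reading is that each base contributes one binary digit (first base 1, second base 2, third base 4). That encoding is my assumption, chosen because it is consistent with 7 meaning bases loaded; check the package documentation before relying on it. The helper `base_code` below is a hypothetical name, not an `openWAR` function.

```r
# Assumed encoding: first base = 1, second base = 2, third base = 4.
base_code <- function(R1B, R2B, R3B) {
  R1B * 1L + R2B * 2L + R3B * 4L
}

loaded <- base_code(TRUE, TRUE, TRUE)      # 7
empty  <- base_code(FALSE, FALSE, FALSE)   # 0
corners <- base_code(TRUE, FALSE, TRUE)    # 5: runners on first and third

# The function is vectorized, so it also works on whole columns.
codes <- base_code(c(FALSE, TRUE), c(FALSE, TRUE), c(FALSE, TRUE))
print(c(loaded, empty, corners))
```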

With two applications of `inner_join`, I merge these run expectancies with `d2015`, so that I now have the expected runs before and after each play. Then, using `mutate` and the `runsOnPlay` variable, I create a new variable `Runs`, which is the run value of the play.

    d2015 <- inner_join(d2015, runs_expectancy,
                        by=c("startOuts", "R1B", "R2B", "R3B"))
    d2015 <- inner_join(d2015, runs_expectancy,
                        by=c("endOuts"="startOuts", "E1B"="R1B", "E2B"="R2B", "E3B"="R3B"))
    d2015 <- mutate(d2015, Runs=Runs.y - Runs.x + runsOnPlay)
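
After the second join, both tables have a column named `Runs`, so `dplyr` adds its default suffixes: `Runs.x` is the run expectancy before the play and `Runs.y` is the run expectancy after it. As a quick check of the arithmetic, here is a small worked example using two values from the run expectancy table above; `run_value` is a hypothetical helper written only for this illustration.

```r
# Run expectancies taken from the table above (0 outs).
re_empty <- 0.4735825   # bases empty
re_first <- 0.8602090   # runner on first

# Run value = expectancy after - expectancy before + runs scored on the play.
run_value <- function(re_before, re_after, runs_on_play) {
  re_after - re_before + runs_on_play
}

# Leadoff single: the state moves from bases empty to runner on first, no run scores.
single <- run_value(re_empty, re_first, 0)   # 0.3866265

# Leadoff home run: the state is unchanged (bases empty, 0 outs), one run scores.
homer <- run_value(re_empty, re_empty, 1)    # 1

print(c(single = single, homer = homer))
```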

### Values of different events

One topic we considered in Chapter 5 of Analyzing Baseball with R was the average run value of different types of hits (single, home run, etc.).

One can find the average run value for each type of event by using the `summarize` function.

    S <- summarize(group_by(d2015, event), M=mean(Runs))
    S

    Source: local data frame [31 x 2]

             event          M
            (fctr)      (dbl)
    1       Double  0.7523109
    2       Flyout -0.2426926
    3     Forceout -0.3357790
    4    Groundout -0.2077029
    5  Intent Walk  0.1963707
    6      Lineout -0.2448118
    7      Pop Out -0.2634456
    8   Runner Out -0.2621236
    9       Single  0.4436674
    10   Strikeout -0.2548317
    ..         ...        ...

These values seem to agree closely with the mean run values reported in Chapter 5 using data from a different season.

### Plotting location data

I noticed that the GameDay data includes the (x, y) locations of all batted balls. (Actually the variables I used were `our.x` and `our.y`.) This motivated me to write a short function `plot_locations`. One inputs the name of the player and the function produces a graph of the locations of all balls put in play.

To use this function you also need to input the GameDay data frame `d2015`. (If you have not downloaded any GameDay data yet, you can try `plot_locations` using the May 2013 GameDay data `May` included with the `openWAR` package.) Since the GameDay data has only numerical player codes, I found a spreadsheet from Baseball Prospectus that matches the codes with the player names, so you have to download this file first.

    playerids <- read.csv("http://www.baseballprospectus.com/sortable/playerids/playerid_list.csv")

    plot_locations <- function(player, d2015){
      # assumes that player id is contained in data frame playerids
      require(ggplot2)
      require(dplyr)
      # outline of the infield: bases 90 feet apart, home plate at the origin
      c1 <- 90 / sqrt(2)
      diamond <- data.frame(x=c(0, c1, 0, -c1, 0),
                            y=c(0, c1, 2 * c1, c1, 0))
      # look up the MLBAM id from a "First Last" name
      pnames <- unlist(strsplit(player, split=" "))
      id <- select(filter(playerids, LASTNAME==pnames[2], FIRSTNAME==pnames[1]),
                   MLBCODE)
      d <- filter(d2015, batterId==id$MLBCODE)
      # keep only the balls-in-play event types and plot their locations
      ggplot(data=filter(d, event=="Single" | event=="Lineout" |
                            event=="Groundout" | event=="Pop Out" |
                            event=="Flyout" | event=="Double"),
             aes(our.x, our.y, color=event)) +
        geom_point() +
        xlab("") + ylab("") +
        geom_path(data=diamond, aes(x, y), color="red", size=1.5) +
        coord_fixed(ratio=1) +
        ggtitle(paste(player, "'s 2015 Batted Balls In-Play", sep=""))
    }
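
The constant `c1 <- 90 / sqrt(2)` may look mysterious, so here is a small check of the geometry. It places home plate at the origin and rotates the infield square by 45 degrees, which makes each side exactly 90 feet. That `our.x` and `our.y` are measured in feet from home plate is an assumption implied by overlaying this outline on the data; I have not confirmed it from documentation.

```r
c1 <- 90 / sqrt(2)

# Corners in order: home, first, second, third, back to home.
diamond <- data.frame(x = c(0, c1, 0, -c1, 0),
                      y = c(0, c1, 2 * c1, c1, 0))

# Length of each of the four sides (distance between consecutive corners).
side_lengths <- sqrt(diff(diamond$x)^2 + diff(diamond$y)^2)

# Distance from home plate to second base, straight up the y axis.
home_to_second <- diamond$y[3] - diamond$y[1]

print(side_lengths)
print(home_to_second)
```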

Here is a graph showing the locations of all batted balls of Bryce Harper during his 2015 MVP season.

    plot_locations("Bryce Harper", d2015)
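
One caveat: the name lookup inside `plot_locations` takes the first two words of the name, so it only handles "First Last" names and does not complain when there is no match or more than one. Below is a hedged sketch of a slightly more careful lookup. The helper names `split_name` and `lookup_id` are hypothetical, the toy table uses invented names and ids, and treating the first word as the first name is an assumption that will not suit every player.

```r
# Assume the first word is the first name and everything after it is the last name.
split_name <- function(player) {
  parts <- strsplit(player, " ", fixed = TRUE)[[1]]
  list(first = parts[1], last = paste(parts[-1], collapse = " "))
}

# Return the single matching MLBCODE, or stop with an informative error.
lookup_id <- function(player, ids) {
  nm <- split_name(player)
  hit <- ids[ids$FIRSTNAME == nm$first & ids$LASTNAME == nm$last, "MLBCODE"]
  if (length(hit) != 1) {
    stop("expected exactly one match for ", player, ", found ", length(hit))
  }
  hit
}

# Invented lookup table with the same column names as the playerids file.
toy_ids <- data.frame(FIRSTNAME = c("Ann", "Juan"),
                      LASTNAME  = c("Lee", "De La Cruz"),
                      MLBCODE   = c(1L, 2L),
                      stringsAsFactors = FALSE)

found <- lookup_id("Juan De La Cruz", toy_ids)
print(found)
```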

### Concluding comments

I really have not said much about the capabilities of the `openWAR` package yet. But the package certainly is useful for downloading the MLBAM GameDay data. I’ll talk more about this package in future blog posts.

Hi!

I’m really excited that this is available! Could you provide (or point me in the direction of) an explanation of the 62 variables in the data frame? Or, more specifically, could you talk about the difference between the variables x, y and our.x, our.y?

Thanks so much!

Daniel, I don’t know much about the 62 variables although I suppose there is a glossary somewhere. Most of the variables seemed self-explanatory. our.x and our.y seemed to make sense based on my knowledge of the ballpark dimensions.

Hi Jim,

If you come across a glossary, let me know.

Looks like the only difference between our.x, our.y and x,y is where zero is defined to be. Check out these plots. They look identical, but the axes have very different values.

Thanks