Building an Expected Run Matrix with openWAR

Last time, we learned how to download MLBAM GameDay data using the openWAR package. Specifically, we downloaded data from a single game and created some simple summaries.

Most often, we’ll be interested in investigated data from many games. The function getData() will download data over any time interval in which you are interested. Let’s figure out how many home runs were hit on May 14th, 2013.

require(openWAR)
ds = getData(start = "2013-05-14")
subset(ds, event == "Home Run", select = c("gameId", "batterId", "description"))

Visualizing the data

One nice aspect of the MLBAM data is that it contains an (x,y)-coordinate indicated the location of each batted ball hit into play. getData() returns a data.frame of class GameDayPlays. We have written plot.GameDayPlays() function for visualizing hit location data with a generic field overlaid.

plot(data = ds) Note that we have done some of the work to normalize the coordinates provided by MLBAM – though there is still more to be done.

Modeling

In order to compute openWAR, we need to model several quantities. The first thing we need to understand is the relative value of each “state” of a half-inning. Since there are three bases, each of which can be either occupied or unoccupied, and there are three possible numbers of outs, each plate appearance begins with the half-inning in one of 25 possible states (the 24 states, plus one last state for three outs). We would like to assign a value to each one of these states that indicates the expected number of runs that will be scored in the remainder of that half-inning. We have precomputed the states and the number of runsFuture associated with each play.

Thus, we want to fit the model $runsFuture \sim startCode + startOuts + startCode \cdot startOuts,$

where startCode is a description of the configuration of the baserunners, and startOuts is the number of outs in the half-inning at the beginning of the plate appearance.

For example, consider the bottom of the 1st inning of the Mets-Braves game from August 12th of 2012:

gd = gameday()
subset(gd$ds, inning == 1 & half == "bottom", select = c("runsFuture", "runsOnPlay", "startCode", "startOuts", "description")) The Mets scored two runs in the inning, and thus, when Ruben Tejada opened the inning, there were no runners on base, no outs, but two runsFuture were associated with this play. After Tejada flew out, there was one out, but still no one on base and two runsFuture. After Mike Baxter singles, David Wright came to the plate with a runner on first (startCode = 1), one out, and two runsFuture. His double scored one run, so Ike Davis followed with a runner on third, one out, and now only one runsFuture. By the time Daniel Murphy bats, there are no further runsFuture in the inning. Every inning begins with no one on and no one out. In this example, two runs scored in the inning. By averaging over all innings, we create an estimate of the expected runsFuture for the state (0,0). But we can just as easily do the same for all states. Building a model for expected runs The full data from 2013 is currently included in the openWAR package, although it seems likely that for legal reasons, this practice will most likely not continue. data(MLBAM2013) Note that this data.frame contains information about more than 185 thousand plate appearances from the 2013 season. nrow(MLBAM2013) The simplest way to build a model for runsFuture is to just take the average over all observations. require(plyr) ddply(MLBAM2013[, c("runsFuture", "startCode", "startOuts")], ~startCode + startOuts, mean) Note that this results in exactly the same coefficients as would be returned by either an ANOVA or multiple regression model, as long as we are careful to force R to consider our explanatory variables as categorical. coef(aov(runsFuture ~ factor(startCode) * factor(startOuts), data = MLBAM2013)) coef(lm(runsFuture ~ factor(startCode) * factor(startOuts), data = MLBAM2013)) In the openWAR package, we have written a function getRunEx() that will take a GameDayPlays object and return a function. The inning began in the state (0,0). Our estimate $\hat{\rho}$(0,0) of the expected value (in runs) of that state is: fit.rem = getRunEx(MLBAM2013) fit.rem(baseCode = 0, outs = 0) Note that by default, getRunEx() will drop any incomplete innings (e.g. innings that didn’t finish), so that the estimate for $\hat{\rho}$(0,0) is slightly different from the 0.470176 value that we saw above. It will also return a value of 0 for any input with three outs. Now we can use the fit.rem() function to assess changes in the expected state of the game. For example, let’s compute the value of RE24 for the Mets in 2013: nym = subset(MLBAM2013, (home_teamId == 121 | away_teamId == 121) & field_teamId != 121) nym$re.start = mapply(fit.rem, nym$startCode, nym$startOuts)

As expected, we see David Wright at the top of the list.

1. 