pitchRx, meet openWAR

Both pitchRx and openWAR scrape Gameday data, but are quite different in their implementation. This post will help expose those differences and show how to combine them to make 1 + 1 > 2.

Similarities/differences between openWAR and pitchRx

Both openWAR and pitchRx can obtain so-called “play-by-play” Gameday data, but provide it in a different format. One reason for this difference is due to pitchRx's agnostic approach to the type of analysis you'd like to perform. It simply returns data in a collection of tables that closely resembles the raw form. On the other hand, openWAR returns the play-by-play data in a format which is convenient for computing openWAR estimates. Fortunately, I think this format is quite useful for a wide variety of problems and enhances a PITCHf/x database created using pitchRx.

On a side note, the data format returned by openWAR::getData could be produced from tables returned by scrape, but it's not easy (the 'atbat', 'runner', 'action', 'hip', and 'player' tables collectively contain the same info). So, in effect, it does not add new data to a pitchRx database, but it can dramatically simplify certain queries (e.g. searching for PITCHf/x by catcher, shortstop, center fielder, etc). Before “Adding openWAR data to your PITCHf/x database” (the last section below), I'll explain in detail how to acquire play-by-play data for a particular set of games since both pitchRx and openWAR have this capability. Along the way, we'll also learn how to merge openWAR's play-by-play data with pitchRx pitch-by-pitch data.

Game specific data acquistion

Just like pitchRx::scrape, openWAR::getData can acquire data from any set of games using MLBAM's gameday identifiers. pitchRx comes prepackaged will all the available game identifiers which we can use to query a specific set of games over an extended period of time. To understand how this works, let's take a look at the format of these identifiers.

Understanding gameday identifiers

data(gids, package = "pitchRx")
## [1] "gid_2008_02_26_fanbbc_phimlb_1" "gid_2008_02_26_flsbbc_detmlb_1"
## [3] "gid_2008_02_26_umibbc_flomlb_1" "gid_2008_02_26_umwbbc_nynmlb_1"
## [5] "gid_2008_02_27_cinmlb_phimlb_1" "gid_2008_02_27_colmlb_chamlb_1"

All the ids have the general form 'gid_YYYY_MM_DD_xxxaaa_yyybbb_1'. The 'gid_' part of this identifier is the only part that does not change from game to game. Most identifiers will end with '_1', but only if it is not the second game of a doubleheader. The 'YYYY', 'MM' and 'DD' parts are numeric representations of the year, month and day (respectively). The 'xxx' part is an abbreviation for the away team name and the 'aaa' part is an abbreviation of the away team's league. The same goes for the 'yyybbb' part except this abbreviation is for the home team. As you can see from the output above, some of these identifiers correspond to preseason games (the regular season doesn't usually begin until the start of April).

Using regular expressions

Regular expressions are useful for extracting a particular set of gameday identifiers since they allows us to search for patterns in a set of strings. For instance, this regular expression would match every Twins home game in 2013:

reg.ex <- "2013_[0-9]{2}_[0-9]{2}_[a-z]{6}_minmlb"

For those not familiar with regular expressions, the [0-9]{2} bit says “any two numbers may go here” and the [a-z]{6} bit says “any six letters (from the Roman alphabet) may go here”. Remember that gids contains preseason, regular season and postseason games. So, if by “every home game” we really meant “every regular season home game”, we can easily restrict to games in April through September (if we change [0-9]+ to 0[4-9] in the month slot).

home <- "2013_0[4-9]_[0-9]{2}_[a-z]{6}_minmlb"

Now, to use this expression to obtain the relevant gids, we can use the base R function grepl (or grep). grepl returns a logical vector the same length the object it is searching. A particular element of that vector will be TRUE if and only if the regular expression found a match. Thus, we can extract the relevant elements of gids by passing this logical vector to R's [ operator like so:

reg.home <- gids[grepl(home, gids)]

Let's do the same thing for away games so that we have every Twins's regular season game from 2013.

reg.away <- gids[grepl("2013_0[4-9]_[0-9]{2}_minmlb_[a-z]{6}", gids)]
reg.season <- c(reg.home, reg.away)
## [1] 167

Not all identifiers are created equal

According to reg.season, the Twin's played 167 regular season games. This is clearly more than the usual 162 games that we would expect. Just for investigative purposes, I'll search my PITCHf/x database to find out which of these games were not considered “Final”.

db <- src_sqlite("~/pitchfx/pitchRx.sqlite3")
games <- tbl(db, "game") %>% select(gameday_link, status) %>% 
  filter(status != "Final") %>% collect()
subset(games, gameday_link %in% sub("gid_", "", reg.season))
## Source: local data frame [5 x 2]
##                   gameday_link    status
## 324 2013_04_14_nynmlb_minmlb_1 Postponed
## 328 2013_04_17_anamlb_minmlb_1 Postponed
## 333 2013_04_19_minmlb_chamlb_1 Postponed
## 334 2013_04_22_miamlb_minmlb_1 Postponed
## 350 2013_06_07_minmlb_wasmlb_1 Postponed

These “Postponed” games would explain the discrepancy between the length of reg.season and the true length of the regular season. Fortunately, both pitchRx::scrape and openWAR::getData are smart enough to “ignore” this issue when collecting play-by-play and/or PITCHf/x data. In fact, if we navigate to the gameday page specific to gid_2013_04_14_nynmlb_minmlb_1, we'll see that the 'inning' directory where play-by-play and PITCHf/x would appear is missing.

Combining forces

At this point, once we have the gameday identifiers of interest, it is very simple to use openWAR::getData and/or pitchRx::scrape to obtain data for these specific games. Just for the sake of not abusing MLBAM's website, I'll acquire data from just the first two regular season games.

plays <- openWAR::getData(gameIds = reg.season[1:2])
dat <- pitchRx::scrape(game.ids = reg.season[1:2])

The data returned by openWAR::getData could be seen as a replacement of sorts to the 'atbat' table returned by pitchRx::scrape. Both of them have one atbat per record (meaning the atbat order and gameday identifier columns together form a unique key). Before we merge openWAR's play-by-play data with pitchRx's pitch-by-pitch data, we'll have to clean up a few of the column names.

# Get column names to match (needed in order to join play-by-play & pitch-by-pitch)
names(plays) <- sub("ab_num", "num", names(plays))
names(plays) <- sub("gameId", "gameday_link", names(plays))
# There are two set of 'x' and 'y' columns that mean two different things
# Let's rename the pitch location 'x' and 'y' to 'old_x' and 'old_y' since
# these are pitch locations under the 'old' Gameday coordinate system
names(dat$pitch) <- sub("^x$", "old_x", names(dat$pitch))
names(dat$pitch) <- sub("^y$", "old_y", names(dat$pitch))
pitchfx <- dplyr::inner_join(dat$pitch, plays, by = c("num", "gameday_link"))

Now we can easily do stuff like search pitches by player positions. For instance, maybe you want to called strikes where Joe Mauer was playing catcher. The pitchfx data.frame currently has player IDs for each player position, but we can easily map those to player names using the players data.frame which comes with pitchRx.

data(players, package = "pitchRx")
names(players) <- c("playerId.C", "catcherName")
pitchfx %>%
  dplyr::left_join(players, by = "playerId.C") %>%
  filter(catcherName == "Joe Mauer") %>%
  filter(des == "Called Strike") %>%
  strikeFX() + facet_wrap( ~ stand) + 
  theme_bw() + coord_equal()

plot of chunk players

Adding openWAR data to your PITCHf/x database

If you have a PITCHf/x database that was obtained using pitchRx, it's pretty simple to add openWAR's play-by-play data to your database.

db <- src_sqlite("~/pitchfx/pitchRx.sqlite3")
# Acquire 2008 play-by-play data
dat08 <- openWAR::getData(start = "2008-01-01", end = "2009-01-01")
# Creates new table named 'openWAR' in database with 08 data
pitchRx::export(db$con, value = dat08, name = "openWAR")
# Acquire 2009 play-by-play
dat09 <- openWAR::getData(start = "2009-01-01", end = "2010-01-01")
# Appends 09 data to the 'openWAR' table
pitchRx::export(db$con, value = dat09, name = "openWAR")
# Keep repeating this process...

To query your database with your newly added openWAR table, you can follow the principles outlined here (replacing the 'atbat' table with the 'openWAR' table)


11 responses

  1. There’s also the fact that “Sxslt” hasn’t been ported to windows, if I’m running my windows machine, I’m forced to use ptichRx.

    1. Hi Martin, Sxslt does run on Windows. I’ve got it running on XP. You just need to install an external XSLT library.

  2. I’m a beginner in this space. I’ve purchased this book and have taken SABR101x and am trying to learn more about R. But I am unable to install SXSLT as well. I’m running Windows 7 and get the following error message:

    * installing *source* package ‘Sxslt’ …
    Warning: running command ‘sh ./configure.win’ had status 127
    ERROR: configuration failed for package ‘Sxslt’
    * removing ‘C:/Program Files/R/R-3.1.0/library/Sxslt’
    Warning in install.packages :
    running command ‘”C:/PROGRA~1/R/R-31~1.0/bin/x64/R” CMD INSTALL -l “C:\Program Files\R\R-3.1.0\library” C:\Users\DD\AppData\Local\Temp\Rtmp4gbXuB/downloaded_packages/Sxslt_0.91-2.tar.gz’ had status 1
    Warning in install.packages :
    installation of package ‘Sxslt’ had non-zero exit status

    1. Sorry for the late reply. Have you tried following the installation instructions here?


  3. Yes. I received the error message above when I tried:
    install.packages(“Sxslt”, repos = “http://www.omegahat.org/R”, type = “source”)

    I have not tried this: http://www.omegahat.org/Sxslt/INSTALL.html

    Like I said, I’m a beginner but trying hard to conceptually understand this. I wasn’t sure if I needed to/could load libxml (2.4.1) and libxslt (1.0.1) on Windows 7 (instructions say no one’s tried to port this to Windows.

    1. Did you solve it? I got the same errors on Win 8.

      1. I have not. I found a better tutorial for how to install openWAR: http://www.datascienceriot.com/?p=185

        It shows you where to get the libxml files and instructions for where to put them. But as I said on my original response, I am not very skilled in this area so the libxml installation instructions don’t mean much to me.

        Seems like everyone that installs this is using Linux or MAC OS. Everyone talks about Windows weirdness but don’t know for sure because they’re using Linux/MAC.

    2. A programmer named Igor Zlatkovic is the “keeper of the Windows port,” lfor Libxml he discusses Windows issues here. Apparently there is a dependency tree and you have to make sure everything is installed.


  4. Great post by the way. This works like a charm on Ubuntu 14.04. I haven’t tried OpenWAR with my Windows machine but when I get that running I’ll come back and try to help out.

  5. I’m having an issue with doubleheader games, it seems that pitchRx does not have these games. For example, there were 3 doubleheaders on May 1, 2014 and I can’t get any of them. I used the identifiers above, for ex. reg.away <- gids[grepl("2014_05_01_lanmlb_[a-z]{6}_2", gids)]. Also pit and tba played double headers as the away team. They all come back empty. Also, the sqlite database I built using the scrape function does not seem to include any doubleheaders either. Just curious if I am doing something wrong or is this a similar issue for everyone.

  6. Hi Gerald,

    Thanks for reporting this. You’re definitely right that the second game of doubleheaders on May 1st, 2014 are missing —

    gids[grep(“2014_05_01.*”, gids)]
    # [1] “gid_2014_05_01_lanmlb_minmlb_1” “gid_2014_05_01_seamlb_nyamlb_1”
    # [3] “gid_2014_05_01_atlmlb_miamlb_1” “gid_2014_05_01_milmlb_cinmlb_1”
    # [5] “gid_2014_05_01_tbamlb_bosmlb_1” “gid_2014_05_01_tormlb_kcamlb_1”
    # [7] “gid_2014_05_01_nynmlb_colmlb_1”

    That being said, I wouldn’t say *all* the second games of doubleheaders are missing —

    length(grep(“2$”, gids))
    # [1] 165

    These also appear in my database.

    db <- src_sqlite("~/pitchfx/pitchRx.sqlite3")
    links % select(gameday_link) %>%
    collect() %>% unique()
    length(grep(“2$”, links$gameday_link))
    # [1] 165

    For a simple fix, you could grab the gameday ids for these games —

    a = html(“http://gd2.mlb.com/components/game/mlb/year_2014/month_05/day_01/”) %>%
    html_nodes(“a”) %>% html_text %>% str_trim
    missed <- sub("/$", "", a[grepl("^gid_.*2/$", a)])
    # [1] "gid_2014_05_01_lanmlb_minmlb_2" "gid_2014_05_01_pitmlb_balmlb_2"
    # [3] "gid_2014_05_01_tbamlb_bosmlb_2"

    And pass these to the game.ids argument of the scrape() function (in this case, I'm using it to append data from these games to my database).

    scrape(game.ids = missed, connect = db$con)

    For posterity, I will add these ids to data(gids) in future versions of pitchRx.



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: