Introduction to openWAR

Hello, world! I’m very excited to be taking part in this blog, and am looking forward to sharing my enthusiasm for two great things that are great together: baseball and R!

In this initial post, I’m going to introduce you to a new R package that I have been developing with Greg Matthews, a biostatistician at UMass. The package is called openWAR, because our goal is to produce a fully open-source, reproducible version of Wins Above Replacement (WAR). Our paper on the subject is on the arXiv, but even if this doesn’t interest you, you might still be interested in the package, because it can do a lot of things that aren’t related to WAR at all.

First, the package contains functions for downloading and processing the XML files that power the MLBAM GameDay web application. Carson Sievert has written a similar package called pitchRx for the PITCHf/x data, but openWAR works with the play-by-play data — not the pitch-by-pitch data. Although this data is not “free as in freedom”, it is “free as in beer.”

Installing openWAR

openWAR is not yet on CRAN, but it is on GitHub. Currently, openWAR relies on Duncan Temple Lang’s Sxslt package, which provides XSLT functionality from within R. This allows a particularly elegant method of transforming the raw XML files from MLBAM into tidy data frames in R. Unfortunately, Sxslt is not on CRAN either; it is hosted by the Omegahat project. You can install it using the repos argument:

install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")

Depending on your operating system, you may first need to install basic XSLT functionality outside of R. Please see the Sxslt installation instructions for details on how to do this.

Next, installing openWAR is best accomplished through the install_github() function in the devtools package.

library(devtools)
# Current versions of devtools expect the "username/repo" form
install_github("beanumber/openWAR")

Accessing MLBAM data

The base class in openWAR is called gameday, and it collects information about a single major league game. (In principle, minor league games could be included just as easily, but right now the parsers will only download major league data.) An object of class gameday can be created if you know the MLBAM ID for the game you want to investigate. How are you supposed to know this ID? We’ve written a function that will figure this out for you.

Let’s say that you want the list of games that were played on July 21st, 2013. We can ask for the list of games:

library(openWAR)
getGameIds(date = as.Date("2013-07-21"))

Retrieving data from 2013-07-21 ...
...found 15 games
 [1] "gid_2013_07_21_arimlb_sfnmlb_1" "gid_2013_07_21_atlmlb_chamlb_1" "gid_2013_07_21_balmlb_texmlb_1"
 [4] "gid_2013_07_21_chnmlb_colmlb_1" "gid_2013_07_21_clemlb_minmlb_1" "gid_2013_07_21_detmlb_kcamlb_1"
 [7] "gid_2013_07_21_lanmlb_wasmlb_1" "gid_2013_07_21_miamlb_milmlb_1" "gid_2013_07_21_nyamlb_bosmlb_1"
[10] "gid_2013_07_21_oakmlb_anamlb_1" "gid_2013_07_21_phimlb_nynmlb_1" "gid_2013_07_21_pitmlb_cinmlb_1"
[13] "gid_2013_07_21_sdnmlb_slnmlb_1" "gid_2013_07_21_seamlb_houmlb_1" "gid_2013_07_21_tbamlb_tormlb_1"
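The IDs themselves carry useful information. Judging from the output above, each one appears to follow the pattern gid_YYYY_MM_DD_<away>mlb_<home>mlb_<n>. The sketch below splits an ID into its parts using only base R. Note that parse_game_id() is a hypothetical helper written for this post, not an openWAR function, and reading the trailing number as a game number (for doubleheaders) is an assumption.

```r
# Hypothetical helper (not part of openWAR): split MLBAM game IDs into fields.
# Assumes the pattern gid_YYYY_MM_DD_<away>mlb_<home>mlb_<n>.
parse_game_id <- function(gameId) {
  parts <- strsplit(gameId, "_", fixed = TRUE)
  pick <- function(f) vapply(parts, f, character(1))
  data.frame(
    gameId   = gameId,
    date     = as.Date(pick(function(p) paste(p[2:4], collapse = "-"))),
    away     = pick(function(p) sub("mlb$", "", p[5])),  # away team is listed first
    home     = pick(function(p) sub("mlb$", "", p[6])),
    game_num = as.integer(pick(function(p) p[7])),       # assumed: game number that day
    stringsAsFactors = FALSE
  )
}

ids <- c("gid_2013_07_21_phimlb_nynmlb_1", "gid_2013_07_21_nyamlb_bosmlb_1")
games <- parse_game_id(ids)
print(games)
```

Something like subset(games, home == "nyn") would then pick out the game at Citi Field.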

Since Jim is a Phillies fan, let’s investigate the Mets-Phillies game that was played on that date.

gd <- gameday(gameId = "gid_2013_07_21_phimlb_nynmlb_1")
summary(gd)
class(gd)

       Length Class      Mode
gameId  1     -none-     character
base    1     -none-     character
url     5     -none-     character
ds     62     data.frame list
[1] "gameday"

You can see that the gd object is of class gameday and has four components: the gameId, the base MLBAM URL, the URLs for the five different XML files from which it gathers its information, and finally a data.frame that contains 62 variables for every play in the game. Let’s take a closer look at what is in this data.frame.

head(gd$ds)

  pitcherId batterId field_teamId ab_num inning   half balls strikes endOuts        event actionId
6    518774   276519          121      1      1    top     0       0       1       Flyout       NA
7    518774   276545          121      2      1    top     3       2       2    Groundout       NA
8    518774   400284          121      3      1    top     1       2       2 Hit By Pitch       NA
9    518774   502126          121      4      1    top     2       3       3    Strikeout       NA
1    424324   458913          143      5      1 bottom     1       2       1    Groundout       NA
2    424324   502517          143      6      1 bottom     0       2       2       Flyout       NA
                                                                           description stand throws
6                              Jimmy Rollins flies out to right fielder Marlon Byrd.       L      R
7 Michael Young grounds out, third baseman David Wright to first baseman Josh Satin.       R      R
8                                                          Chase Utley hit by pitch.       L      R
9                                                Domonic Brown strikes out swinging.       L      R
1   Eric Young grounds out, shortstop Jimmy Rollins to first baseman Kevin Frandsen.       R      L
2                      Daniel Murphy flies out softly to left fielder Domonic Brown.       L      L
                                      runnerMovement      x      y game_type home_team home_teamId home_lg
6                                                    172.69  85.34         R       nyn         121      NL
7                                                    103.41 163.65         R       nyn         121      NL
8                         [400284::1B::Hit By Pitch]     NA     NA         R       nyn         121      NL
9 [400284:1B:2B::Passed Ball][400284:2B:::Strikeout]     NA     NA         R       nyn         121      NL
1                                                    106.43 152.61         R       nyn         121      NL
2                                                    112.45 115.46         R       nyn         121      NL
  away_team away_teamId away_lg venueId    stadium           timestamp playerId.C playerId.1B playerId.2B
6       phi         143      NL    3289 Citi Field 2013-07-21 17:11:38     407833      543744      502517
7       phi         143      NL    3289 Citi Field 2013-07-21 17:12:39     407833      543744      502517
8       phi         143      NL    3289 Citi Field 2013-07-21 17:15:05     407833      543744      502517
9       phi         143      NL    3289 Citi Field 2013-07-21 17:17:03     407833      543744      502517
1       phi         143      NL    3289 Citi Field 2013-07-21 17:21:34     456124      435623      400284
2       phi         143      NL    3289 Citi Field 2013-07-21 17:23:55     456124      435623      400284
  playerId.3B playerId.SS playerId.LF playerId.CF playerId.RF batterPos batterName pitcherName runsOnPlay
6      431151      435560      458913      501571      407781        SS    Rollins      Harvey          0
7      431151      435560      458913      501571      407781        3B   Young, M      Harvey          0
8      431151      435560      458913      501571      407781        2B      Utley      Harvey          0
9      431151      435560      458913      501571      407781        LF   Brown, D      Harvey          0
1      276545      276519      502126      460055      430321        LF   Young, E     Lee, Cl          0
2      276545      276519      502126      460055      430321        2B Murphy, Dn     Lee, Cl          0
  startOuts runsInInning runsITD runsFuture start1B start2B start3B  end1B end2B end3B outsInInning startCode
6         0            0       0          0    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
7         1            0       0          0    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
8         2            0       0          0    <NA>    <NA>    <NA> 400284  <NA>  <NA>            3         0
9         2            0       0          0  400284    <NA>    <NA>   <NA>  <NA>  <NA>            3         1
1         0            2       0          2    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
2         1            2       0          2    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
  endCode fielderId                         gameId isPA  isAB isHit isBIP     our.x     our.y        r    theta
6       0    407781 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE 119.01855 283.65796 307.6154 1.173521
7       0    431151 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -53.88154  88.22197 103.3747 2.119083
8       1        NA gid_2013_07_21_phimlb_nynmlb_1 TRUE FALSE FALSE FALSE        NA        NA       NA       NA
9       0        NA gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE FALSE        NA        NA       NA       NA
1       0    276519 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -46.34461 115.77418 124.7056 1.951563
2       0    502126 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -31.32067 208.48835 210.8278 1.719909
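One detail worth pointing out in the last few columns: r and theta look like the polar-coordinate form of our.x and our.y, that is, the distance and angle of the batted ball measured from the origin of that coordinate system. This is an inference from the printed values rather than a statement about how the package computes them. A quick check in base R, using the four balls in play shown above:

```r
# Batted-ball locations copied from the head(gd$ds) output above
hits <- data.frame(
  our.x = c(119.01855, -53.88154, -46.34461, -31.32067),
  our.y = c(283.65796, 88.22197, 115.77418, 208.48835)
)

# Standard Cartesian-to-polar conversion (assumed, not taken from openWAR's source)
hits$r <- sqrt(hits$our.x^2 + hits$our.y^2)
hits$theta <- atan2(hits$our.y, hits$our.x)  # angle in radians

print(hits)
```

The recomputed values should agree with the r and theta columns printed above to the displayed precision.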

There is a great deal of information collected here — it should be comparable to Retrosheet. We can do some simple things like pull out the scoring plays:

subset(gd$ds, runsOnPlay > 0, select = "description")

                                                                                                                                description
3                                        Play reviewed and stands as called: David Wright homers (15) on a line drive to left center field.
4                                                                                      Marlon Byrd homers (17) on a fly ball to left field.
26 Play reviewed and overturned: Juan Lagares homers (2) on a line drive to left center field.   David Wright scores.    Josh Satin scores.

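We can also compute a linescore for the game:

library(plyr)
# The away team (PHI) bats in the top half of each inning, the home team (NYM) in the bottom
ddply(gd$ds, ~inning, summarise,
      PHI = sum(ifelse(half == "top", runsOnPlay, 0)),
      NYM = sum(ifelse(half == "bottom", runsOnPlay, 0)))

  inning PHI NYM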
1      1   0   2
2      2   0   0
3      3   0   0
4      4   0   3
5      5   0   0
6      6   0   0
7      7   0   0
8      8   0   0
9      9   0   0
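If you would rather not depend on plyr, the same kind of linescore can be sketched in base R with xtabs(). The data frame below is a toy stand-in with invented values that reuses the column names inning, half, and runsOnPlay from gd$ds, so that the example runs without downloading anything.

```r
# Toy play-by-play rows (invented values; column names borrowed from gd$ds)
plays <- data.frame(
  inning     = c(1, 1, 1, 2, 2, 4, 4),
  half       = c("top", "bottom", "bottom", "top", "bottom", "top", "bottom"),
  runsOnPlay = c(0, 1, 1, 0, 0, 0, 3),
  stringsAsFactors = FALSE
)

# Sum runsOnPlay within each inning/half cell
linescore <- xtabs(runsOnPlay ~ inning + half, data = plays)
print(linescore)
```

Unlike the ddply() call, this labels the columns by half ("bottom" and "top") rather than by team, and innings with no rows in the data are simply absent.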

The final totals for each side follow the same pattern:

ddply(gd$ds, ~half, summarise, PA = sum(isPA), R = sum(runsOnPlay), H = sum(isHit))

    half PA R H
1 bottom 32 5 7
2    top 32 0 4

How about the basic pitching lines?

# One row per pitcher; IP is outs recorded divided by three
ddply(gd$ds, ~pitcherId, summarise,
      Name = pitcherName[1], BF = sum(isPA),
      IP = sum(endOuts - startOuts) / 3,
      H = sum(isHit), R = sum(runsOnPlay),
      BB = length(grep("Walk", event)),
      SO = length(grep("Strikeout", event)),
      HR = length(grep("Home Run", event)))

  pitcherId     Name BF IP H R BB SO HR
1    424324  Lee, Cl 26  6 7 5  1  6  3
2    425786 Atchison  7  2 1 0  0  2  0
3    449097 Papelbon  3  1 0 0  0  1  0
4    455374 Bastardo  3  1 0 0  0  2  0
5    518774   Harvey 25  7 3 0  0 10  0
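A small caveat about the IP column: it is a plain decimal (outs divided by three), so a pitcher who records 20 outs would show up as 6.667 rather than the 6.2 you would see in a box score. Every pitcher in this game happened to finish a whole number of innings, so the issue does not arise here. The helper below, outs_to_ip(), is a hypothetical name of our own and not part of openWAR; it is only a sketch of the conventional notation.

```r
# Hypothetical helper: format a count of outs in box-score notation,
# where the digit after the point is the number of extra outs (0, 1, or 2).
outs_to_ip <- function(outs) {
  paste0(outs %/% 3, ".", outs %% 3)
}

ip <- outs_to_ip(c(18, 21, 20))
print(ip)
```

The result is a character vector, so it is meant for display only; keep the raw out counts around for any arithmetic.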

This was not a bad day for Mr. Harvey. Now, you may see some discrepancies between the data that you download through openWAR and more authoritative sources. But based on our analysis, the fidelity of the data retrieved by openWAR is very good. We’ll verify this statement in a later post.

Clearly, there are a lot more interesting things that one can do with this data, but this is just a basic introduction. Next time we will explore openWAR’s ability to download multiple games’ worth of information.

9 responses

  1. Thanks for this. Right now I’m thinking about scraping spring training data, but I’m getting a funny error when installing the package, ending with:

    Quitting from lines 61-62 (intro.Rmd)
    Error: processing vignette ‘intro.Rmd’ failed with diagnostics:
    could not find function “bbplot”
    Execution halted
    Error: Command failed (1)

  2. Jared, thanks for pointing this out. “bbplot” was a deprecated function that has been replaced by “plot”. I just committed a fix for this, so if you install the package again it should work now.

  3. This looks like an exciting package! Thanks for all the effort! I’m trying to install this for R3.03 on a windows 7 system, but received the following
    “In addition: Warning message:
    package ‘Sxslt’ is not available (for R version 3.0.3)”. I get the same error when attempting this on an R3.02 installation.
    Any ideas?

    1. Dennis, I think you’d get that message if you tried to install Sxslt from CRAN. Did you use:

      install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")

      Ben, many thanks, I’m up and running!

  4. Hi, thanks for the work on this. I was just looking through the code on github, and it’s impressive. (As far as I can tell — I’m very amateurish with R. But what I can read looks careful and thorough.)

    I had one question about the code, and reliance on timestamps. For example, lines 395-396 in GameDay.R are this comment:

    # IMPORTANT: Have to sort the data frame just in case
    # Have to sort by timestamp here, NOT by ab_num!

    (And then of course, the sorting is done.) Does it ever happen that Gameday ab_num are out of order? And how reliable are the timestamps? I was looking at one game, gid_2014_04_06_minmlb_clemlb_1, at bat number 36, and it looks strange to me. Brian Dozier stole 2b on the second pitch, but the action timestamp for the steal is 182959, which is before even the first pitch (183203). The same is true for the final pitch (183111).

    I wondered if perhaps the pitches (at least) were out of order but had the correct timestamp, but that would make no sense: the latest timestamp is the penultimate pitch, but that’s a foul ball, and the ab ended in a K.

    tl;dr: I’m wondering how reliable these timestamps are, and if I’m just making a dumb mistake in how I’m reading the file. Anyone who knows more about this, I’d appreciate the info.

  5. What version of R are you using to run openWAR? I get the error message “package ‘Sxslt’ is not available (for R version 3.1.2)”.

  6. Seconded, Andy. I’m receiving the same error on 3.1.1.

      1. Does it have anything to do with the fact that I’m using windows?
        Session Info:
        > sessionInfo()
        R version 3.1.1 (2014-07-10)
        Platform: x86_64-w64-mingw32/x64 (64-bit)

        Error message:
        > install.packages('Sxslt', repos = 'http://www.omegahat.org/R', type = 'source')
        Installing package into ‘C:/~/Documents/R/win-library/3.1’
        (as ‘lib’ is unspecified)
        trying URL ‘http://www.omegahat.org/R/src/contrib/Sxslt_0.91-3.tar.gz’
        Content type ‘application/x-gzip’ length 200341 bytes (195 Kb)
        opened URL
        downloaded 195 Kb

        * installing *source* package ‘Sxslt’ …
        Please define LIB_XSLT
        Warning: running command ‘sh ./configure.win’ had status 1
        ERROR: configuration failed for package ‘Sxslt’
        * removing ‘C:/~/Documents/R/win-library/3.1/Sxslt’
        Warning in install.packages :
        running command ‘”C:/PROGRA~1/R/R-31~1.1/bin/x64/R” CMD INSTALL -l “C:\~\Documents\R\win-library\3.1” C:\~\AppData\Local\Temp\RtmpKKHayX/downloaded_packages/Sxslt_0.91-3.tar.gz’ had status 1
        Warning in install.packages :
        installation of package ‘Sxslt’ had non-zero exit status

        The downloaded source packages are in
        ‘C:\~\AppData\Local\Temp\RtmpKKHayX\downloaded_packages’
