Hello, world! I’m very excited to be taking part in this blog, and am looking forward to sharing my enthusiasm for two great things that are great together: baseball and R!
In this initial post, I’m going to introduce you to a new R package that I have been developing with Greg Matthews, a biostatistician at UMass. The package is called openWAR, because our goal is to produce a fully open-source, reproducible version of Wins Above Replacement (WAR). Our paper on the subject is on the arXiv, but even if this doesn’t interest you, you might still be interested in the package, because it can do a lot of things that aren’t related to WAR at all.
First, the package contains functions for downloading and processing the XML files that power the MLBAM GameDay web application. Carson Sievert has written a similar package called pitchRx for the PITCHf/x data, but openWAR works with the play-by-play data — not the pitch-by-pitch data. Although this data is not “free as in freedom”, it is “free as in beer.”
Installing openWAR
openWAR is not yet on CRAN, but it is on GitHub. Currently, openWAR relies on Duncan Temple Lang’s Sxslt package, which provides XSLT functionality from within R, and this leads to a particularly elegant method of transforming the raw XML files from MLBAM into nice data frames in R. Unfortunately, this package is not on CRAN either, but rather is hosted by Omega Hat. You can install it using the repos argument:
install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")
Depending on your operating system, you may need to install basic XSLT functionality, which will take place outside of R. Please see the Sxslt installation instructions for more details on how to do this.
Next, installing openWAR is best accomplished through the install_github() function in the devtools package.
require(devtools)
install_github("beanumber/openWAR")
Accessing MLBAM data
The base class in openWAR is called gameday, and it collects information about a single major league game (in principle, minor league games could be included just as easily, but right now the parsers will only download major league data). An object of class gameday can be created if you know the MLBAM ID for the game you want to investigate. How are you supposed to know this ID? We’ve written a function that will figure this out for you.
Let’s say that you want the games that were played on July 21st, 2013. We can ask for the list of game IDs:
require(openWAR)
getGameIds(date = as.Date("2013-07-21"))
Retrieving data from 2013-07-21 ...
...found 15 games
 [1] "gid_2013_07_21_arimlb_sfnmlb_1" "gid_2013_07_21_atlmlb_chamlb_1" "gid_2013_07_21_balmlb_texmlb_1"
 [4] "gid_2013_07_21_chnmlb_colmlb_1" "gid_2013_07_21_clemlb_minmlb_1" "gid_2013_07_21_detmlb_kcamlb_1"
 [7] "gid_2013_07_21_lanmlb_wasmlb_1" "gid_2013_07_21_miamlb_milmlb_1" "gid_2013_07_21_nyamlb_bosmlb_1"
[10] "gid_2013_07_21_oakmlb_anamlb_1" "gid_2013_07_21_phimlb_nynmlb_1" "gid_2013_07_21_pitmlb_cinmlb_1"
[13] "gid_2013_07_21_sdnmlb_slnmlb_1" "gid_2013_07_21_seamlb_houmlb_1" "gid_2013_07_21_tbamlb_tormlb_1"
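The IDs above appear to follow a regular pattern: the date, then the away and home team codes (each followed by the suffix mlb), then a game number. The sketch below splits an ID on that assumption. Note that parseGameId is a hypothetical helper written for illustration, not a function in openWAR, and the layout is inferred only from the IDs printed above.

```r
# Hypothetical helper (not part of openWAR): split a GameDay ID into its pieces.
# Assumes the layout gid_YYYY_MM_DD_<away>mlb_<home>mlb_<game number>,
# with three-letter team codes, as in every ID shown above.
parseGameId <- function(gameId) {
  parts <- strsplit(gameId, "_", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7, parts[1] == "gid")
  list(
    date = as.Date(paste(parts[2:4], collapse = "-")),
    away = substr(parts[5], 1, 3),   # first team code is the visiting team
    home = substr(parts[6], 1, 3),   # second team code is the home team
    gameNum = as.integer(parts[7])
  )
}

info <- parseGameId("gid_2013_07_21_phimlb_nynmlb_1")
str(info)
```

This can be handy for filtering the vector returned by getGameIds() down to the games involving one team.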
Since Jim is a Phillies fan, let’s investigate the Mets-Phillies game that was played on that date.
gd = gameday(gameId = "gid_2013_07_21_phimlb_nynmlb_1")
summary(gd)
class(gd)
       Length Class      Mode
gameId  1     -none-     character
base    1     -none-     character
url     5     -none-     character
ds     62     data.frame list
[1] "gameday"
You can see the gd object is of class gameday, and has four components: the gameId, the base MLBAM URL, URLs for the five different XML files from which it gathers its information, and finally a data.frame that contains 62 variables for every play in the game. Let’s take a closer look at what is in this data.frame.
head(gd$ds)
  pitcherId batterId field_teamId ab_num inning   half balls strikes endOuts        event actionId
6    518774   276519          121      1      1    top     0       0       1       Flyout       NA
7    518774   276545          121      2      1    top     3       2       2    Groundout       NA
8    518774   400284          121      3      1    top     1       2       2 Hit By Pitch       NA
9    518774   502126          121      4      1    top     2       3       3    Strikeout       NA
1    424324   458913          143      5      1 bottom     1       2       1    Groundout       NA
2    424324   502517          143      6      1 bottom     0       2       2       Flyout       NA
  description                                                                        stand throws
6 Jimmy Rollins flies out to right fielder Marlon Byrd.                                  L      R
7 Michael Young grounds out, third baseman David Wright to first baseman Josh Satin.     R      R
8 Chase Utley hit by pitch.                                                              L      R
9 Domonic Brown strikes out swinging.                                                    L      R
1 Eric Young grounds out, shortstop Jimmy Rollins to first baseman Kevin Frandsen.       R      L
2 Daniel Murphy flies out softly to left fielder Domonic Brown.                          L      L
                                      runnerMovement      x      y game_type home_team home_teamId home_lg
6                                                    172.69  85.34         R       nyn         121      NL
7                                                    103.41 163.65         R       nyn         121      NL
8                         [400284::1B::Hit By Pitch]     NA     NA         R       nyn         121      NL
9 [400284:1B:2B::Passed Ball][400284:2B:::Strikeout]     NA     NA         R       nyn         121      NL
1                                                    106.43 152.61         R       nyn         121      NL
2                                                    112.45 115.46         R       nyn         121      NL
  away_team away_teamId away_lg venueId    stadium           timestamp playerId.C playerId.1B playerId.2B
6       phi         143      NL    3289 Citi Field 2013-07-21 17:11:38     407833      543744      502517
7       phi         143      NL    3289 Citi Field 2013-07-21 17:12:39     407833      543744      502517
8       phi         143      NL    3289 Citi Field 2013-07-21 17:15:05     407833      543744      502517
9       phi         143      NL    3289 Citi Field 2013-07-21 17:17:03     407833      543744      502517
1       phi         143      NL    3289 Citi Field 2013-07-21 17:21:34     456124      435623      400284
2       phi         143      NL    3289 Citi Field 2013-07-21 17:23:55     456124      435623      400284
  playerId.3B playerId.SS playerId.LF playerId.CF playerId.RF batterPos batterName pitcherName runsOnPlay
6      431151      435560      458913      501571      407781        SS    Rollins      Harvey          0
7      431151      435560      458913      501571      407781        3B   Young, M      Harvey          0
8      431151      435560      458913      501571      407781        2B      Utley      Harvey          0
9      431151      435560      458913      501571      407781        LF   Brown, D      Harvey          0
1      276545      276519      502126      460055      430321        LF   Young, E     Lee, Cl          0
2      276545      276519      502126      460055      430321        2B Murphy, Dn     Lee, Cl          0
  startOuts runsInInning runsITD runsFuture start1B start2B start3B  end1B end2B end3B outsInInning startCode
6         0            0       0          0    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
7         1            0       0          0    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
8         2            0       0          0    <NA>    <NA>    <NA> 400284  <NA>  <NA>            3         0
9         2            0       0          0  400284    <NA>    <NA>   <NA>  <NA>  <NA>            3         1
1         0            2       0          2    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
2         1            2       0          2    <NA>    <NA>    <NA>   <NA>  <NA>  <NA>            3         0
  endCode fielderId                         gameId isPA  isAB isHit isBIP     our.x     our.y        r    theta
6       0    407781 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE 119.01855 283.65796 307.6154 1.173521
7       0    431151 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -53.88154  88.22197 103.3747 2.119083
8       1        NA gid_2013_07_21_phimlb_nynmlb_1 TRUE FALSE FALSE FALSE        NA        NA       NA       NA
9       0        NA gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE FALSE        NA        NA       NA       NA
1       0    276519 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -46.34461 115.77418 124.7056 1.951563
2       0    502126 gid_2013_07_21_phimlb_nynmlb_1 TRUE  TRUE FALSE  TRUE -31.32067 208.48835 210.8278 1.719909
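The last four columns look like two views of the same batted-ball location: r and theta match the polar form of (our.x, our.y). Here is a quick check on the values printed above. It assumes that r is the straight-line distance from the origin and that theta is the angle in radians as returned by atan2(); that relationship is inferred from the numbers, not taken from the openWAR source.

```r
# Values copied from the first two rows of head(gd$ds) above.
our.x <- c(119.01855, -53.88154)
our.y <- c(283.65796, 88.22197)

# Assumed relationship (inferred from the printed values, not from the package):
r <- sqrt(our.x^2 + our.y^2)   # distance from the origin
theta <- atan2(our.y, our.x)   # angle in radians, counterclockwise from the x-axis

round(cbind(r, theta), 4)
```

If the assumption holds, the result should agree with the r and theta columns to within rounding.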
There is a great deal of information collected here — it should be comparable to Retrosheet. We can do some simple things like pull out the scoring plays:
subset(gd$ds, runsOnPlay > 0, select="description")
   description
3  Play reviewed and stands as called: David Wright homers (15) on a line drive to left center field.
4  Marlon Byrd homers (17) on a fly ball to left field.
26 Play reviewed and overturned: Juan Lagares homers (2) on a line drive to left center field. David Wright scores. Josh Satin scores.
We can also compute a line score for the game:
require(plyr)
ddply(gd$ds, ~inning, summarise,
      PHI = sum(ifelse(half == "top", runsOnPlay, 0)),
      NYM = sum(ifelse(half == "bottom", runsOnPlay, 0)))
  inning PHI NYM
1      1   0   2
2      2   0   0
3      3   0   0
4      4   0   3
5      5   0   0
6      6   0   0
7      7   0   0
8      8   0   0
9      9   0   0
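If you would rather not load plyr, the same split-and-sum can be done in base R. The sketch below uses a small hand-built data frame in place of gd$ds, so it runs without downloading anything: it holds the three scoring plays (with innings taken from the line score above) plus one made-up scoreless play in the top of the first. The name plays is just a stand-in for illustration.

```r
# Toy stand-in for gd$ds: only the columns the line score needs.
plays <- data.frame(
  inning = c(1, 1, 1, 4),
  half = c("top", "bottom", "bottom", "bottom"),
  runsOnPlay = c(0, 1, 1, 3)
)

# Sum runs by inning and half-inning; xtabs() fills empty cells with 0.
linescore <- xtabs(runsOnPlay ~ inning + half, data = plays)
linescore

# Final score by half-inning (bottom = home team, top = visiting team).
colSums(linescore)
```

On the real data, replacing plays with gd$ds should give the same runs per inning as the ddply() call, arranged with one column per half-inning.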
Or the final totals:
ddply(gd$ds, ~half, summarise, PA = sum(isPA), R = sum(runsOnPlay), H = sum(isHit))
    half PA R H
1 bottom 32 5 7
2    top 32 0 4
How about the basic pitching lines:
ddply(gd$ds, ~pitcherId, summarise,
      Name = pitcherName[1],
      BF = sum(isPA),
      IP = sum(endOuts - startOuts) / 3,
      H = sum(isHit),
      R = sum(runsOnPlay),
      BB = length(grep("Walk", event)),
      SO = length(grep("Strikeout", event)),
      HR = length(grep("Home Run", event)))
  pitcherId     Name BF IP H R BB SO HR
1    424324  Lee, Cl 26  6 7 5  1  6  3
2    425786 Atchison  7  2 1 0  0  2  0
3    449097 Papelbon  3  1 0 0  0  1  0
4    455374 Bastardo  3  1 0 0  0  2  0
5    518774   Harvey 25  7 3 0  0 10  0
This was not a bad day for Mr. Harvey. Now, you may see some discrepancies between the data that you download through openWAR and more authoritative sources. But based on our analysis, the fidelity of the data retrieved by openWAR is very good. We’ll verify this statement in a later post.
Clearly, there are a lot more interesting things that one can do with this data, but this is just a basic introduction. Next time we will explore openWAR’s ability to download multiple games’ worth of information.
Thanks for this. Right now I’m thinking about scraping spring training data, but I’m getting a funny error when installing the package, ending with:
Quitting from lines 61-62 (intro.Rmd)
Error: processing vignette ‘intro.Rmd’ failed with diagnostics:
could not find function “bbplot”
Execution halted
Error: Command failed (1)
Jared, thanks for pointing this out. “bbplot” was a deprecated function that has been replaced by “plot”. I just committed a fix for this, so if you install the package again it should work now.
This looks like an exciting package! Thanks for all the effort! I’m trying to install this for R 3.0.3 on a Windows 7 system, but received the following:
“In addition: Warning message:
package ‘Sxslt’ is not available (for R version 3.0.3)”. I get the same error when attempting this on an R 3.0.2 installation.
Any ideas?
Dennis, I think you’d get that message if you tried to install Sxslt from CRAN. Did you use:
install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")
Ben, many thanks, I’m up and running!
Hi, thanks for the work on this. I was just looking through the code on github, and it’s impressive. (As far as I can tell — I’m very amateurish with R. But what I can read looks careful and thorough.)
I had one question about the code, and reliance on timestamps. For example, lines 395-396 in GameDay.R are this comment:
# IMPORTANT: Have to sort the data frame just in case
# Have to sort by timestamp here, NOT by ab_num!
(And then of course, the sorting is done.) Does it ever happen that Gameday ab_num are out of order? And how reliable are the timestamps? I was looking at one game, gid_2014_04_06_minmlb_clemlb_1, at bat number 36, and it looks strange to me. Brian Dozier stole 2b on the second pitch, but the action timestamp for the steal is 182959, which is before even the first pitch (183203). The same is true for the final pitch (183111).
I wondered if perhaps the pitches (at least) were out of order but had the correct timestamp, but that would make no sense: the latest timestamp is the penultimate pitch, but that’s a foul ball, and the at-bat ended in a K.
tl;dr: I’m wondering how reliable these timestamps are, and if I’m just making a dumb mistake in how I’m reading the file. Anyone who knows more about this, I’d appreciate the info.
What version of R are you using to run openWAR? I get the error message “package ‘Sxslt’ is not available (for R version 3.1.2)”.
Seconded, Andy. I’m receiving the same error on R 3.1.1.
Does this solve your issue?
https://github.com/beanumber/openWAR/issues/23
Does it have anything to do with the fact that I’m using windows?
Session Info:
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Error message:
> install.packages('Sxslt', repos = 'http://www.omegahat.org/R', type = 'source')
Installing package into ‘C:/~/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
trying URL ‘http://www.omegahat.org/R/src/contrib/Sxslt_0.91-3.tar.gz’
Content type ‘application/x-gzip’ length 200341 bytes (195 Kb)
opened URL
downloaded 195 Kb
* installing *source* package ‘Sxslt’ ...
Please define LIB_XSLT
Warning: running command ‘sh ./configure.win’ had status 1
ERROR: configuration failed for package ‘Sxslt’
* removing ‘C:/~/Documents/R/win-library/3.1/Sxslt’
Warning in install.packages :
running command ‘"C:/PROGRA~1/R/R-31~1.1/bin/x64/R" CMD INSTALL -l "C:\~\Documents\R\win-library\3.1" C:\~\AppData\Local\Temp\RtmpKKHayX/downloaded_packages/Sxslt_0.91-3.tar.gz’ had status 1
Warning in install.packages :
installation of package ‘Sxslt’ had non-zero exit status
The downloaded source packages are in
‘C:\~\AppData\Local\Temp\RtmpKKHayX\downloaded_packages’