In this blog, we’ve talked about a number of helpful R packages for doing baseball work. Notable packages include
- The Lahman package that provides all the season-to-season stats for teams, pitchers, and batters.
- The pitchRx package for scraping pitchFX data.
- The openWAR package for scraping MLBAM GameDay data and computing WAR statistics.
Bill Petti has a new package, baseballr, that is available on his GitHub site. In this blog post, I’ll give a first look at this package, explaining how it extends the functionality of the earlier packages in one’s analysis of baseball data.
Scraping FanGraphs Data
One can download FanGraphs data directly from the website, but the baseballr package has some functions for importing FanGraphs data directly into R. For example,
library(baseballr)
d <- fg_bat_leaders(2016, 2016)
will download batting data for all batting leaders with qualifying at-bats. If you want data for all batters, say with 100 or more AB, you use
d <- fg_bat_leaders(2016, 2016, qual=100)
This is attractive since you get a ton of different batting measures, including the Advanced, Batted Ball, Pitch Type, and Plate Discipline variables in FanGraphs.
There are other FanGraphs functions as well: fg_guts() downloads wOBA and FIP constants and coefficients (this is useful for my own work), fg_park() downloads park factors, and fg_park_hand() is supposed to download park factors for each batting side (this didn’t seem to work for me).
Downloading Baseball Savant Data
There are also functions for downloading StatCast data available through Baseball Savant into R. For example, suppose I want to collect batting data for Mike Trout’s 2016 season. I first find Trout’s MLBAM ID (the baseballr function playerid_lookup() is helpful for this), and then I type
S <- scrape_statcast_savant_batter("2016-04-01", "2016-09-30", batterid = 545361)
This gives pitch-by-pitch data for every pitch thrown to Mike Trout in the 2016 season. This includes all of the pitchFX data (speed of pitch, break, location, etc.) and some StatCast variables such as hit speed and hit angle for the balls put into play.
Here I graph the hit angle against the hit speed for all of Trout’s balls put in play. There is a new variable “barrel” that MLB sets to 1 for a well-struck ball and 0 otherwise. We see from the graph that barrel = 1 corresponds approximately to a hit speed of at least 100 mph where the hit angle is not too small.
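To make that rough reading of the graph concrete, here is a minimal sketch that writes it as a rule and compares it with the barrel flag. Everything in it is an assumption for illustration: the toy data frame, the column names hit_speed and hit_angle, the helper approx_barrel(), and especially the 15-degree angle cutoff, which is just a guess at "not too small". It is not MLB's actual barrel definition.

```r
# Toy balls-in-play data standing in for the scraped data frame.
# Column names (hit_speed, hit_angle, barrel) are assumed for illustration.
bip <- data.frame(
  hit_speed = c(85, 92, 101, 104, 108, 99, 110, 103),
  hit_angle = c(12, 30, 25, 28, -5, 27, 20, 5),
  barrel    = c(0, 0, 1, 1, 0, 0, 1, 0)
)

# Hypothetical rule of thumb: a hit speed of at least 100 mph
# together with a hit angle above some cutoff (15 degrees is a guess).
approx_barrel <- function(speed, angle, min_speed = 100, min_angle = 15) {
  as.integer(speed >= min_speed & angle >= min_angle)
}

bip$approx <- approx_barrel(bip$hit_speed, bip$hit_angle)

# Cross-tabulate the barrel flag against the rule of thumb
agreement <- table(mlb = bip$barrel, approx = bip$approx)
print(agreement)
```

With real data, the off-diagonal cells of this table show where the simple rule and the barrel flag disagree, which is a quick way to tune the two cutoffs.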
Computation of Statistics
There are also a number of functions for computing various baseball measures. One function I see as being immediately useful is woba_plus(), which computes the wOBA measure given basic counts data. To illustrate, I use the Lahman package to collect basic season statistics for the 2015 hitters with at least 400 at-bats. Then I use the woba_plus() function on the data frame. I output the head of the result, which shows the top 6 hitters with respect to the wOBA statistic for the 2015 season.
library(Lahman)
library(dplyr)  # provides filter(), group_by(), summarize(), select()
# Season totals for each 2015 batter, using the column names woba_plus() expects
df2015 <- summarize(group_by(filter(Batting, yearID == 2015), playerID),
                    season = first(yearID),
                    AB = sum(AB), uBB = sum(BB),
                    X1B = sum(H - X2B - X3B - HR),
                    X2B = sum(X2B), X3B = sum(X3B),
                    HBP = sum(HBP), SF = sum(SF), HR = sum(HR),
                    SH = sum(SH), SO = sum(SO))
# Keep hitters with at least 400 at-bats
df2015_q <- filter(df2015, AB >= 400)
woba <- woba_plus(df2015_q)
head(select(woba, playerID, wOBA, wOBA_CON))
# A tibble: 6 × 3
   playerID  wOBA wOBA_CON
      <chr> <dbl>    <dbl>
1 harpebr03 0.466    0.554
2 vottojo01 0.433    0.485
3 goldspa01 0.429    0.517
4 cabremi01 0.421    0.462
5 troutmi01 0.420    0.519
6  cruzne02 0.400    0.511
Here’s a histogram of the 2015 wOBAs. Bryce Harper, at 0.466, is the one outlier at the high end.
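For readers curious about what goes into these two columns, here is a minimal sketch of the usual linear-weights calculation. The helper woba_sketch(), the toy player, and the weight values are all assumptions for illustration: the weights are round numbers of a plausible size, not the official season constants (those are what fg_guts() downloads), and I have not checked that woba_plus() uses exactly these denominators.

```r
# Illustrative linear weights; assumed round values, not official constants
w <- list(wBB = 0.69, wHBP = 0.72, w1B = 0.88,
          w2B = 1.25, w3B = 1.59, wHR = 2.07)

# Hypothetical helper: adds wOBA and wOBA on contact to a data frame of counts
woba_sketch <- function(df, w) {
  # Weighted value of the hits, shared by both measures
  hit_value <- w$w1B * df$X1B + w$w2B * df$X2B +
    w$w3B * df$X3B + w$wHR * df$HR
  # wOBA: add walks and hit-by-pitches, divide by AB + uBB + HBP + SF
  df$wOBA <- (w$wBB * df$uBB + w$wHBP * df$HBP + hit_value) /
    (df$AB + df$uBB + df$HBP + df$SF)
  # wOBA on contact: hits only, divided by AB - SO + SF
  df$wOBA_CON <- hit_value / (df$AB - df$SO + df$SF)
  df
}

# One made-up batting line with the same column names used above
toy <- data.frame(playerID = "toy01", AB = 500, uBB = 50, HBP = 5, SF = 5,
                  X1B = 100, X2B = 30, X3B = 5, HR = 25, SO = 100)
res <- woba_sketch(toy, w)
print(res[, c("playerID", "wOBA", "wOBA_CON")])
```

The contact version drops walks and hit-by-pitches from the numerator and strikeouts from the denominator, which is why wOBA_CON sits well above wOBA for every hitter in the table.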
Other Statistics?
There are other functions for computing other statistics such as the percentage of pitches thrown to different edges of the strike zone, FIP metrics, and measures of team consistency. This is a package under development, so I expect new functions will be added in the future.
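As one example of what such a function computes, here is a minimal sketch of the standard FIP formula. The helper fip_sketch() and the pitching line are hypothetical, and the additive constant of 3.10 is only a placeholder for the season-specific value among the FIP constants that fg_guts() downloads. I am not claiming this matches the package's own FIP functions line for line.

```r
# Hypothetical helper: standard FIP formula with an assumed constant
fip_sketch <- function(HR, BB, HBP, SO, IP, c_fip = 3.10) {
  # Home runs count against the pitcher most heavily, walks and
  # hit-by-pitches less so, and strikeouts count in his favor
  (13 * HR + 3 * (BB + HBP) - 2 * SO) / IP + c_fip
}

# One made-up pitching line
toy_fip <- fip_sketch(HR = 20, BB = 50, HBP = 5, SO = 200, IP = 200)
print(toy_fip)
```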
Summing Up
On my first look, I think the baseballr package is a nice addition to what is currently available for baseball work in R. It allows easy scraping of FanGraphs and Baseball Savant data, which will facilitate interesting explorations. I also like that Bill is making his work available and open for inspection by adding these extra computation functions. I encourage other folks to create R packages. Building an R package is easier with the RStudio interface, and it might be worth posting about this in the future.