Over the years, I have found the baseballr package very helpful in obtaining Statcast data from Baseball Savant. Recently, this package has been updated (version 1.5.0) to download baseball data from a variety of sources, including some that I have not used, so I thought it would be helpful to illustrate some of its many useful functions. Since the baseballr package has many capabilities beyond just retrieving data, this might be the first of a number of posts describing its features.
The baseballr package allows convenient scraping of data from five sources: Baseball Reference, FanGraphs, MLB, Retrosheet and Statcast.
The home page of baseballr describes how to install the package from CRAN or install the development version from the authors’ GitHub site.
Baseball Reference Data
baseballr has a number of functions for downloading data from the popular Baseball Reference site. For convenience, all of these functions are prefixed with “bref”. In particular, the bref_daily_batter() function downloads batting stats for all players between two dates in a season. For example, the following code downloads batting stats for games played between May 10 and June 20 of 2021.
brdata <- bref_daily_batter(t1="2021-05-10", t2="2021-06-20")
A similar function, bref_daily_pitcher(), downloads pitching stats for all pitchers who pitched between two dates.
brdata2 <- bref_daily_pitcher(t1="2021-05-10", t2="2021-06-20")
These functions retrieve the traditional batting and pitching stats for all players.
The package also has a number of functions, prefixed with “fg”, that download data from the popular FanGraphs site.
Suppose, for example, that we’re interested in obtaining the game-by-game batting stats for Mike Trout in the 2022 season. First we have to figure out the player id used by FanGraphs. The function playerid_lookup() obtains the player’s ids from the Chadwick Bureau’s public register. I extract the FanGraphs id and then use it as an input to the function fg_batter_game_logs(), which downloads the game-by-game stats for Trout.
id <- playerid_lookup("Trout", "Mike")
fg_id <- id$fangraphs_id
fg_data_trout <- fg_batter_game_logs(playerid = fg_id, year = 2022)
By the way, this returns a remarkable 239 variables collected about Trout for each of the 119 games he played in the 2022 season.
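A name lookup can come back with no match, or with several rows when players share a name, so it may be worth checking the result before passing the id along. Here is a minimal sketch of such a check. The helper single_fg_id() is hypothetical (it is not part of baseballr), and the small data frame is a made-up stand-in for a lookup result; only the fangraphs_id column name comes from the code above.

```r
# Hypothetical helper (not part of baseballr): return the single FanGraphs id
# in a lookup result, stopping when the match is missing or ambiguous.
single_fg_id <- function(lookup) {
  ids <- unique(lookup$fangraphs_id[!is.na(lookup$fangraphs_id)])
  if (length(ids) != 1) {
    stop("Expected exactly one FanGraphs id, found ", length(ids))
  }
  ids
}

# Made-up stand-in for a lookup result; real rows come from playerid_lookup().
lookup_demo <- data.frame(fangraphs_id = c(NA, 123L))
print(single_fg_id(lookup_demo))
```

With real data, one would call single_fg_id(id) in place of id$fangraphs_id.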
Suppose I’m interested in stats for each of Aaron Nola’s starts during the 2022 season. I use the same lookup to retrieve Nola’s FanGraphs id, and then the function fg_pitcher_game_logs() retrieves the game-by-game pitching stats.
id <- playerid_lookup("Nola", "Aaron")
fg_id <- id$fangraphs_id
fg_data_nola <- fg_pitcher_game_logs(playerid = fg_id, year = 2022)
We have all of the interesting FanGraphs measures collected for each of Nola’s games. As an example, here is a histogram of the zone percentages for the 32 Nola starts. In most of Nola’s starts, he placed 40-50% of his pitches within the zone.
Another useful function is fg_batter_leaders(), which collects stats for all hitters across several seasons. Here I am collecting the FanGraphs batting measures for all “leaders” for the 2021 and 2022 seasons. The output data frame has 262 player-seasons and 289 batting measures.
b_leaders <- fg_batter_leaders(2021, 2022)
One noteworthy feature of the baseballr package is the ability to download data from the MLB feed. A variety of functions are available, all starting with “mlb”. One function that was of interest to me was mlb_pbp(), which retrieves pitch-by-pitch data for a minor or major league game of interest. To use it, one needs to know the game id, game_pk. The function mlb_game_pks() gives the game_pk values for all MLB games played on a particular day.
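Once the day's schedule is in hand, picking out one team's game is a simple filter. The sketch below is only illustrative: find_games() is a hypothetical helper, the column names teams.away.team.name and teams.home.team.name are my assumption about the schedule output (check names() on the real data frame first), and the demo data frame is made up so that the code runs without a network call.

```r
# Hypothetical helper: keep the rows of a schedule data frame in which `team`
# appears as either the away or the home club.
# Assumption: the team-name columns are called teams.away.team.name and
# teams.home.team.name; verify with names() on the real output.
find_games <- function(games, team) {
  away <- grepl(team, games$teams.away.team.name, fixed = TRUE)
  home <- grepl(team, games$teams.home.team.name, fixed = TRUE)
  games[away | home, , drop = FALSE]
}

# Made-up stand-in for a day's schedule (ids and team names are not real).
games_demo <- data.frame(
  game_pk = c(1L, 2L),
  teams.away.team.name = c("Team A", "Team C"),
  teams.home.team.name = c("Team B", "Team D")
)
print(find_games(games_demo, "Team B")$game_pk)

# With baseballr loaded, real usage might look like this (untested here):
# games <- mlb_game_pks("2023-03-05")
# find_games(games, "Mets")$game_pk
```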
I figured out that the game_pk value for the recent spring training game on March 5 between the Mets and the Cardinals was 719263, and the mlb_pbp() function collects pitch-by-pitch data for this game.
d2 <- mlb_pbp(game_pk = 719263)
I am not completely familiar with all of the 148 variables in this data frame, but the startTime and endTime variables give the starting and ending time for each pitch, so it would be straightforward to explore the effect of the new MLB pitch clock rule. Here are two rows of these two variables corresponding to two pitches.
d2[1:2, c("startTime", "endTime")]
── MLB Play-by-Play data from MLB.com ────── baseballr 1.5.0 ──
ℹ Data updated: 2023-03-06 20:04:51 EST
# A tibble: 2 × 2
  startTime                endTime
  <chr>                    <chr>
1 2023-03-05T20:31:34.676Z 2023-03-05T20:31:39.667Z
2 2023-03-05T20:31:15.772Z 2023-03-05T20:31:19.462Z
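Since these times are stored as character strings, they need to be parsed before any timing work. Here is a minimal base R sketch using the two rows printed above. The helper parse_utc() is my own, and treating row 2 as the pitch immediately before row 1 is an assumption based only on the timestamps, so the "gap" value is illustrative.

```r
# Timestamps copied from the two rows printed above.
pitches <- data.frame(
  startTime = c("2023-03-05T20:31:34.676Z", "2023-03-05T20:31:15.772Z"),
  endTime   = c("2023-03-05T20:31:39.667Z", "2023-03-05T20:31:19.462Z")
)

# Parse ISO 8601 strings with fractional seconds; the trailing "Z" means UTC.
parse_utc <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
}

start <- parse_utc(pitches$startTime)
end   <- parse_utc(pitches$endTime)

# Seconds between the recorded start and end of each pitch.
pitches$duration_sec <- as.numeric(difftime(end, start, units = "secs"))

# Seconds from the end of the earlier pitch (row 2) to the start of the
# later pitch (row 1), assuming the two rows are consecutive pitches.
gap_sec <- as.numeric(difftime(start[1], end[2], units = "secs"))

print(round(pitches$duration_sec, 3))
print(round(gap_sec, 3))
```

The same parsing applied to the full d2 data frame would give a duration for every pitch in the game.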
Retrosheet Play-by-Play data
I’ve written about downloading Retrosheet play-by-play data by using the Chadwick files and a special R function. The process is even easier using the function retrosheet_data() in the baseballr package. Suppose I’m interested in obtaining all the Retrosheet data for the seasons 2020 through 2022. Just type
d <- retrosheet_data(years_to_acquire = 2020:2022)
d is a list with three elements corresponding to the three seasons of interest. d[[1]] (the first element of the list) is also a list with two components. The events component is our familiar Retrosheet data with an extra variable, year, indicating the season.
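To work with several seasons at once, the events components can be pulled out of the list and stacked into one data frame. The sketch below uses a small made-up list, d_demo, with the shape described above, so it runs without downloading anything; the event_id column and all values are invented for illustration, and stacking with rbind() assumes each season's events have the same columns.

```r
# Made-up stand-in with the shape described above: one list element per
# season, each holding an events data frame with a year column.
# Real output would come from retrosheet_data().
d_demo <- list(
  list(events = data.frame(event_id = 1:2, year = 2020)),
  list(events = data.frame(event_id = 1:3, year = 2021)),
  list(events = data.frame(event_id = 1:4, year = 2022))
)

# Pull the events component out of every season and stack the rows.
all_events <- do.call(rbind, lapply(d_demo, function(season) season$events))

# Number of events per season.
print(table(all_events$year))
```

Replacing d_demo with d applies the same steps to the downloaded data.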
It is important to note that this function assumes that the user has already installed the Chadwick tools. I have a Mac Intel laptop and the Chadwick tools were installed, so this function worked.
As in earlier versions of baseballr, the package allows easy access to Statcast data, but the function syntax has changed. Suppose you wish to acquire the data for the spring training games from March 5 to March 6 this season. Type
sc_data <- statcast_search(start_date = "2023-03-05", end_date = "2023-03-06", player_type = 'batter')
Also, the functions statcast_search_batters() and statcast_search_pitchers() allow accessing data for an individual batter or pitcher, respectively. The function statcast_leaderboards() gives access to leaderboards published on Baseball Savant.
Try it Out!
The baseballr package has been of interest to me for a number of years, and the recent additions to the package look very exciting. The easy availability of baseball data from different sources makes it convenient for the reader to address any baseball question of interest. For example, now that I know that I can collect the time at the beginning and end of each pitch, I am interested in doing a study exploring the changes in lengths of baseball games due to the pitch clock rule.
- This package contains much more than these data acquisition functions, and I will likely discuss other uses of its functions in future posts.
- On my GitHub Gist site, you can find an R script including all of the examples in this post.
- For more information, I encourage the interested reader to visit the baseballr package home page.
I agree Jim! The baseballr package is very good right now.
I’ve used it in the past while trying to create functions to visualize the stats that we can scrape with it.
Nevertheless, I confess I need to improve my R skills in order to polish the pull requests I’ve submitted to the package.
I’ve created some isolated functions to plot some baseball stats (you can see some examples at http://www.twitter.com/gmbeisbol).
If you find my visualizations interesting and think a helping hand could help you include this kind of visualization in this blog, just drop me a message.
That would be great for me to improve my R skills and contribute to this important blog for baseball fans and data scientists in general.
Daniel: Sure, I’d be happy to have you contribute to the blog. The idea of showing a graphic to illustrate or answer some baseball question, together with advice on how to construct it, would be great. You can send me materials at firstname.lastname@example.org. Jim