Getting Started with R and Baseball Data
I get emails from people who are interested in learning to use R to explore data but who struggle to load the datasets described in our Analyzing Baseball Data with R book. We do make all of our datasets and R scripts available at https://github.com/beanumber/baseball_R but some people appear to have issues making the datasets accessible in R’s working directory.
In thinking about how to help these people, I was reminded of how attractive the Lahman package is, since it contains all of the Lahman datasets. The Lahman package is really the simplest way to quickly show students a large collection of season stats that one can explore to address interesting questions.
The ABWRdata Package
So this week I created a new R package, ABWRdata, containing many of the datasets described in our text. Due to size limitations, I couldn’t put all of the datasets in this package, but it contains many of the datasets that we use in the different chapters. In particular, it gives a broad view of the types of baseball datasets that are currently freely available to the baseball researcher. I purposely don’t include any of the Lahman datasets since they are already available in the Lahman package.
One can install the package by typing at the RStudio Console window:
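As a minimal sketch of the install step, assuming the package is hosted on GitHub and installed with the remotes package: the repository path "bayesball/ABWRdata" below is an assumption, not something stated in this post, so check it against the package's actual home before running.

```r
# Assumption: ABWRdata is hosted on GitHub at "bayesball/ABWRdata".
# The repository path is not given in this post; adjust it if it differs.
install.packages("remotes")                    # helper for installing from GitHub
remotes::install_github("bayesball/ABWRdata")  # download and install the package
```

After a successful install, library(ABWRdata) makes the datasets available in the current session.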
What Datasets are Included in the Package?
- ALbatting, NLbatting, NLpitching — these are team batting and pitching datasets from the 2011 season
- gl2011 — this contains Retrosheet data for all games in the 2011 season
- all1998 — this contains Retrosheet data for all plays in the 1998 season
- hofbatting, hofpitching — these contain career statistics for Hall of Fame (HOF) non-pitchers and pitchers
- statcast2017 — this contains Statcast data on a random subset of 100K pitches from the 2017 season
- dimaggio_1941, williams_1941 — these contain game-by-game batting stats for Joe DiMaggio and Ted Williams in the historic 1941 season; this data is used in Chapter 10 of our book
- spahn — contains season pitching stats for the HOF pitcher Warren Spahn described in Chapter 2 of our book
Once you load in the package, all of these datasets are immediately available (so-called lazy loading). For example, in the three lines below, (1) I load in the ABWRdata package, (2) I use the rle() function to find the lengths of all hitting slumps and hitting streaks in DiMaggio’s 1941 season, and (3) I display all hitting streaks to confirm that he did indeed have a 56-game hitting streak that season.
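To illustrate how rle() separates streaks from slumps, here is a small sketch on a made-up vector of hits per game. The vector game_hits is hypothetical and is not DiMaggio's data. With the package loaded, the same steps would presumably apply to the hits column of dimaggio_1941, whose name (H is a guess) is an assumption.

```r
# Hypothetical hits per game for eight consecutive games (not real data)
game_hits <- c(1, 0, 0, 2, 1, 1, 0, 3)

# TRUE when the batter had at least one hit in the game
got_hit <- game_hits > 0

# rle() collapses consecutive equal values into runs (lengths and values)
runs <- rle(got_hit)

# Runs of TRUE are hitting streaks; runs of FALSE are slumps
hit_streaks <- runs$lengths[runs$values]
slumps <- runs$lengths[!runs$values]

print(hit_streaks)
print(max(hit_streaks))

# With ABWRdata loaded, the analogous call might look like this
# (assumption: the hits column is named H):
# runs <- rle(dimaggio_1941$H > 0)
```

In this toy example the streak lengths are 1, 3, and 1 games, so the longest streak is 3 games; on DiMaggio's 1941 game log the longest run of TRUE values should be 56.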
The idea behind this package is to provide an easier entry into the world of baseball data and into performing one’s own baseball studies using R. If you find that this package is helpful in getting access to this data, let me know by emailing me at firstname.lastname@example.org. If even one person is appreciative, that makes it worthwhile.