ABWRdata Package to Accompany Analyzing Baseball Data with R

Getting Started with R and Baseball Data

I get emails from people who are interested in using R to explore data but who struggle to load the datasets described in our Analyzing Baseball Data with R book. We make all of our datasets and R scripts available at https://github.com/beanumber/baseball_R, but some people appear to have trouble making the datasets accessible in R’s working directory.

In thinking about how to help these people, I was struck by the attractiveness of the Lahman package, which contains all of the Lahman datasets. The Lahman package is really the simplest way of quickly showing students a large collection of season stats that one can explore to address interesting questions.

The ABWRdata Package

This week I created a new R package, ABWRdata, containing many of the datasets described in our text. Due to size limitations, I couldn’t put all of the datasets in this package, but it contains many of the datasets that we use in the different chapters. In particular, it gives a wide perspective on the types of baseball datasets that are currently freely available to the baseball researcher. I purposely don’t include any of the Lahman datasets since they are already available in the Lahman package.

One can install the package by typing at the RStudio Console window:
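Here is a minimal sketch of a GitHub install using the remotes package. The repository path bayesball/ABWRdata is an assumption on my part, so check it before running.

```r
# Install the helper package for GitHub installs if it is not already present
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# ASSUMPTION: the package is hosted at the GitHub repository bayesball/ABWRdata;
# substitute the correct "owner/repo" path if it lives elsewhere
remotes::install_github("bayesball/ABWRdata")

# Attach the package so its datasets become available
library(ABWRdata)
```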


What Datasets are Included in the Package?

  • ALbatting, NLbatting, NLpitching — these are team batting and pitching datasets from the 2011 season
  • gl2011 — this contains Retrosheet data for all games in the 2011 season
  • all1998 — this contains Retrosheet data for all plays in the 1998 season
  • hofbatting, hofpitching — these contain career statistics for Hall of Fame (HOF) non-pitchers and pitchers
  • statcast2017 — this contains Statcast data on a random subset of 100K pitches from the 2017 season
  • dimaggio_1941, williams_1941 — these contain game-by-game batting stats for Joe DiMaggio and Ted Williams for the historic 1941 season; this data is used in Chapter 10 of our book
  • spahn — contains season pitching stats for the HOF pitcher Warren Spahn described in Chapter 2 of our book

Once you load in the package, all of these datasets are immediately available (so-called lazy loading). For example, in the three lines below, (1) I load in the ABWRdata package, (2) I use the rle() function to find the lengths of all hitting slumps and hitting streaks in DiMaggio’s 1941 season, and (3) I display all hitting streaks to confirm that he did indeed have a 56-game hitting streak that season.
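To show how rle() picks out streaks and slumps, here is a small sketch on a made-up vector of hits per game. The values are hypothetical and are not DiMaggio's actual game log, so the code runs without the package installed.

```r
# Hypothetical hits-per-game values (made up for illustration only)
hits <- c(1, 0, 2, 1, 1, 0, 0, 3, 1, 2, 1, 0)

# rle() on the logical vector gives the length and value of each run:
# TRUE runs are games with at least one hit, FALSE runs are hitless games
runs <- rle(hits > 0)

# Lengths of the TRUE runs are hitting streaks; FALSE runs are slumps
streaks <- runs$lengths[runs$values]
slumps <- runs$lengths[!runs$values]

print(streaks)
print(slumps)
```

With the package loaded, the same steps should apply to the hits column of dimaggio_1941 in place of the toy vector. I am assuming here that the column is named H, so something like rle(dimaggio_1941$H > 0).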


The idea behind this package was to provide an easier entry into the world of baseball data and performing one’s own baseball studies using R. If you find that this package is helpful in getting access to this data, let me know by emailing me at albert@bgsu.edu. If one person is appreciative, then it makes it worthwhile.
