Where’s the Data?

Since we are at the beginning of a new season, I thought it might be helpful to review some of the data sources that are currently available for baseball.  I will focus on the sources that I have found most useful in my own work.

Lahman Database

A great source of season-by-season baseball data is the Lahman database maintained by Sean Lahman.  For my work, I download the files in csv format, although other data formats are available.  Besides the popular Batting, Pitching, and Master datafiles, there are files on playoff games, Hall of Famers, Teams, and salaries.  There is a Lahman R package that contains this data, although I am not sure it is updated for the 2016 season.
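
As a minimal sketch of working with the csv files in R: the Batting file has one row per player, season, and stint, so season totals need an aggregation step.  The rows below are made up for illustration (they are not real Lahman records), the column names follow the Batting file layout, and the file path mentioned in the comment is an assumption about where you saved the download.

```r
# Made-up rows in the layout of the Lahman Batting file (illustration only).
# With the real download one would instead read the file, for example
# batting <- read.csv("Batting.csv"), where the path is an assumption.
batting_csv <- "playerID,yearID,stint,AB,H,HR
playerA,2015,1,300,90,10
playerA,2015,2,200,50,5
playerB,2015,1,400,100,20"

batting <- read.csv(text = batting_csv, stringsAsFactors = FALSE)

# Collapse multiple stints into one row per player-season.
season <- aggregate(cbind(AB, H, HR) ~ playerID + yearID,
                    data = batting, FUN = sum)

# Batting average computed from the season totals.
season$AVG <- round(season$H / season$AB, 3)
season
```

Summing the counts first and then dividing matters here: averaging the two stint-level batting averages for a traded player would weight the stints equally regardless of at-bats.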

Retrosheet Data

Retrosheet is a grassroots organization “founded in 1989 for the purpose of computerizing play-by-play accounts of as many pre-1984 major league games as possible”.  If you have been reading my blog, you’ll know that I regard this as one of the best sources of data at the game or play level.  In an earlier post, I describe the process of downloading the Retrosheet play-by-play data into R.

PitchFX Data

PitchFX is a tracking system that collects data about each pitch in baseball; the data has been available since 2006.  The R package pitchRx by Carson Sievert allows one to scrape PitchFX data for particular days of interest.  This is rich data allowing one to compare pitchers with respect to pitch speed, pitch type, break, location, and outcome.  I have demonstrated the use of this data in a number of posts.  I’ve tried this scraping recently and it seems to work fine.
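
A pitcher comparison of the kind described above might look like the following sketch.  It assumes the scraped pitch data has already been assembled into a single data frame with columns pitcher_name, pitch_type, and start_speed (names in the style of the PitchFX fields; check them against your own scrape).  The rows are made up for illustration and the pitcher names are hypothetical.

```r
# Made-up pitch-level rows (illustration only); pitcher names are hypothetical.
pitches <- data.frame(
  pitcher_name = c("Pitcher A", "Pitcher A", "Pitcher A",
                   "Pitcher B", "Pitcher B", "Pitcher B"),
  pitch_type   = c("FF", "FF", "SL", "FF", "CU", "CU"),
  start_speed  = c(95.0, 94.0, 86.0, 91.0, 78.0, 80.0),
  stringsAsFactors = FALSE
)

# Average release speed for each pitcher and pitch type.
speed <- aggregate(start_speed ~ pitcher_name + pitch_type,
                   data = pitches, FUN = mean)

# How often each pitcher throws each pitch type.
usage <- table(pitches$pitcher_name, pitches$pitch_type)

speed
usage
```

The same two summaries extend to location or outcome by swapping in the relevant columns of the pitch data.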


Baseball-Reference

Sean Forman’s Baseball-Reference site is a “complete source for current and historical baseball players, teams, scores and leaders.”  It can be viewed as an easily browsable version of much of the Retrosheet data, but it has much more, such as win probability graphs for every game in baseball history.  The format of this site has been recently updated.  One nice feature from a data perspective is that one can “share” data in a number of different formats, such as Excel or csv, which makes for easy import into R.


FanGraphs

FanGraphs is a large site containing articles and a vast array of statistics for past and present baseball players and teams.  If you search for 2016 Batting Leaders, for example, you’ll see a wide range of statistics divided into “Standard”, “Advanced”, “Batted Ball”, “Win Probability”, “Pitch Type”, “Pitch Value”, “Plate Discipline”, “Value”, and “PitchFX”.  There are many interesting measures described in the FanGraphs Glossary section.  It appears that much of the data can be downloaded in a form that facilitates easy import into R.  I am working on a new book on “baseball graphs” and I plan on devoting a chapter to all of the new batting and pitching measures illustrated in FanGraphs.

Baseball Savant

Quoting from their site, BaseballSavant “is a site dedicated to providing player matchups, Statcast metrics, and advanced statistics in a simple and easy-to-view way.”  This site provides a window into the new Statcast data that collects information about the location and movement of the ball and every player on the field.  It appears to be a work in progress, so I would expect different and new types of Statcast data to become available in the future.

R Baseball Packages

I have already mentioned several R packages, specifically Lahman, which contains the Sean Lahman database, and pitchRx, which allows one to easily scrape PitchFX data.  I should also mention the openWAR package by Ben Baumer and Greg Matthews.  This allows one to easily scrape play-by-play data for current games from the MLBAM GameDay files.  The purpose of this package was to provide an “open” source for performing run value and WAR calculations for baseball plays.  Also, the newer baseballr package by Bill Petti was written to facilitate the downloading of baseball data from FanGraphs and BaseballSavant.

Any Other Good Sites?

The purpose of this post was to give a quick snapshot of the baseball data sources that I have found most useful in my work and have illustrated in the posts on this site.  As we know, things are always changing fast.  Let me know if there are other good sources of data that you think I should mention in future posts.











