One of the first albums I purchased back in the 1960s was Sgt. Pepper’s Lonely Hearts Club Band by the Beatles, and the first song on this album begins with “It was 20 years ago today …”. For this blog, 20 is a special number since this is the 20th anniversary of two things relevant to exploring baseball data with R. It has been 20 years since the release of version 1.0 of R and 20 years since the beginning of the Baseball Reference site. It seems appropriate to provide some personal comments on these two anniversaries, reflecting on the history of statistical software and the availability of baseball data.
The Beginning of Interactive Statistical Computing
When I started at BGSU in 1979, statistical computation looked very different from what we do today. The popular statistics packages such as SAS and SPSS operated in “batch mode”. You wrote code to implement your statistical analysis, you submitted the code to the computer, and sometime later you would get your output. This was okay, but it had a serious drawback: one small mistake in your code would turn your analysis into junk, so it could take many iterations of the batch process to get reasonable results.
But in the early 1980s, I was introduced to the statistical language S, developed at Bell Labs and available for UNIX computers. This was different: I now had the opportunity to code interactively with data. It was easy to change one’s analysis after removing outliers, fit and compare a variety of models, and adjust graphics parameters to get suitable graphs. Here is a snippet of some S work that I did in the 1980s; all of these commands will work in the current release of R. (You’ll also note that we didn’t have monitors yet; the output was typed on lined paper.)
The Beginnings of R
As most of you know, R was developed by two statisticians, Robert Gentleman and Ross Ihaka, as an open-source implementation of the S language. When I first learned about R, I thought of it as a poor man’s version of S, but obviously it is much more than that. It really is the premier language for doing data science, which includes the process of importing, cleaning, manipulating, visualizing, and learning from data.
Here are some remarkable features of R.
- It is a complete statistical system in the sense that all traditional and new statistical methods are available in the base R package or one of the many user-contributed packages.
- It produces publication-quality output in tables and graphs.
- The user interface has continually improved with the development of RStudio. One of the most exciting developments is R Markdown documents, which provide a means of integrating R work with text. We all know that textbooks are expensive. But using the bookdown package, one can produce a free online book available to all students.
- R is open source software freely available to anyone. And there is a large community of R users and sites for providing tutorials and assistance.
Early Days of Baseball Data
I remember the early days of baseball statistics, when the best source of data was a thick book like Total Baseball, which currently sits in my office.
While we were experiencing advances in statistical computing languages, there was a parallel development in the availability of baseball data in electronic form. Sean Lahman made available season-to-season data for players and teams, and Retrosheet was a remarkable grass-roots movement to make game logs and play-by-play files available for recent seasons. Although these datasets were available to researchers, the raw data was a bit messy and not accessible to the general baseball fan.
Sean Forman and Baseball Reference
Sean Forman started Baseball Reference (B-R) twenty years ago with the general purpose of providing ready-to-use and complete statistics for the baseball fan. I can relate to Sean since we both got doctorates at midwestern schools in math-related fields and we both have connections with Philadelphia. But Sean was brave to leave his academic position at St. Joseph’s to work on Baseball Reference. I see Baseball Reference as providing an attractive interface to the mass of data available in the Lahman and Retrosheet databases. It is a remarkably well-designed site and the top place for learning about baseball players, teams, games and playoffs. To briefly illustrate its usefulness, I remember watching on TV Jim Bunning’s perfect game, pitched on Father’s Day in 1964. I can quickly reminisce about this particular game by finding the corresponding Baseball Reference page.
From this B-R page, I learn some interesting things about Jim Bunning’s perfect game.
- It was a quick game by current standards — only 2 hours and 19 minutes.
- I had forgotten that Tracy Stallard lost this game. Stallard is also known for allowing Roger Maris’ 61st home run in 1961.
- Bunning concluded this game with two strikeouts, both of the swinging variety.
- The B-R page provides a graph of the in-game win probabilities and shows the top five plays. We see the Phillies scored early and the game outcome was decided early — the probability the Phillies would win was 80% in the 6th inning.
- We can learn more about this game by clicking on the View Game Recap link on the B-R page — from this SABR article, we learn about a great play by Tony Taylor in the 5th inning that saved Bunning’s no-hitter.
Currently there are other useful baseball statistics sites such as FanGraphs and Baseball Savant. But I am sure Baseball Reference remains the main go-to site for fans looking for baseball data. It really provides a living history of baseball through box scores and season statistics of players and teams. Sports Reference provides similar statistics for football (both pro and college), basketball (both pro and college), hockey and soccer. One nice feature of Sports Reference (including B-R) is that all displayed tables can be exported in csv (and other formats) for easy import into R or other programs. Actually, using these types of tables is a good way to get started exploring baseball data.
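As a minimal sketch of that export-then-import workflow, the R code below reads a csv table with `read.csv()` and computes a batting average. The inline csv text, the player names and the numbers are all made up for illustration. With a real export, one would pass the path of the downloaded file (for example, a hypothetical `batting.csv`) in place of the `text =` argument.

```r
# Illustrative stand-in for a table exported as csv from a Sports Reference page.
# The players and numbers are invented, not real statistics.
csv_text <- "Player,AB,H,HR
Player A,500,150,20
Player B,450,120,35
Player C,600,186,10"

# With a downloaded file, use read.csv("batting.csv") instead
# (hypothetical file name).
batting <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Add a batting average column and sort from highest to lowest.
batting$AVG <- round(batting$H / batting$AB, 3)
batting <- batting[order(-batting$AVG), ]

print(batting)
```

Once the table is a data frame, the usual R tools for summarizing and graphing apply directly.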
Looking to the Future
Obviously we have made great strides in the availability of sports data and in the development of programming languages such as R that provide the tools to make sense of and visualize this data. What does the future look like? I am sure it will be exciting. Here are some things I’d like to see.
- More sports articles that include the relevant data. Usually articles will focus on the results of a statistical exploration and provide little insight into the source of the data. Wouldn’t it be great if all articles were required to include a link to the data source?
- More visualization in sports stats sites. I think it is so much easier to tell baseball stories, such as the story of a player’s career, by the use of appropriate graphs. Can Baseball Reference add graphs to their pages? I do appreciate their graphs of win probabilities that accompany each game.
- More cloud computing. People who are learning R have issues in installing the relevant packages and reading in the data. Wouldn’t it be nice if you could just provide users a baseball cloud platform where all of the packages and datasets are already available? I’m starting to have students use the RStudio cloud environment and my initial impressions are positive.