Monthly Archives: September, 2020

Creating a Game Score Database

Introduction

Recently, the Atlanta Braves defeated the Miami Marlins by a remarkable 29-9 score. Reflecting on this blowout, I wondered about the general pattern of scores in Major League Baseball. What are the typical or most likely baseball scores, and have these most likely scores changed over recent MLB history? What is the chance of a blowout like this 29-9 game, and have blowouts become more likely in recent seasons?

To address these questions, we need a database of scores of MLB games in recent seasons. (By the way, I won't actually be creating a database here; instead I will be creating a large data frame with game scores from recent seasons.) A convenient source of boxscore data is the set of Retrosheet game logs available at https://www.retrosheet.org/gamelogs/. In this post, I'll describe the use of a special R function, get_scores(), that will read in the results for all games for a particular season from the Retrosheet site. (This function is a slight modification of a function that we introduced in Chapter 11 of the 2nd edition of Analyzing Baseball Data with R.) With one function from the purrr package, I can apply get_scores() to read and collect multiple seasons of game data. I'll illustrate the use of my database by looking for some unusual baseball scores and exploring the pattern of average total runs scored and average margin of victory over the last 50 seasons.

Reading in Retrosheet Game Log Files

I start with the function load_gamelog() that is listed on page 254 of ABWR. The single input to this function is the year of the season of interest. The function downloads the zip file from the Retrosheet game log web page, extracts the text file, and reads this file into R. The column names are available from our book's GitHub site. These names are imported and attached to the columns of the Retrosheet data.
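As a rough illustration of the steps just described (download, unzip, read, attach names), here is a minimal base-R sketch. It is not the ABWR listing: the names gamelog_url() and read_gamelog_sketch() and the header_names argument are hypothetical, and the gl<season>.zip file name pattern is an assumption about how the files on the game log page are named.

```r
# Hypothetical helper: build the download URL for one season's game log.
# Assumes the zip files are named gl<season>.zip on the Retrosheet page.
gamelog_url <- function(season) {
  paste0("https://www.retrosheet.org/gamelogs/gl", season, ".zip")
}

# Hypothetical sketch of the download, extract, and read steps.
# header_names: character vector of column names, one per game log field
# (for example, read beforehand from the header file on the book's site).
read_gamelog_sketch <- function(season, header_names) {
  zip_path <- tempfile(fileext = ".zip")
  download.file(gamelog_url(season), destfile = zip_path, mode = "wb")
  # unzip() returns the paths of the files it extracted
  txt_path <- unzip(zip_path, exdir = tempdir())
  gamelog <- read.csv(txt_path[1], header = FALSE,
                      col.names = header_names,
                      stringsAsFactors = FALSE)
  unlink(c(zip_path, txt_path))  # remove the temporary files
  gamelog
}
```

Using the paths returned by unzip() avoids having to guess the name of the extracted text file.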

In this application, I am only interested in the Retrosheet variables related to game score. I wrote a function get_scores() that is a wrapper for the load_gamelog() function. The get_scores() function downloads the Retrosheet data for a specific season and collects only the variables Date, HomeTeam, VisitingTeam, HomeRunsScore, VisitorRunsScored, and Season.
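The Gist has the actual get_scores() code; the following is only a sketch of what such a wrapper might look like. The name get_scores_sketch(), the loader argument, and the stand-in fake_loader() are hypothetical, and I am assuming that Season is simply the season input added as a column.

```r
library(dplyr)

# Hypothetical sketch of a score-only wrapper. `loader` is an illustrative
# argument (not in the post) so the function can be tried without a download;
# its default, load_gamelog(), must already be defined if it is used.
get_scores_sketch <- function(season, loader = load_gamelog) {
  loader(season) %>%
    select(Date, HomeTeam, VisitingTeam,
           HomeRunsScore, VisitorRunsScored) %>%
    mutate(Season = season)   # assumption: Season is the input year
}

# Stand-in loader returning two made-up games plus one extra column
fake_loader <- function(season) {
  data.frame(Date = c(20190401, 20190402),
             HomeTeam = c("AAA", "BBB"),
             VisitingTeam = c("CCC", "DDD"),
             HomeRunsScore = c(3, 5),
             VisitorRunsScored = c(2, 9),
             Attendance = c(10000, 12000))
}

scores <- get_scores_sketch(2019, loader = fake_loader)
names(scores)   # the extra Attendance column has been dropped
```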

To get the score data for all games in the 2019 season, I just type:

d <- get_scores(2019)

The data frame d consists of 2429 rows, where each row gives the scoring information for a particular game during the 2019 regular season. I want to collect scores for the past 50 seasons, from 1970 through 2019. The map_df() function from the purrr package will apply the get_scores() function to each of the season inputs 1970, 1971, …, 2019 and stack the resulting data frames by row. The output df is a data frame of the game scores for the 110,325 regular season games played in the past 50 seasons.

library(purrr)
df <- map_df(1970:2019, get_scores)
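To see what map_df() is doing without downloading anything, here is a small sketch with a made-up stand-in for get_scores(). The function toy_scores() and its numbers are invented for illustration. The second version builds the same result in two explicit steps, a list of data frames followed by a row-bind.

```r
library(purrr)
library(dplyr)

# Made-up stand-in for get_scores(): one tiny data frame per season
toy_scores <- function(season) {
  data.frame(Season = season,
             HomeRunsScore = c(4, 1),
             VisitorRunsScored = c(2, 6))
}

# Apply the function to each season and stack the results by row
stacked <- map_df(2017:2019, toy_scores)

# The same idea in two steps: a list of data frames, then bind_rows()
stacked2 <- bind_rows(map(2017:2019, toy_scores))

nrow(stacked)   # 3 seasons x 2 games each
```

Recent purrr releases mark map_df() as superseded in favor of map() followed by list_rbind(), although map_df() still works. Since each call to the real get_scores() downloads a file, it may be worth saving df (for example with saveRDS()) after the first run.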

Some Extreme Game Scores

Since baseball fans are interested in extreme performances, let us first focus on a couple of extreme game scores. What game in the past 50 seasons generated the most runs scored? In the following dplyr code, I define the variable Runs and extract the record where Runs is maximized. We see from the output that the highest scoring game was on May 17, 1979, when the Phillies defeated the Cubs 23-22. (An interesting trivia item is that Mike Schmidt, one of my Philly heroes, hit the winning home run in the 10th inning of this high-scoring game.)

library(dplyr)
df %>%
  mutate(Runs = HomeRunsScore + VisitorRunsScored) %>%
  filter(Runs == max(Runs))

      Date HomeTeam VisitingTeam HomeRunsScore
1 19790517      CHN          PHI            22
  VisitorRunsScored Season Runs
1                23   1979   45

What game was the largest blowout? Using similar syntax, I find the game with the largest margin of victory. From the output, we see that on August 22, 2007, Texas defeated Baltimore 30-3, a winning margin of 27 runs.

df %>%
  mutate(Margin_Victory = abs(HomeRunsScore - VisitorRunsScored)) %>%
  filter(Margin_Victory == max(Margin_Victory))

      Date HomeTeam VisitingTeam HomeRunsScore
1 20070822      BAL          TEX             3
  VisitorRunsScored Season Margin_Victory
1                30   2007             27

Some History of Game Scores

I was primarily interested in exploring the pattern of runs scored and winning margins over the past 50 seasons. For each season, I find

  • Total Runs = the mean total runs scored in a game
  • Win Margin = the mean winning margin in a game
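The two season summaries defined above could be computed along these lines. This is a sketch on a tiny made-up data frame (toy_df is invented, not real games) that uses the same column names as df; the names Total_Runs and Win_Margin are my own labels.

```r
library(dplyr)

# Made-up scores for illustration only (not real games)
toy_df <- data.frame(
  Season            = c(2018, 2018, 2019, 2019),
  HomeRunsScore     = c(3, 10, 2, 7),
  VisitorRunsScored = c(2, 0, 5, 6)
)

season_summary <- toy_df %>%
  mutate(Runs = HomeRunsScore + VisitorRunsScored,
         Margin_Victory = abs(HomeRunsScore - VisitorRunsScored)) %>%
  group_by(Season) %>%
  summarize(Total_Runs = mean(Runs),            # mean total runs per game
            Win_Margin = mean(Margin_Victory))  # mean winning margin per game

season_summary
```

Replacing toy_df with the full df would give one row per season from 1970 through 2019.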

Below I plot these mean Total Runs and mean Win Margin values as a function of the season and add a loess smoothing curve to see the patterns. We see …

  • For mean Total Runs, there is a gradual increase from 1970 to 1990, followed by a big increase until 2000, followed by a general decrease until 2014 and an increase during the Statcast era.
  • Mean Win Margin appears to show the same pattern of increase and decrease as mean Total Runs. This makes sense: higher-scoring games tend to have larger win margins. In statistical terms, a season's mean Total Runs is strongly positively correlated with its mean Win Margin.
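A plot with a loess smooth and the correlation mentioned in the last point might be produced as follows. This is a sketch assuming ggplot2 for the graph; the season summary values below are invented numbers, not Retrosheet results, so the correlation it computes says nothing about the real data.

```r
library(ggplot2)

# Invented season summaries for illustration only
toy_summary <- data.frame(
  Season     = 2010:2015,
  Total_Runs = c(8, 9, 10, 9, 8, 9),
  Win_Margin = c(3.0, 3.4, 3.9, 3.5, 3.1, 3.3)
)

# Scatterplot of mean Total Runs by season with a loess smoothing curve
p <- ggplot(toy_summary, aes(Season, Total_Runs)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE)
# print(p) draws the graph; with so few toy points loess may warn

# Correlation between the two season-level means
r <- cor(toy_summary$Total_Runs, toy_summary$Win_Margin)
r
```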

Try the Code Out — Some Questions to Answer

On my GitHub Gist site, I show all of the R code for this exercise, including the get_scores() function and the code to generate the output that I have displayed. This is a good exercise for anyone learning R. Read in the code and try collecting game score data for a single season, or for a small set of seasons.

Here are some questions that you can try to answer with this game score data.

  1. For a season like 2019, what was the most common score?
  2. The most common score over the past 50 seasons is 3-2. Has the most common score changed in the last 50 seasons?
  3. Define a blowout to be a game where the margin of victory is at least 10 runs. What fraction of 2019 games were blowouts? Has the fraction of blowouts changed over this 50-season period?
  4. What fraction of games are shutouts? Again, has the fraction of shutouts changed in recent seasons?
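As a possible starting point for these questions, here is a sketch on a small made-up data frame. The column names Win_Score, Lose_Score, Blowout, and Shutout are my own, and I am assuming that a score is recorded as winner-loser regardless of which team was at home. The toy results say nothing about the real data.

```r
library(dplyr)

# Made-up games for illustration only
toy_df <- data.frame(
  HomeRunsScore     = c(3, 2, 12, 0, 3),
  VisitorRunsScored = c(2, 3, 1, 4, 2)
)

scored <- toy_df %>%
  mutate(Win_Score  = pmax(HomeRunsScore, VisitorRunsScored),
         Lose_Score = pmin(HomeRunsScore, VisitorRunsScored),
         Blowout    = (Win_Score - Lose_Score) >= 10,  # margin of 10 or more
         Shutout    = Lose_Score == 0)                 # loser scored no runs

# Tabulate the final scores, most frequent first
score_counts <- scored %>%
  count(Win_Score, Lose_Score, sort = TRUE)

# Fraction of blowouts and of shutouts
fractions <- scored %>%
  summarize(Blowout = mean(Blowout), Shutout = mean(Shutout))
```

Adding group_by(Season) before count() or summarize() on the full df would show how these quantities change from season to season.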