Analyzing Baseball Data with R

Some information about the book Analyzing Baseball Data With R, 2nd edition by Max Marchi, Jim Albert, and Ben Baumer:

Some useful links for the book.

 

150 responses

  1. Jim,

    I hope all is well. I have continued working through the text and just have a question regarding Chapter 5, Exercise 3.

    Is there any way you can confirm the run values for Rickie Weeks and Michael Bourn? I have 9.029 for Weeks and 7.360 for Bourn. I just want to be sure I am following along with the code properly. I used your code from the in-chapter exercises with Pujols.

    Thanks so much,
    Lloyd

    1. Lloyd:

      Here’s what I have for Exercise 3 of Chapter 5:

      d2016 %>%
        filter(BAT_ID %in% c("eatoa002", "marts002"),
               BAT_EVENT_FL == TRUE) %>%
        group_by(BAT_ID) %>%
        summarize(N = n(),
                  M = mean(run_value),
                  S = sum(run_value))
      ## # A tibble: 2 x 4
      ##   BAT_ID       N      M     S
      ##
      ## 1 eatoa002   706 0.0188 13.3
      ## 2 marts002   529 0.0179  9.48

      Hope this helps.

      Jim

  2. Hi Jim,

    Hope all is well. I am working through Chapter 4 Exercise 3 (Manager Effect in Baseball) and ran into an issue running the solution. I get the following error running the code:
    Error: Problem with `summarise()` input `Mean_Residual`.
    x object ‘.resid’ not found
    i Input `Mean_Residual` is `mean(.resid)`.
    i The error occurred in group 1: playerID = “actama99”.

    It seems that for whatever reason the augment() function is not adding the .resid column. Instead I only get the following:
    > out
    # A tibble: 345 x 10
    yearID teamID R RA .fitted .hat .sigma .cooksd .std.resid playerID

    I am using the solutions dated 1/10/2019.

    Any help or guidance on what is going wrong would be greatly appreciated.

    Thanks,
    Brandon

    1. Hi Brandon:

      It appears that the broom package has changed what happens with the augment() function. I haven’t looked at it carefully, but I see that the data frame out has the variable .std.resid instead of .resid. So I think if you replace mean(.resid) with mean(.std.resid) it should work fine.

      I’ll make a correction on those solutions.

      Thanks.

      Jim

      1. Thanks Jim! Greatly appreciate the quick response. Loving the book!
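
      If augment() in your broom version does not return a .resid column, one workaround is to compute raw residuals directly with base R's residuals(). The sketch below uses a small made-up data frame: the column names mirror the thread, but the values, the manager IDs, and the model formula are illustrative assumptions, not the book's solution.

```r
# Toy data: hypothetical values, not taken from the Lahman database.
toy <- data.frame(
  playerID = c("mgr01", "mgr01", "mgr02", "mgr02", "mgr03", "mgr03"),
  R    = c(700, 750, 650, 800, 720, 680),
  RA   = c(680, 700, 700, 690, 720, 730),
  Wpct = c(0.520, 0.540, 0.460, 0.580, 0.500, 0.470)
)

# Fit winning percentage on run differential (assumed model form).
fit <- lm(Wpct ~ I(R - RA), data = toy)

# residuals() returns raw residuals in the row order of the data.
toy$resid <- residuals(fit)

# Mean residual per manager, analogous to summarize(mean(.resid)).
mean_resid <- aggregate(resid ~ playerID, data = toy, FUN = mean)
mean_resid
```

      Raw and standardized residuals are on different scales, so the two means will differ in size, although each standardized residual has the same sign as its raw residual.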

  3. I am in Section 6.2.3 and getting the error below. Could a missing package be causing this?

    count_plot %+% run_value_by_count +
      scale_fill_gradient2("xRV", low = grey10, high = crcblue,
                           mid = white)
    Error in count_plot %+% run_value_by_count :
    is.character(lhs) is not TRUE

    1. Robert, I just tried running that Chapter 6 from the script posted on our Github site and I couldn’t reproduce your error. We are using the tidyverse suite of packages, but nothing else. Unfortunately, without having your computer in front of me, I am not sure what is creating the issue. Sorry not to be of more help. Jim
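
    One possible cause, not confirmed in this thread, is that another attached package defines its own `%+%` operator and masks the one from ggplot2. The message "is.character(lhs) is not TRUE" is the kind of error a string-concatenation version of `%+%` would raise when handed a plot. The sketch below uses a hypothetical helper, where_defined(), to report which package a function name currently resolves to.

```r
# Hypothetical helper: report the package (namespace) in which the
# function currently found under `name` was defined.
where_defined <- function(name) {
  f <- get(name, mode = "function")
  environmentName(environment(f))
}

# Demonstration on a function that is always available.
sd_home <- where_defined("sd")
sd_home
```

    In a session that shows the error, where_defined("%+%") would name the package whose operator is being used. If it is not ggplot2, calling the operator with an explicit namespace, ggplot2::`%+%`(count_plot, run_value_by_count), is one way around the masking.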

  4. Can someone help me figure out how to make sense of this book? Chapter 2 is allegedly “Introduction to R,” and yet I can’t seem to find any instructions on how to actually access/import the data necessary to work alongside each exercise.

    1. All of the datasets described in the book are found in the data folder at https://github.com/beanumber/baseball_R. There is also a package, ABWRdata, that contains most of the datasets. You can install this package by following the instructions at https://github.com/bayesball/ABWRdata.

      Good luck — I know it can be challenging to get started.

  5. This book has been a great help, but I’ve got stuck in section 3.7.1. I don’t believe I have the same type as the person above, but any help is appreciated. Here is what I’m typing:

    get_birthyear<-function(Name){
    Names%
    filter(nameFirst==Names[1],
    nameLast==Names[2])%>%
    mutate(birthyear=ifelse(birthMonth>=7,
    birthYear+1,birthYear),
    Player=paste(nameFirst,nameLast))%>%
    select(playerID,Player,birthyear)
    }

    It seems to me like this part of the code isn’t working. When I go to the next steps in the chapter for setting up the table, it comes up with 0 observations of 3 variables. I’ve read the Lahman database into R, so I’m not exactly sure what’s not connecting. Thanks!

    1. Nick:
      It seems that you didn’t type in the get_birthyear() function correctly — it should be as I’ve copied below.
      By the way, all of the code for the chapters can be found at https://github.com/beanumber/baseball_R/tree/master/chapter_code
      Best: Jim

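      get_birthyear <- function(Name) {
        # Split "First Last" into its two parts
        Names <- unlist(strsplit(Name, " "))
        # Master is named People in newer versions of Lahman
        Master %>%
          filter(nameFirst == Names[1],
                 nameLast == Names[2]) %>%
          mutate(birthyear = ifelse(birthMonth >= 7,
                                    birthYear + 1, birthYear),
                 Player = paste(nameFirst, nameLast)) %>%
          select(playerID, Player, birthyear)
      }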
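
      To see what the function's mutate() step does, here is a base R sketch of the same birth-year adjustment on a tiny hand-made table. The rows and the name get_birthyear_base() are invented for illustration and stand in for the Lahman player table, so only the logic, not the data, should be relied on.

```r
# Invented rows standing in for the Lahman player table.
players <- data.frame(
  playerID   = c("aaaaa01", "bbbbb01"),
  nameFirst  = c("Al", "Bo"),
  nameLast   = c("Early", "Late"),
  birthMonth = c(3, 10),
  birthYear  = c(1980, 1980)
)

get_birthyear_base <- function(Name, data) {
  # Split "First Last" into its two parts.
  Names <- unlist(strsplit(Name, " "))
  hit <- data[data$nameFirst == Names[1] & data$nameLast == Names[2], ]
  # Players born in July or later are assigned the following year.
  hit$birthyear <- ifelse(hit$birthMonth >= 7, hit$birthYear + 1, hit$birthYear)
  hit$Player <- paste(hit$nameFirst, hit$nameLast)
  hit[, c("playerID", "Player", "birthyear")]
}

early <- get_birthyear_base("Al Early", players)
late  <- get_birthyear_base("Bo Late", players)
```

      A name that matches no row returns a zero-row result, which is the "0 observations of 3 variables" symptom described above.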

  6. Can anyone explain this message when trying to install a package: “Do you want to install from sources the package which needs compilation?”

    Thanks in advance.

    1. Robert, there are two kinds of packages to install: precompiled packages and source packages containing code (such as C++ or Fortran) that needs to be compiled. Most packages come precompiled, but if you have C++ and Fortran compilers on your computer, you can compile the source packages. Usually the packages that need to be compiled are the ones that were recently released; if you wait a few days, they will be available in a precompiled version.

  7. Excellent book, it is a great tool for the study of statistics, and to make every baseball game even more interesting. I have been trying to follow the example of the Pythagorean expectation formula and how to obtain its exponent (pages 94-102), but I wonder what to do when a team scored no runs in the final score, that is, ended with zero runs. In such cases, for example, for the calculation of logRratio (page 95), log(0) is -Inf.
    > log(0)
    [1] -Inf
    What is the way to deal with this data: remove this log from the data, or calculate, for example, log(0.1)?
    Thanks

    1. Alfredo, thanks for the kind comments on the book. Usually we apply the Pythagorean expectation formula for a collection of games where the runs scored for and against are positive. I don’t think we use it for a single game.

      1. Thanks a lot for your response.

      2. What I was trying to do, as a statistical exercise, is to apply the formula to several games in a season to observe how the expectation of games won changes over the dates. My intention was to find the time when the prediction most closely matched the final outcome. I am looking at a short 49-game season in my country’s professional league. Currently, they are going through game 20. My first choice was to think of a sample size for all 49 games. But doing the exercise by dates, I plan to find the point at which the prediction was most closely matched. Kindly, I would like to ask if you can think of any recommendations for my exercise? Thank you.
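
      One way to sidestep log(0), sketched here under the assumption that cumulative season totals are an acceptable substitute for single-game scores, is to apply the formula to running totals of runs scored and allowed after each game. The scores below are made up, and the exponent 2 is the classic Pythagorean value rather than one fitted to this league.

```r
# Made-up game-by-game results for one team
# (R = runs scored, RA = runs allowed).
games <- data.frame(
  R  = c(5, 0, 3, 7, 2, 4),
  RA = c(3, 2, 0, 4, 6, 1)
)

# Running totals after each game; both are positive from game 1 here,
# even though single games contain zeros.
games$cumR  <- cumsum(games$R)
games$cumRA <- cumsum(games$RA)

# Actual and Pythagorean winning percentage to date.
games$W       <- cumsum(games$R > games$RA)
games$Wpct    <- games$W / seq_len(nrow(games))
games$Wpct_py <- games$cumR^2 / (games$cumR^2 + games$cumRA^2)

# Absolute gap between expectation and record after each game.
games$gap <- abs(games$Wpct - games$Wpct_py)
games
```

      Early in a season the totals are small and the estimate is noisy, so comparisons across dates are more meaningful once a team has played a number of games.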

  8. I apologize as I am only on Chapter 2, but I am already having a problem while following along with Warren Spahn’s CSV file. As I run this code, I get to
    install.packages("tidyverse")
    library(tidyverse)
    library(Lahman)
    getwd()
    spahn <- read_csv("data/spahn.csv")
    While read.csv works for reading data/spahn.csv, read_csv produces this error:
    Error in `vec_as_location()`:
    ! `...` must be empty.
    x Problematic argument:
    * call = call
    Run `rlang::last_error()` to see where the error occurred.

    I am fairly sure my working directory is set up correctly so I thought it might be something else wrong. I am completely lost and am just getting back to coding so any help would be greatly appreciated. Thank you!

    1. Justin, sorry, but there is no simple answer to that particular error message. I can’t reproduce that on my laptop. I’d suggest using read.csv() instead of read_csv() to read in data files; both functions read a CSV file into a data frame. Jim

  9. Hello!
    I was wondering how to get the hofbaseball.csv file to use for Chapter 3. I know it wasn’t in the original files when I downloaded the CSV files for 2017, but I can’t seem to figure out how to get hold of it.

    1. In Chapter 3, I don’t believe there is a hofbaseball.csv file mentioned, but there are hofbatting.csv and hofpitching.csv. These two files are available in the data folder on GitHub at https://github.com/beanumber/baseball_R/tree/master/data

  10. Jayson Stancil

    Hi,
    I recently bought the book, and love the work you’ve done. However, I am running into many issues because I believe the latest version of the Lahman package in R removed the Master table. Is there a workaround for this, or am I doing something incorrect?

    1. Jayson, in the Lahman package, the Master data frame has been renamed as the People data frame. Jim

      1. Jayson Stancil

        You’re a lifesaver. Thanks again!

  11. Hi Jim!

    Love the book!

    I wanted to ask for some clarification on the first “Baseball Question” in section 1.2.8.

    What I wanted to ask was how the “average number of home runs per game recorded in each decade” is calculated. Specifically, how are you obtaining the values 0.3, 0.8, and 2.2 in the paragraph below the question?

    My approach was to use the Teams dataset and group_by the variable yearID. I thought that “home runs per game” could be calculated by taking the total number of home runs and dividing by the total number of games. But my results didn’t match what you had.

    Thanks!

    1. Addison:

      What you did was fine. But you were computing the average number of home runs per team per game. Since two teams are playing, the average number of home runs per game would be double what you are finding.

      Jim

      1. Hi Jim, Thanks for the quick reply!

        Is this correct, or do I need to further group by decade?

        View(
          Teams %>%
            group_by(yearID) %>%
            summarise(total_games = sum(G),
                      total_hr = sum(HR),
                      hr_pg = 2 * total_hr / total_games)
        )
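
        For the decade question, one approach, sketched in base R on invented numbers, is to derive a decade from yearID and total the home runs and games within each decade. The small table stands in for Lahman's Teams data frame. Because each game appears once for each of the two teams, the per-team-game rate is doubled.

```r
# Invented team-season rows standing in for Lahman's Teams table.
teams <- data.frame(
  yearID = c(1921, 1921, 1925, 1925, 1998, 1998),
  G      = c(154, 154, 154, 154, 162, 162),
  HR     = c(40, 60, 70, 80, 180, 200)
)

# Map each season to the first year of its decade, e.g. 1925 -> 1920.
teams$decade <- 10 * floor(teams$yearID / 10)

# Total team-games and home runs per decade.
by_decade <- aggregate(cbind(G, HR) ~ decade, data = teams, FUN = sum)

# Each game is counted once per team, so double the per-team-game rate.
by_decade$hr_pg <- 2 * by_decade$HR / by_decade$G
by_decade
```

        With dplyr, the same idea would be a mutate(decade = 10 * floor(yearID / 10)) step followed by group_by(decade) in place of group_by(yearID).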
