Analyzing Baseball Data with R

Some information about the book Analyzing Baseball Data with R, 2nd edition, by Max Marchi, Jim Albert, and Ben Baumer:

Some useful links for the book.


140 responses

  1. Jim,

    I hope all is well. I have continued working through the text and just have a question regarding Chapter 5, Exercise 3.

    Is there any way you can confirm the run values for Rickie Weeks and Michael Bourn? I have 9.029 for Weeks and 7.360 for Bourn. I just want to be sure I am following along with the code properly. I used your code from the in-chapter example with Pujols.

    Thanks so much,

    1. Lloyd:

      Here’s what I have for Exercise 3 of Chapter 5:

      d2016 %>%
        filter(BAT_ID %in% c("eatoa002", "marts002"),
               BAT_EVENT_FL == TRUE) %>%
        group_by(BAT_ID) %>%
        summarize(N = n(),
                  M = mean(run_value),
                  S = sum(run_value))
      ## # A tibble: 2 x 4
      ##   BAT_ID       N      M     S
      ## 1 eatoa002   706 0.0188 13.3
      ## 2 marts002   529 0.0179  9.48

      Hope this helps.


  2. Hi Jim,

    Hope all is well. I am working through Chapter 4 Exercise 3 (Manager Effect in Baseball) and ran into an issue running the solution. I get the following error running the code:
    Error: Problem with `summarise()` input `Mean_Residual`.
    x object ‘.resid’ not found
    i Input `Mean_Residual` is `mean(.resid)`.
    i The error occurred in group 1: playerID = “actama99”.

    It seems that, for whatever reason, the augment() function is not adding the .resid column. Instead I only get the following:
    > out
    # A tibble: 345 x 10
    yearID teamID R RA .fitted .hat .sigma .cooksd .std.resid playerID

    I am using the solutions dated 1/10/2019.

    Any help or guidance on what is going wrong would be greatly appreciated.


    1. Hi Brandon:

      It appears that the broom package has changed what happens with the augment() function. I haven’t looked at it carefully, but I see that the data frame out has the variable .std.resid instead of .resid. So I think if you replace mean(.resid) with mean(.std.resid) it should work fine.

      I’ll make a correction on those solutions.
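For readers who want raw residuals rather than standardized ones (the two are on different scales, so their group means are not interchangeable), here is a minimal sketch that avoids depending on which columns augment() returns. It uses base R's residuals() on a toy model fitted to the built-in mtcars data; the model, the grouping variable, and the names toy and mean_resid are illustrative assumptions, not the exercise's actual solution.

```r
# Toy linear model on a built-in dataset (assumption: stands in for the
# exercise's model, which is fitted to Lahman data instead).
fit <- lm(mpg ~ wt, data = mtcars)

# residuals() comes with base R's stats package and works on any lm fit,
# regardless of the installed broom version.
toy <- data.frame(
  group  = factor(mtcars$cyl),        # hypothetical grouping variable
  .resid = unname(residuals(fit))     # raw residuals: observed minus fitted
)

# Mean raw residual per group, using base R only.
mean_resid <- tapply(toy$.resid, toy$group, mean)
mean_resid
```

In the exercise, the same idea would be to attach residuals(fit) as a column to the data before grouping by manager, assuming the rows of the data line up with the rows used in the fit.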



      1. Thanks Jim! Greatly appreciate the quick response. Loving the book!

  3. I'm in Section 6.2.3 and getting the error below. Could a missing package be causing this?

    count_plot %+% run_value_by_count +
    + scale_fill_gradient2("xRV", low = grey10, high = crcblue,
    + mid = white)
    Error in count_plot %+% run_value_by_count :
    is.character(lhs) is not TRUE

    1. Robert, I just tried running the Chapter 6 code from the script posted on our GitHub site and I couldn’t reproduce your error. We are using the tidyverse suite of packages, but nothing else. Unfortunately, without having your computer in front of me, I am not sure what is creating the issue. Sorry not to be of more help. Jim
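One possible cause, offered only as a guess: the message "is.character(lhs) is not TRUE" looks like an argument check from a string-concatenation operator, which would suggest that another attached package defines its own %+% and is masking the one from ggplot2. The sketch below shows how to check which package the operator resolves to and how to call ggplot2's version explicitly. The data frames old_data and new_data are made-up stand-ins for the book's count data.

```r
library(ggplot2)

# Toy data frames standing in for the book's count data (hypothetical values).
old_data <- data.frame(x = 1:3, y = c(0.1, 0.2, 0.3))
new_data <- data.frame(x = 1:3, y = c(-0.2, 0.0, 0.4))

p <- ggplot(old_data, aes(x, y)) + geom_point()

# Which package does a bare %+% resolve to in this session?
# If this does not print "ggplot2", another package is masking the operator.
environmentName(environment(`%+%`))

# Call ggplot2's operator by its full name so masking cannot interfere.
p_new <- ggplot2::`%+%`(p, new_data)
```

If masking turns out to be the issue, restarting R and loading only the tidyverse before running the chapter script should also avoid it.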

  4. Can someone help me figure out how to make sense of this book? Chapter 2 is allegedly “Introduction to R,” and yet I can’t seem to find any instructions on how to actually access/import the data necessary to work alongside each exercise.

    1. All of the datasets described in the book are found in the data folder at Also there is a package ABSRdata that contains most of the datasets. You can install this package by following the instructions at

      Good luck — I know it can be challenging to get started.

  5. This book has been a great help, but I’ve got stuck in section 3.7.1. I don’t believe I have the same type as the person above, but any help is appreciated. Here is what I’m typing:


    It seems to me like this part of the code isn’t working. When I go to the next steps in the chapter for setting up the table, it comes up with 0 observations of 3 variables. I’ve read the Lahman database into R, so I’m not exactly sure what’s not connecting. Thanks!

    1. Nick:
      It seems that you didn’t type in the get_birthyear() function correctly — it should be as I’ve copied below.
      By the way, all of the code for the chapters can be found at.
      Best: Jim

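      get_birthyear <- function(Name) {
        # Split "First Last" into its two parts
        Names <- unlist(strsplit(Name, " "))
        # Master is the Lahman player table (called People in newer Lahman releases)
        Master %>%
          filter(nameFirst == Names[1],
                 nameLast == Names[2]) %>%
          mutate(birthyear = ifelse(birthMonth >= 7,
                                    birthYear + 1, birthYear),
                 Player = paste(nameFirst, nameLast)) %>%
          select(playerID, Player, birthyear)
      }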

  6. Can anyone explain this message when trying to install a package: “Do you want to install from sources the package which needs compilation?”

    Thanks in advance.

    1. Robert, packages come in two forms: precompiled binaries, and source packages that contain code (such as C++ or Fortran) that needs to be compiled. Most packages come precompiled, but if you have C++ and Fortran compilers on your computer, you can compile the source packages. Usually the packages that need compiling are the ones that were recently released. If you wait a few days, they will be available in the precompiled version.

  7. Excellent book; it is a great tool for the study of statistics and for making every baseball game even more interesting. I have been trying to follow the example of the Pythagorean expectation formula and how to obtain its exponent (pages 94-102), but I wonder what to do when a team ends a game with zero runs. In such cases, for example in the calculation of logRratio (page 95), log(0) is -Inf.
    > log(0)
    [1] -Inf
    What is the way to deal with this data: remove this log from the data, or calculate, for example, log(0.1)?

    1. Alfredo, thanks for the kind comments on the book. Usually we apply the Pythagorean expectation formula for a collection of games where the runs scored for and against are positive. I don’t think we use it for a single game.

      1. Thanks a lot for your response.

      2. What I was trying to do, as a statistical exercise, is to apply the formula to several games in a season to observe how the expectation of games won changes over the dates. My intention was to find the time when the prediction most closely matched the final outcome. I am looking at a short 49-game season in my country’s professional league. Currently, they are going through game 20. My first choice was to think of a sample size for all 49 games. But doing the exercise by dates, I plan to find the point at which the prediction was most closely matched. Kindly, I would like to ask if you can think of any recommendations for my exercise? Thank you.
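A minimal sketch of the date-by-date idea, assuming made-up scores for one team: apply the formula to running totals of runs scored and allowed rather than to single games. A shutout then stops being a problem once both running totals are positive. The exponent 2 is the classical Pythagorean choice and is an assumption here; the book estimates the exponent from data. The names games, cum_R, cum_RA, pyth_wpct, and log_R_ratio are illustrative.

```r
# Made-up runs scored (R) and allowed (RA) for one team over six games.
games <- data.frame(
  game = 1:6,
  R    = c(0, 5, 3, 0, 7, 2),
  RA   = c(2, 1, 4, 3, 0, 2)
)

# Running totals through each game.
games$cum_R  <- cumsum(games$R)
games$cum_RA <- cumsum(games$RA)

# Pythagorean expected winning percentage on the running totals
# (exponent k = 2 is an assumption, not a fitted value).
k <- 2
games$pyth_wpct <- games$cum_R^k / (games$cum_R^k + games$cum_RA^k)

# The log run ratio is finite only once both running totals are positive;
# earlier rows are set to NA instead of -Inf or Inf.
games$log_R_ratio <- ifelse(games$cum_R > 0 & games$cum_RA > 0,
                            log(games$cum_R / games$cum_RA), NA)
games
```

Comparing pyth_wpct at each game with the team's final winning percentage would then show how early in the season the running estimate settles near the eventual outcome.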
