Analyzing Baseball Data with R

Some information about the book Analyzing Baseball Data With R, 2nd edition by Max Marchi, Jim Albert, and Ben Baumer:

Some useful links for the book.

The official site at CRC Press.
The Amazon page for the book
The GitHub repository containing the datasets and the scripts used in the book.

150 responses

Lloyd Hill January 21, 2021 at 4:49 pm | Reply

Jim,

I hope all is well. I have continued working through the text and just have a question regarding Chapter 5, Exercise 3.

Is there any way you can confirm the run values for Rickie Weeks and Michael Bourn? I have 9.029 for Weeks and 7.360 for Bourn. I just want to be sure I am following along with the code properly. I used your code from the in-chapter exercises with Pujols.

Thanks so much,
Lloyd
1. Jim Albert January 21, 2021 at 7:46 pm | Reply
  
  Lloyd:
  
  Here’s what I have for Exercise 3 of Chapter 5:
  
  d2016 %>% filter(BAT_ID %in% c("eatoa002", "marts002"),
                   BAT_EVENT_FL == TRUE) %>%
    group_by(BAT_ID) %>%
    summarize(N = n(),
              M = mean(run_value),
              S = sum(run_value))
  ## # A tibble: 2 x 4
  ## BAT_ID N M S
  ##
  ## 1 eatoa002 706 0.0188 13.3
  ## 2 marts002 529 0.0179 9.48
  
  Hope this helps.
  
  Jim
Brandon Alfond February 19, 2021 at 7:40 pm | Reply

Hi Jim,

Hope all is well. I am working through Chapter 4 Exercise 3 (Manager Effect in Baseball) and ran into an issue running the solution. I get the following error running the code:
Error: Problem with `summarise()` input `Mean_Residual`.
x object ‘.resid’ not found
i Input `Mean_Residual` is `mean(.resid)`.
i The error occurred in group 1: playerID = “actama99”.

It seems that for whatever reason the Augment function is not adding the .resid column. Instead I only get the following:
> out out
# A tibble: 345 x 10
yearID teamID R RA .fitted .hat .sigma .cooksd .std.resid playerID

I am using the solutions dated 1/10/2019.

Any help or guidance on what is going wrong would be greatly appreciated.

Thanks,
Brandon
1. Jim Albert February 19, 2021 at 10:58 pm | Reply
  
  Hi Brandon:
  
  It appears that the broom package has changed how the augment() function behaves. I haven’t looked at it carefully, but I see that the data frame out has the variable .std.resid instead of .resid. So I think if you replace mean(.resid) with mean(.std.resid) it should work fine.
  
  I’ll make a correction on those solutions.
  
  Thanks.
  
  Jim
  1. Brandon Alfond February 20, 2021 at 11:32 pm
    
    Thanks Jim! Greatly appreciate the quick response. Loving the book!
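
For readers who hit the same error, here is a minimal sketch of one way to get raw residuals without depending on which columns augment() returns. It uses only base R, with the built-in mtcars data and a made-up grouping as stand-ins (assumptions; the exercise's own data and model are not shown in this thread). Note that .std.resid holds standardized residuals, which are on a different scale from the raw residuals in .resid.

```r
# Stand-in model on built-in data; the exercise's actual model is not shown in the thread.
fit <- lm(mpg ~ wt, data = mtcars)

# Build an augment()-like data frame by hand.
out <- mtcars[, c("mpg", "wt")]
out$.fitted <- fitted(fit)

# Raw residual = observed response minus fitted value.
out$.resid <- out$mpg - out$.fitted

# Mean residual per group (grouped by cylinder count purely for illustration).
mean_resid <- tapply(out$.resid, mtcars$cyl, mean)
print(mean_resid)
```

Computing the residual as observed minus fitted keeps the summary step the same as in the original solution, since a .resid column is available again.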
Robert H Carden February 20, 2021 at 9:47 pm | Reply

I’m in Section 6.2.3 and getting the error below. Could a missing package be causing this?

count_plot %+% run_value_by_count +
+ scale_fill_gradient2(“xRV”, low = grey10, high = crcblue,
+ mid = white)
Error in count_plot %+% run_value_by_count :
is.character(lhs) is not TRUE
1. Jim Albert February 21, 2021 at 1:30 pm | Reply
  
  Robert, I just tried running that Chapter 6 from the script posted on our Github site and I couldn’t reproduce your error. We are using the tidyverse suite of packages, but nothing else. Unfortunately, without having your computer in front of me, I am not sure what is creating the issue. Sorry not to be of more help. Jim
BENZI BLATMAN June 3, 2021 at 12:08 am | Reply

Can someone help me figure out how to make sense of this book? Chapter 2 is allegedly “Introduction to R,” and yet I can’t seem to find any instructions on how to actually access/import the data necessary to work alongside each exercise.
1. Jim Albert June 3, 2021 at 2:02 am | Reply
  
  All of the datasets described in the book are found in the data folder at https://github.com/beanumber/baseball_R. There is also a package, ABWRdata, that contains most of the datasets. You can install this package by following the instructions at https://github.com/bayesball/ABWRdata
  
  Good luck — I know it can be challenging to get started.
Nick June 19, 2021 at 11:55 pm | Reply

This book has been a great help, but I’ve got stuck in section 3.7.1. I don’t believe I have the same type as the person above, but any help is appreciated. Here is what I’m typing:

get_birthyear<-function(Name){
Names%
filter(nameFirst==Names[1],
nameLast==Names[2])%>%
mutate(birthyear=ifelse(birthMonth>=7,
birthYear+1,birthYear),
Player=paste(nameFirst,nameLast))%>%
select(playerID,Player,birthyear)
}

It seems to me like this part of the code isn’t working. When I go to the next steps in the chapter for setting up the table, it comes up with 0 observations of 3 variables. I’ve read the Lahman database into R, so I’m not exactly sure what’s not connecting. Thanks!
1. Jim Albert June 20, 2021 at 12:08 am | Reply
  
  Nick:
  It seems that you didn’t type in the get_birthyear() function correctly; it should be as I’ve copied below.
  By the way, all of the code for the chapters can be found at https://github.com/beanumber/baseball_R/tree/master/chapter_code
  Best: Jim
  
  get_birthyear <- function(Name) {
    Names <- unlist(strsplit(Name, " "))
    Master %>%
      filter(nameFirst == Names[1],
             nameLast == Names[2]) %>%
      mutate(birthyear = ifelse(birthMonth >= 7,
                                birthYear + 1, birthYear),
             Player = paste(nameFirst, nameLast)) %>%
      select(playerID, Player, birthyear)
  }
Robert Carden June 20, 2021 at 1:53 pm | Reply

Can anyone explain this message when trying to install a package: “Do you want to install from sources the package which needs compilation?”

Thanks in advance.
1. Jim Albert June 20, 2021 at 10:38 pm | Reply
  
  Robert, packages come in two forms: precompiled binaries and source packages, which contain code (such as C++ or Fortran) that needs to be compiled. Most packages come precompiled, but if you have C++ and Fortran compilers on your computer, you can compile the source packages yourself. Usually the packages that need compiling are ones that were recently released; if you wait a few days, they will be available in precompiled form.
Alfredo November 13, 2021 at 12:02 pm | Reply

Excellent book; it is a great tool for the study of statistics, and it makes every baseball game even more interesting. I have been trying to follow the example of the Pythagorean expectation formula and how to obtain its exponent (pages 94-102), but I wonder what to do when a team finishes a game with zero runs. In such cases, for example in the calculation of logRratio (page 95), log(0) is -Inf.
> log(0)
[1] -Inf
What is the way to deal with this data: remove this log from the data, or calculate, for example, log(0.1)?
Thanks
1. Jim Albert November 15, 2021 at 2:38 pm | Reply
  
  Alfredo, thanks for the kind comments on the book. Usually we apply the Pythagorean expectation formula for a collection of games where the runs scored for and against are positive. I don’t think we use it for a single game.
  1. Alfredo November 16, 2021 at 7:46 am
    
    Thanks a lot for your response.
  2. Alfredo November 16, 2021 at 7:56 am
    
    What I was trying to do, as a statistical exercise, is to apply the formula to several games in a season to observe how the expectation of games won changes over the dates. My intention was to find the time when the prediction most closely matched the final outcome. I am looking at a short 49-game season in my country’s professional league. Currently, they are going through game 20. My first choice was to think of a sample size for all 49 games. But doing the exercise by dates, I plan to find the point at which the prediction was most closely matched. Kindly, I would like to ask if you can think of any recommendations for my exercise? Thank you.
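
A minimal sketch of the date-by-date idea described above, under stated assumptions: the game log is made up, the exponent is fixed at the classic value of 2 rather than estimated, and the column names are hypothetical. Working with running season totals instead of single games means a shutout does not produce log(0), provided the team has scored at least once so far.

```r
# Hypothetical game log for one team (made-up numbers, including a shutout in game 2).
games <- data.frame(
  game = 1:6,
  R    = c(3, 0, 5, 2, 7, 4),  # runs scored
  RA   = c(2, 4, 1, 3, 3, 6),  # runs allowed
  W    = c(1, 0, 1, 0, 1, 0)   # 1 = win, 0 = loss
)

# Running season totals through each game.
games$cum_R  <- cumsum(games$R)
games$cum_RA <- cumsum(games$RA)
games$cum_W  <- cumsum(games$W)

# Log run ratio on the totals: finite as long as both totals are positive.
games$logRratio <- log(games$cum_R / games$cum_RA)

# Pythagorean winning percentage with exponent 2 (an assumption),
# converted to expected wins through each game.
k <- 2
games$pyth_wpct <- games$cum_R^k / (games$cum_R^k + games$cum_RA^k)
games$pyth_W    <- games$pyth_wpct * games$game

print(games[, c("game", "cum_W", "pyth_W", "logRratio")])
```

Comparing cum_W with pyth_W row by row shows how the expectation tracks actual wins as the season progresses. If a team is shut out in its opening games, the running total is still zero at that point and those early rows would need to be skipped.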
Justin Cassidy October 8, 2022 at 11:25 pm | Reply

I apologize as I am only on chapter 2 but already having a problem while following along with Warren Spahn’s csv. As I run this code, I get to
install.packages("tidyverse")
library(tidyverse)
library(Lahman)
getwd()
spahn <- read_csv("data/spahn.csv")
While read.csv works for reading data/spahn.csv, read_csv produces this error:
Error in `vec_as_location()`:
! `...` must be empty.
x Problematic argument:
* call = call
Run `rlang::last_error()` to see where the error occurred.

I am fairly sure my working directory is set up correctly so I thought it might be something else wrong. I am completely lost and am just getting back to coding so any help would be greatly appreciated. Thank you!
1. Jim Albert October 10, 2022 at 12:03 am | Reply
  
  Justin, sorry, but there is no simple answer to that particular error message. I can’t reproduce that on my laptop. I’d suggest using read.csv() instead of read_csv() to read in data files; both functions read a CSV file into a data frame (read_csv() returns a tibble). Jim
aves25 March 15, 2023 at 12:56 am | Reply

Hello!
I was wondering how to get the hofbaseball.csv file to use for Chapter 3. I know it wasn’t in the original files when I downloaded the csv files for 2017, and I can’t seem to figure out how to get hold of it.
1. Jim Albert March 16, 2023 at 7:59 pm | Reply
  
  In Chapter 3, I don’t believe there is a hofbaseball.csv file mentioned, but there are hofbatting.csv and hofpitching.csv. These two files are available in the data folder on GitHub: https://github.com/beanumber/baseball_R/tree/master/data
Jayson Stancil May 22, 2023 at 1:32 pm | Reply

Hi,
I recently bought the book, and love the work you’ve done. However, I am running into many issues because I believe the latest version of the Lahman package in R removed the Master table. Is there a workaround for this, or am I doing something incorrectly?
1. Jim Albert May 22, 2023 at 1:34 pm | Reply
  
  Jayson, in the Lahman package, the Master data frame has been renamed as the People data frame. Jim
  1. Jayson Stancil May 26, 2023 at 12:50 pm
    
    You’re a lifesaver. Thanks again!
Addison McGhee February 27, 2024 at 6:17 pm | Reply

Hi Jim!

Love the book!

I wanted to ask for some clarification on the first “Baseball Question” in section 1.2.8.

What I wanted to ask was how the “average number of home runs per game recorded in each decade” is calculated. Specifically, how are you obtaining the values 0.3, 0.8, and 2.2 in the paragraph below the question?

My approach was to use the Teams dataset and group_by the variable yearID. I thought that “home runs per game” could be calculated by taking the total number of home runs and dividing by the total number of games. But my results didn’t match what you had.

Thanks!
1. Jim Albert February 27, 2024 at 9:14 pm | Reply
  
  Addison:
  
  What you did was fine. But you were computing the average number of home runs per team per game. Since two teams play in each game, the average number of home runs per game is double what you are finding.
  
  Jim
  1. addisonmcg99 February 27, 2024 at 9:39 pm
    
    Hi Jim, Thanks for the quick reply!
    
    Is this correct, or do I need to further group by decade?
    
    View(
      Teams %>%
        group_by(yearID) %>%
        summarise(total_games = sum(G),
                  total_hr = sum(HR),
                  hr_pg = 2 * total_hr / total_games)
    )
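
On the decade question, a minimal sketch of how the same calculation could be grouped by decade rather than by year. It uses base R and a small made-up data frame with Teams-style columns (yearID, G, HR) so that it runs without the Lahman package; the numbers are hypothetical. The factor of 2 follows the explanation above: summing G over teams counts every game twice.

```r
# Made-up rows in the shape of the Teams table (one row per team-season).
teams <- data.frame(
  yearID = c(1921, 1921, 1925, 1925, 1961, 1961),
  G      = c(154, 154, 154, 154, 162, 162),
  HR     = c(40, 60, 80, 70, 150, 180)
)

# Decade label: 1921 -> 1920, 1961 -> 1960.
teams$decade <- 10 * floor(teams$yearID / 10)

# Total games and home runs within each decade.
by_decade <- aggregate(cbind(G, HR) ~ decade, data = teams, FUN = sum)

# Each game appears once per team in G, so home runs per game is 2 * HR / G.
by_decade$hr_pg <- 2 * by_decade$HR / by_decade$G
print(by_decade)
```

In a dplyr pipeline the decade column could be created with mutate() and then used in group_by() in place of yearID.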
