Some information about the book *Analyzing Baseball Data With R, 2nd edition* by Max Marchi, Jim Albert, and Ben Baumer:

Some useful links for the book.

- The official site at CRC Press.
- The Amazon page for the book.
- The GitHub repository containing the datasets and the scripts used in the book.

[…] my foray into R with baseball is a neat graphic based on a recent post from the authors of Analyzing Baseball With R. They use the R statistical programming language to go through the copious amount of baseball […]

Question 7 Chapter 3, p. 85 asks you to pull Pete Rose’s info, but from what I can tell, the function “getinfo” doesn’t work for two players with the exact same name (junior), or am I wrong?

Thanks for sharing

Yeah, I’ll try to fix this and then make the new function available — thanks.

Errata link seems to be broken

Aaron — thanks for noticing that — it should work now.

Hello! In regard to Chapter 9, I would like more information (further reading) about these topics:

1) How to build a transition matrix to simulate a complete game between two teams, taking into account all offensive (batters/runners) and defensive (pitchers/fielding) strengths.

2) The method used in section 9.2.6 to estimate the transition matrix. The book says: “The description of this methodology is beyond the level of this book…” but no further reading or reference is given.

I am very interested in these topics and would appreciate some references to continue my research.

Thanks in advance. Sergio.

Sergio, you might find this article helpful. https://content.iospress.com/articles/journal-of-sports-analytics/jsa0001

Jim

Hello! In regard to the Bradley-Terry model (chapter 9).

1) Section 9.4: Further reading. The reference to “Chapter 9 of Albert and Bennett (2003)” seems to be wrong, as in my copy of the book the Bradley-Terry model is developed in “Chapter 12, Did the best team win?”. Maybe I have a different edition.

2) After reading Chapter 9 of “Analyzing Baseball Data With R”, I jumped to Chapter 12, “Did the best team win?”, of Curve Ball, hoping to find how to calculate the “Talent(t)” of teams. I did not find anything about it. The only way I know is the “log5 model by Bill James”. I have thought about maximizing the likelihood to find the “teams talent (t)”, but I would like to ask for some reading before developing something on my own.

So, is there any other approach to calculating the talents? Can anyone help me with further reading about it?

Many thanks in advance!

Sergio.

Sergio: As I recall, I used a value of the standard deviation of the Bradley Terry talent distribution so that predicted w/l records of the simulated data resembled the observed w/l records. One could formally fit this B-T model and estimate the standard deviation, but I believe I used this empirical approach to estimate the standard deviation. This type of B-T model is typically used in paired comparison models and there are some Bayesian papers on this.
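The empirical approach described in this reply can be sketched roughly as follows. This is only an illustration, not the code used for the book: the balanced schedule, the grid of candidate standard deviations, and the target value `target_sd` are all hypothetical. It uses the Bradley-Terry win probability exp(tA) / (exp(tA) + exp(tB)), which equals `plogis(tA - tB)` in R.

```r
# Sketch: pick the talent SD whose simulated winning percentages are about
# as spread out as observed ones. All numeric settings are assumptions.
set.seed(2024)

simulate_winpct_sd <- function(s, n_teams = 30, games_per_pair = 6) {
  talent <- rnorm(n_teams, mean = 0, sd = s)   # normally distributed talents
  pairs <- t(combn(n_teams, 2))                # every pair of teams
  # Bradley-Terry probability that the first team in each pair wins a game
  p <- plogis(talent[pairs[, 1]] - talent[pairs[, 2]])
  wins_first <- rbinom(nrow(pairs), size = games_per_pair, prob = p)
  wins <- numeric(n_teams)
  for (k in seq_len(nrow(pairs))) {
    wins[pairs[k, 1]] <- wins[pairs[k, 1]] + wins_first[k]
    wins[pairs[k, 2]] <- wins[pairs[k, 2]] + games_per_pair - wins_first[k]
  }
  # spread of the simulated winning percentages
  sd(wins / (games_per_pair * (n_teams - 1)))
}

candidate_sd <- c(0.1, 0.2, 0.3, 0.4)
sim_sd <- sapply(candidate_sd, function(s) {
  mean(replicate(50, simulate_winpct_sd(s)))
})

target_sd <- 0.07  # hypothetical observed SD of team winning percentages
best_sd <- candidate_sd[which.min(abs(sim_sd - target_sd))]
data.frame(candidate_sd, sim_sd)
```

In practice one would replace `target_sd` with the standard deviation computed from actual standings and use a finer grid of candidate values.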

Jim –

In chapter 1, you guys state that “In 2011, hitters compiled a .253 batting average on plate appearances where they fell behind 0-2. Conversely they hit .479 after going ahead 2-0.” I’m trying to replicate those numbers and even using the pbp11rc.csv file, I can’t even come close. Instead of batting average, did you mean OBP?

I’m trying to add an “Age” column in the Lahman batting.csv file. My idea is that I can use a combination of getinfo and the sapply function. I’m comfortable using the getinfo function for individual players. I’ve attempted to adapt the function to do this but I’m struggling. Any suggestions?

Much appreciated. I’m really enjoying this book so far!

Just got the new version of the book. In section 2.7 (p52), I’m getting an error with “object ‘crcblue’ not found.” Is that a color for the chart? I’ve tried uninstalling and reinstalling tidyverse and retyping the code and I can’t get through it. Any advice?

Sorry, we neglected to define that color code in our code — just add the code

crcblue <- "#2905a1"

Thanks.

I hate that I’m stumped on the first question I came across, and it seems straightforward, but for the average number of home runs per game recorded in each decade the answer says that the first two decades had 0.3 HRs per game. In the 1870s I get 356 HRs over 4,062 games for 0.09 per game, and in the 1880s 3,773 HRs over 17,484 games for 0.22 per game. What am I missing?

It seems that you double-counted games. When you sum the games variable, it will be twice as large as the actual number of games since 2 teams are in each game. With this correction, you will get reasonable looking values.
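As a small illustration of this correction, the sketch below halves the summed team-game counts before dividing. The home run and game totals are simply the ones quoted in the question above; the data frame and its column names are made up for the example.

```r
# Totals quoted in the question: home runs and summed team-game counts
decades <- data.frame(
  decade = c("1870s", "1880s"),
  HR = c(356, 3773),
  team_games = c(4062, 17484)
)

# Each game involves two teams, so the actual number of games is half
# of the summed team-game count
decades$games <- decades$team_games / 2
decades$HR_per_game <- decades$HR / decades$games
decades
```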

Dumb mistake on my part, I should have realized that error. Thank you very much for the response. I am enjoying the book a great deal!

Hi Jim,

In Chapter 7, when modeling called strike percentage, the book limits the PITCHf/x data to type == “S”, but from the description variable in the first column, it looks like “S” includes strikes of any kind. Rather than “S” and “X” being “called strikes” and “strikes” respectively as noted in the book, I’m pretty sure they stand for “strikes” and “balls in play.”

Thanks

Judah, I think you’re right — I’ll check with Ben. Thanks for mentioning the issue.

On page 84 the code goes as follows:

HRdata <- HRdata %>%

mutate(Age = yearID - birthyear) %>%

select(Player, Age, HR) %>%

mutate(CHR = cumsum(HR))

The graph, however, seems to compute the cumulative sum across every player’s HR totals. Is there a way to apply the cumsum function to the individual players? Thank you for the awesome book.

Christian:

To do this for individual players, just add a group_by(Player) line after the first mutate() line, so that it comes before the mutate() that computes cumsum(HR).
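A minimal sketch of the grouped version, using a small made-up data frame in place of the book's HRdata (the players, years, and home run counts are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: two players, three seasons each
HRdata <- data.frame(
  Player = rep(c("A", "B"), each = 3),
  yearID = rep(2001:2003, times = 2),
  birthyear = rep(c(1975, 1980), each = 3),
  HR = c(10, 20, 30, 5, 15, 25)
)

HRdata <- HRdata %>%
  mutate(Age = yearID - birthyear) %>%
  select(Player, Age, HR) %>%
  group_by(Player) %>%          # restart the running total for each player
  mutate(CHR = cumsum(HR)) %>%
  ungroup()

HRdata
```

Without the group_by() line, cumsum() runs down the whole HR column, so the second player's totals would continue from where the first player's ended.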

Hi guys. What a great book. Not sure if this is a great forum for questions but I’ll give it a shot. In chapter 6 it uses a “Cabrera” data set that has PitchFx data plus some batted ball location data. For these pre-loaded data sets, I’ve found it fun to see if I can recreate the same data from the source. I was able to do this for the PitchFx data, but I can’t find the batted ball data anywhere? Is that publicly available? It looks like Baseball Savant might have this data, but just in summary form and not in a database you can download. Thanks again for the great book. This is from the 1st edition if it matters

Tim, you can load pitch-by-pitch data from Statcast using the baseballr package. This gives you all of the pitch information together with the batted ball measurements.

Neat, can’t wait to look at that. But I am having trouble installing. I seem to be having the same issue described here which was never answered:

https://stackoverflow.com/questions/55938608/cant-install-baseballr-package

Although for me the error is “cannot remove prior installation of package ‘rlang’”

Hello,

Where can I find the pbp2016rc data? I do not see it in the file.

Herb, that file is from the 1st edition and can be found at https://github.com/maxtoki/baseball_R/tree/master/data

Jim

Hi, Jim. I’m fairly new to using R and have little coding background. I bought the book, downloaded R and the .csv files so far. I was able to download the Lahman package as well, but I can’t seem to get R to read the csv files. The error I get is

Error in source("/Users/caseystevens/Desktop/baseball_R-master/baseballdatabank-master/core/People.csv") :

/Users/caseystevens/Desktop/baseball_R-master/baseballdatabank-master/core/People.csv:1:9: unexpected ‘,’

1: playerID,

^

I can open it in Numbers just fine, but when I select a csv file and ‘Open With’ the data looks messy. What can I do to get this fixed?

Casey, if the csv files are in the working directory, then you read them into R by statements like

data <- read.csv("filename.csv")

It appears that you were trying to use the source() function, which is used to read in R functions. I would try some of the sample scripts from the chapters to get started.

Jim
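To make the difference concrete, here is a small self-contained sketch. It writes a tiny two-row CSV to a temporary file (the file and its two columns are made up for the example, not the real People.csv), reads it with read.csv(), and then shows that source() fails on the same file because it tries to parse the contents as R code.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("playerID,birthYear", "aaronha01,1934", "ruthba01,1895"),
  csv_path
)

# read.csv() parses the file as comma-separated data
people <- read.csv(csv_path)
str(people)

# source() treats the file as R code, so the header line fails to parse
source_result <- tryCatch(
  source(csv_path),
  error = function(e) "parse error"
)
source_result
```

For the real files, the same read.csv() call applies once the working directory (see getwd() and setwd()) points at the folder containing them, or when a full path is given.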

The GitHub page for the 2nd edition of the book only has exercise solutions for 6 of the 14 chapters. Is there anywhere the rest of the solutions can be found?

Daniel, sorry but I think that’s all we have currently available in terms of solutions. Jim

When running the following code from the top of page 173 in the 2nd edition:

mod_a <- glmer(type == "S" ~ strike_prob + (1|fielder_2),
               data = sc_taken, family = binomial)

mod_a is being created as . I’m typing the code exactly as it is written in the text so I’m not sure why this would be occurring. Any help on obtaining the proper code would be much appreciated.

Thomas: Here is similar, more concise code that seems to work. I hope this is helpful.

# load several packages

library(tidyverse)

library(mgcv)

library(lme4)

# read in the 2017 Statcast data

sc_2017 <- read_csv("../data/statcast2017.csv")

# only consider the called pitches

taken <- filter(sc_2017, type != "X")

# model the probability of a strike

strike_mod <- gam(type == "S" ~ s(plate_x, plate_z),
                  family = binomial,
                  data = sample_n(taken, size = 100000))

# work with a sample of 10,000 pitches

sample_taken <- sample_n(taken, size = 10000)

# define new variable that predicts the probability

# of a strike

sample_taken <- mutate(sample_taken,
                       strike_prob = predict(strike_mod,
                                             newdata = sample_taken,
                                             type = "response"))

# fits random effects model using strike_prob as

# covariate and catcher random effects

mod_a <- glmer(type == "S" ~ strike_prob +
                 (1|pos2_person_id),
               data = sample_taken, family = binomial)

For the chapters with no solutions, is there a way to check my solutions? Thank you. By the way, I bought your book, “Teaching Statistics with Baseball” and wonder if there are solutions for it as well.

Stan, solutions to the exercises for some chapters of ABDR are available at https://github.com/beanumber/baseball_R

For my teaching stats using baseball book, just send me an email at albert@bgsu.edu, explaining that you are not a student in a class and I can send you solutions.