Analyzing Baseball Data with R

Some information about the book Analyzing Baseball Data With R, 2nd edition by Max Marchi, Jim Albert, and Ben Baumer:

Some useful links for the book.


128 responses

  1. […] my foray into R with baseball is a neat graphic based on a recent post from the authors of Analyzing Baseball With R.  They use the R statistical programming language to go through the copious amount of baseball […]

  2. Question 7 Chapter 3, p. 85 asks you to pull Pete Rose’s info, but from what I can tell, the function “getinfo” doesn’t work for two players with the exact same name (junior), or am I wrong?

  3. Yeah, I’ll try to fix this and then make the new function available — thanks.

  4. Errata link seems to be broken

    1. Aaron — thanks for noticing that — it should work now.

  5. Sergio Marrero Marrero

    Hello! Regarding Chapter 9, I would like more information (further reading) on these topics:

    1) How to build a transition matrix to simulate a complete game between two teams, taking into account all offensive (batters/runners) and defensive (pitchers/fielding) strengths.
    2) The method used in Section 9.2.6 to estimate the transition matrix. The book says “The description of this methodology is beyond the level of this book…” but gives no further reading or reference.

    I am very interested in these topics and would appreciate some references to continue my research.

    Thanks in advance. Sergio.

  6. Hello! Regarding the Bradley-Terry model (Chapter 9):

    1) Section 9.4, Further reading. The reference to “Chapter 9 of Albert and Bennett (2003)” seems to be wrong: in my copy of the book, the Bradley-Terry model is developed in “Chapter 12, Did the best team win?”. Maybe I have a different edition.

    2) After working through Chapter 9 of Analyzing Baseball Data with R, I jumped to Chapter 12 of Curve Ball, “Did the best team win?”, hoping to find out how to calculate the “talent (t)” of teams. I did not find anything about it. The only method I know is the log5 model by Bill James. I have thought about maximizing the likelihood to estimate the team talents, but I would like to ask for some reading before developing something on my own.

    So, is there any other approach to calculating the talents? Can anyone help me with further reading about it?

    Many thanks in advance!


    1. Sergio: As I recall, I used a value of the standard deviation of the Bradley Terry talent distribution so that predicted w/l records of the simulated data resembled the observed w/l records. One could formally fit this B-T model and estimate the standard deviation, but I believe I used this empirical approach to estimate the standard deviation. This type of B-T model is typically used in paired comparison models and there are some Bayesian papers on this.
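
      A minimal sketch of the empirical approach described in this reply. It assumes the usual Bradley-Terry form, in which a team with talent t_A beats a team with talent t_B with probability exp(t_A) / (exp(t_A) + exp(t_B)), and normally distributed talents centered at zero. The league size, the toy round-robin schedule, and the value s_talent = 0.20 are illustrative assumptions, not values confirmed in this thread.

      ```r
      set.seed(1)
      n_teams  <- 30     # assumed league size
      s_talent <- 0.20   # assumed talent SD; adjust until simulated records resemble observed ones
      talent   <- rnorm(n_teams, mean = 0, sd = s_talent)

      # Bradley-Terry probability that a team with talent t_a beats one with talent t_b
      bt_prob <- function(t_a, t_b) exp(t_a) / (exp(t_a) + exp(t_b))

      # Toy schedule: every ordered pair of distinct teams meets 3 times
      games <- subset(expand.grid(home = 1:n_teams, away = 1:n_teams), home != away)
      games <- games[rep(seq_len(nrow(games)), times = 3), ]

      # Simulate each game as a coin flip weighted by the Bradley-Terry probability
      p_home   <- bt_prob(talent[games$home], talent[games$away])
      home_win <- rbinom(nrow(games), size = 1, prob = p_home)

      # Winning percentage per team (each team plays 2 * 3 * (n_teams - 1) games)
      wins    <- tapply(home_win, games$home, sum) + tapply(1 - home_win, games$away, sum)
      win_pct <- wins / (2 * 3 * (n_teams - 1))

      # Spread of simulated winning percentages, to compare with observed standings
      sd(win_pct)
      ```

      Rerunning with different values of s_talent and comparing sd(win_pct) with the spread of real winning percentages is one way to tune the standard deviation without formally fitting the model.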

  8. Jim –

    In chapter 1, you guys state that “In 2011, hitters compiled a .253 batting average on plate appearances where they fell behind 0-2. Conversely they hit .479 after going ahead 2-0.” I’m trying to replicate those numbers and even using the pbp11rc.csv file, I can’t even come close. Instead of batting average, did you mean OBP?

  9. I’m trying to add an “Age” column in the Lahman batting.csv file. My idea is that I can use a combination of getinfo and the sapply function. I’m comfortable using the getinfo function for individual players. I’ve attempted to adapt the function to do this but I’m struggling. Any suggestions?

    Much appreciated. I’m really enjoying this book so far!

  10. Just got the new version of the book. In section 2.7 (p52), I’m getting an error with “object ‘crcblue’ not found.” Is that a color for the chart? I’ve tried uninstalling and reinstalling tidyverse and retyping the code and I can’t get through it. Any advice?

    1. Sorry, we neglected to define that color code in our code — just add the code
      crcblue <- "#2905a1".


      1. Can you share what the entire snippet of code should look like? I was having same error until I tried this suggestion but no luck. I have following:

        crcblue <- "#2905a1" followed by:
        ggplot(ws2, aes=(x = wins + losses)) + geom_bar(fill = crcblue) + labs(x= "Number of Games", y= "Frequency")

        After which I receive this message:

        "Error: stat_count() requires an x or y aesthetic.
        Run `rlang::last_error()` to see where the error occurred."

      2. I figured it out. One d*mn character bites you every time.
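
        For anyone hitting the same message: the error is consistent with the stray `=` in `aes=(...)`, which leaves the plot with no x mapping. A sketch of the call with `aes()` written as a function call, using a small made-up `ws2` data frame (the real one is built earlier in the chapter):

        ```r
        library(ggplot2)

        crcblue <- "#2905a1"

        # Hypothetical stand-in for ws2: one row per series, with wins and losses
        ws2 <- data.frame(wins = c(4, 4, 4, 4, 4), losses = c(0, 1, 2, 3, 3))

        # aes() is called as a function, so x is mapped to wins + losses
        p <- ggplot(ws2, aes(x = wins + losses)) +
          geom_bar(fill = crcblue) +
          labs(x = "Number of Games", y = "Frequency")
        p
        ```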

  11. I hate that I’m stumped on the first question I came across, and it seems straightforward, but for the average number of home runs per game recorded in each decade the answer says that the first two decades had 0.3 HRs per game. In the 1870s I get 356 HRs over 4,062 games for 0.09 per game, and in the 1880s 3,773 HRs over 17,484 games for 0.22 per game. What am I missing?

    1. It seems that you double-counted games. When you sum the games variable, it will be twice as large as the actual number of games since 2 teams are in each game. With this correction, you will get reasonable looking values.
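
      A toy illustration of the double counting, using two made-up teams that only play each other (the team IDs and totals are invented for the example):

      ```r
      # Hypothetical team-season rows: two teams that played each other 10 times
      teams <- data.frame(teamID = c("AAA", "BBB"), G = c(10, 10), HR = c(7, 5))

      team_games   <- sum(teams$G)     # 20: every game is counted once per team
      actual_games <- team_games / 2   # 10 games were actually played

      # Home runs per game, using the corrected denominator
      hr_per_game <- sum(teams$HR) / actual_games
      hr_per_game
      ```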

      1. Dumb mistake on my part, I should have realized that error. Thank you very much for the response. I am enjoying the book a great deal!

  12. Hi Jim,

    In Chapter 7, when modeling called strike percentage, the book limits the PITCHf/x data to type == “S”, but from the description variable in the first column, it looks like “S” includes strikes of any kind. Rather than “S” and “X” being “called strikes” and “strikes” respectively as noted in the book, I’m pretty sure they stand for “strikes” and “balls in play.”


    1. Judah, I think you’re right — I’ll check with Ben. Thanks for mentioning the issue.

  13. On page 84 the code goes as follows:
    HRdata %>%
    mutate(Age = yearID - birthyear) %>%
    select(Player, Age, HR) %>%
    mutate(CHR = cumsum(HR))

    The graph, however, seems to compute the cumulative sum across all players’ HR totals combined. Is there a way to apply the cumsum function to each player separately? Thank you for the awesome book.

    1. Christian:

      To do this for the individual players, you just add a group_by(Player) line after the first mutate() line.

    2. To get these summaries for individual players, you just insert a group_by(Player) before the mutate operation.
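
      A small sketch of that fix on made-up data (the players and home run totals are invented; only the group_by() placement matters):

      ```r
      library(dplyr)

      # Hypothetical season totals for two players
      HRdata <- tibble(
        Player = c("A", "A", "A", "B", "B"),
        Age    = c(20, 21, 22, 20, 21),
        HR     = c(10, 20, 30, 5, 15)
      )

      # group_by(Player) makes cumsum() restart for each player
      HR_cum <- HRdata %>%
        group_by(Player) %>%
        mutate(CHR = cumsum(HR)) %>%
        ungroup()
      HR_cum
      ```

      Without the group_by() line, CHR would keep accumulating from player A into player B.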

  14. Hi guys. What a great book. Not sure if this is a great forum for questions but I’ll give it a shot. In chapter 6 it uses a “Cabrera” data set that has PitchFx data plus some batted ball location data. For these pre-loaded data sets, I’ve found it fun to see if I can recreate the same data from the source. I was able to do this for the PitchFx data, but I can’t find the batted ball data anywhere? Is that publicly available? It looks like Baseball Savant might have this data, but just in summary form and not in a database you can download. Thanks again for the great book. This is from the 1st edition if it matters

    1. Tim, you can load pitch-by-pitch data from Statcast using the baseballr package. This gives you all of the pitch information together with the batted ball measurements.

      1. Neat, can’t wait to look at that. But I am having trouble installing. I seem to be having the same issue described here which was never answered:

        Although for me the error is “cannot remove prior installation of package ‘rlang’”

  15. Hello,

    Where can I find the pbp2016rc data? I do not see it in the file.

    1. Herb, that file is from the 1st edition and can be found at


    2. That is showing only the 2011 file not the 2016, unless I am missing something?

      1. Nick: Can you clarify — what edition and what data file are you looking for? Thanks. Jim

      2. I am having the same problem as well. That link leads to 2011 play by play data. Not sure where I can get the pbp2016rc

      3. Lou, that data frame pbp2016rc is created from the pbp2016 dataset. See the work in the R script


  16. Hi, Jim. I’m fairly new to using R and have little coding background. I bought the book, downloaded R and the csv. files so far. I was able to download the Lahman package as well, but I can’t seem to get R to read the csv files? The error I get is Error in source(“/Users/caseystevens/Desktop/baseball_R-master/baseballdatabank-master/core/People.csv”) :
    /Users/caseystevens/Desktop/baseball_R-master/baseballdatabank-master/core/People.csv:1:9: unexpected ‘,’
    1: playerID,

    I can open it in Numbers just fine, but when I select a csv file and ‘Open With’ the data looks messy. What can I do to get this fixed?

    1. Casey, if the csv files are in the working directory, then you read them into R with statements like
      data <- read.csv("filename.csv")
      It appears that you were trying to use the source() function, which is used to read in R scripts and functions. I would try some of the sample scripts from the chapters to get started. Jim
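
      A self-contained sketch of the difference, using a temporary file with invented contents in place of People.csv:

      ```r
      # Write a tiny made-up CSV to a temporary file
      csv_path <- tempfile(fileext = ".csv")
      writeLines(c("playerID,birthYear", "examp01,1900", "sampl01,1950"), csv_path)

      # source() tries to parse the file as R code, which fails on the header line
      result <- tryCatch({
        source(csv_path)
        "parsed as R code"
      }, error = function(e) "parse error")
      result

      # read.csv() parses the same file as comma-separated data
      people <- read.csv(csv_path, stringsAsFactors = FALSE)
      str(people)
      ```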

  17. The GitHub page for the 2nd edition of the book only has exercise solutions for 6 of the 14 chapters. Is there anywhere the rest of the solutions can be found?

    1. Daniel, sorry but I think that’s all we have currently available in terms of solutions. Jim

  18. When running the following code from the top of page 173 in the 2nd edition:

    mod_a <- glmer(type == "S" ~ strike_prob + (1|fielder_2),
    data = sc_taken, family = binomial)

    mod_a is being created as . I’m typing the code exactly as it is written in the text so I’m not sure why this would be occurring. Any help on obtaining the proper code would be much appreciated.

    1. Thomas: Here is similar more concise code that seems to work — hope this is helpful.

      # load several packages
      library(readr)   # read_csv()
      library(dplyr)   # filter(), mutate(), sample_n()
      library(mgcv)    # gam()
      library(lme4)    # glmer()

      # read in the 2017 Statcast data
      sc_2017 <- read_csv("../data/statcast2017.csv")

      # only consider the called pitches
      taken <- filter(sc_2017, type != "X")

      # model the probability of a strike
      strike_mod <- gam(type == "S" ~ s(plate_x, plate_z),
                        family = binomial,
                        data = sample_n(taken, size = 100000))

      # work with a sample of 10,000 pitches
      sample_taken <- sample_n(taken, size = 10000)

      # define new variable that predicts the probability
      # of a strike
      sample_taken <- mutate(sample_taken,
                             strike_prob = predict(strike_mod,
                                                   newdata = sample_taken,
                                                   type = "response"))

      # fits random effects model using strike_prob as
      # covariate and catcher random effects
      mod_a <- glmer(type == "S" ~ strike_prob + (1 | fielder_2),
                     data = sample_taken, family = binomial)

  19. For the chapters with no solutions, is there a way to check my solutions? Thank you. By the way, I bought your book, “Teaching Statistics with Baseball” and wonder if there are solutions for it as well.

    1. Stan, solutions to the exercises for some chapters of ABDR are available at
      For my teaching stats using baseball book, just send me an email at, explaining that you are not a student in a class and I can send you solutions.

  20. Thanks for the book. I have the 2nd edition. I have been stuck for several days on Section 11.7 on the line “This assumes that user has a MySQL option file preconfigured to connect to a database retrosheet”. I suppose this is needed for src_mysql_cnf("retrosheet") to run. I’m similarly stuck in 12.2.1 using statcastr on the line src_mysql_cnf("statcast"). I have MySQL on my machine and use it often. Can you point me to any references that would help me set up a “MySQL option file preconfigured to connect to a database retrosheet”?

    1. Yes! Here is the official documentation:

      Basically, the idea is that you keep your credentials in a file that is private to you, so that you can show how to connect to databases publicly without revealing your credentials.

      1. I ran mysql_config_editor and now have a mylogin.cnf file that contains
        user = localuser
        password = *****
        host = localhost
        I can now connect with src_mysql_cnf("retrosheet", groups = "local")
        Do you know how I can put DBI connection in the mylogin.cnf file so I could connect with just
        Thank you so much for your time

  21. Statcastr is not working for some 2019 months, however is working for others.
    I was able to run etl_extract(year=2019, month=6) & etl_extract(year=2019, month=7) just fine.
    All other 2019 months give errors. For example…
    etl_extract(year=2019, month=3) gives
    Error: Column `game_date` can’t be converted from character to Date
    etl_extract(year=2019, month=4) gives
    Error: Column `fielder_2...42` can’t be converted from numeric to character
    I believe the error is caused by “empty” dates, that is, dates with no Statcast output.
    Is there a way to refine the etl_extract() input so it skips over these dates, or give it a specific date, or specific date range?
    Thanks again!

  22. I’m still pretty new to R so I may be simply overlooking this, but at the beginning of Chapter 5 I’m having trouble getting the “data2016” csv file.

    The line I’m struggling with is on page 112: data <- read_csv("data/all2016.csv", col_names = pull(fields, Header), na = character())

    I'm getting the error message "Error: 'data/all2016.csv' does not exist in current working directory ", but I made sure every file related to this is in the right place.

    Again, I've looked through all of my downloaded files to see if it could be misplaced and I don't see it anywhere on Github. I've also tried re-downloading the files and still no luck. Any suggestions? Thanks!

    1. Zack, the reason why that all2016.csv file wasn’t there was that it was too large for Github. But since things have changed, I was successful in getting that file in the data folder. Download the files from Github and try it again. Jim

      1. I am also trying this code and get this error.

        See spec(...) for full column specifications.
        |===========================================================================| 100% 77 MB
        Warning: 182 parsing failures.
         row                    col           expected   actual          file
        3347 REMOVED_FOR_PR_RUN2_ID 1/0/T/F/TRUE/FALSE navad002 'all2016.csv'
        5180 REMOVED_FOR_PR_RUN2_ID 1/0/T/F/TRUE/FALSE cronc002 'all2016.csv'
        5185 REMOVED_FOR_PR_RUN3_ID 1/0/T/F/TRUE/FALSE mazan001 'all2016.csv'
        6233 REMOVED_FOR_PR_RUN2_ID 1/0/T/F/TRUE/FALSE calhk001 'all2016.csv'
        7319 REMOVED_FOR_PR_RUN2_ID 1/0/T/F/TRUE/FALSE hollm001 'all2016.csv'
        .... ...................... .................. ........ .............
        See problems(...) for more details.

        Based on some research, I think I have to fix this. Is this error reproducible on your end sir? Thanks.

      2. Stan:

        I got similar parsing errors, but I don’t think this is a problem for most of what you want to do with this data.


      3. Thank you sir.

  23. Dear Authors,
    I have the 2nd edition. Could you check the matrix in Ch 9, p. 209, entered as “RE” with first entry 0.47? I don’t recognize this as a matrix from Ch 5. The first entry in the expectancy matrix in Ch 5 is 0.50. Is this a typo?

  24. R newbie struggling with this error message when using mutate:

    Batting %>%
    + mutate(decade=10*floor(yearID/10))%>%
    + split(pull(.,decade))%>%
    + map_df(hr_leader, .id="decade")

    Get error message: “Error in mutate(decade = 10 * floor(yearID/10)) :
    object ‘yearID’ not found”

    Is there a package I’ve not installed?

    1. Rob, I just checked this code and it seems to work. You need to have the Lahman and dplyr packages installed. Maybe that is the issue. Good luck. Jim

      1. Something is amiss. I have Lahman, dplyr, magrittr, purrr all installed.

      2. Sometimes R gets confused with some dplyr commands — if you replace mutate with dplyr::mutate it may work.

      3. Like this?:
        Batting %>%
        + + + dplyr::mutate(decade=10*floor(yearID/10)) %>%
        + + + split(pull(.,decade)) %>%
        + + + map_df(hr_leader,.id="decade")

        Unfortunately, this returned same result.

      4. Rob: Are you copying and pasting directly from the book? Those “+ + +” signs shouldn’t appear. Jim

  25. I got it. I reinstalled all 4 of those packages, did the library() for each of them, and then reentered each of the snippets of code that defined each of the variables and function leading up to the troublesome step and it worked. Shout out to Jim for his time and patience. Thanks.

    1. Rob, I know it can be frustrating getting started. Glad it worked. Jim

  26. […] 4 of Analyzing Baseball Data with R doesn’t introduce a lot of new functions. Instead, it shows how one can synthesize what they’ve […]

  27. […] bizarre happened to me this week while reading Chapter 5 of Analyzing Baseball Data with R. After reading and re-reading a paragraph explaining how to set up some lines of code, I thought to […]

  28. […] phase, I may be coding, but I’m not creating. I have replicated a lot of the projects from Analyzing Baseball Data with R, but aside from looking at bases-loaded, no-outs situations and Barry Bonds in 0-2 counts, I […]

  29. When trying to scrape May 2016 Pitch Fx data using the code from 7.2, I get the error Error in function (type, msg, asError = TRUE) : Could not resolve host:

    Is this a PitchRx package error ?

    1. Lou, I haven’t tried the PitchRx package recently. I generally scrape the Statcast data available through baseball savant through the baseballr package. Jim

    2. Lou, I am getting the same issue now. How did you solve this?

      1. Hi! I couldn’t solve that issue (can’t really use the PitchRx package at all). But as a workaround to continue working with Chapter 7 I scraped the data from Statcast using baseballr.

        my_pitches <- rbind(scrape_statcast_savant("2016-05-01", "2016-05-10"),
        scrape_statcast_savant("2016-05-11", "2016-05-20"),
        scrape_statcast_savant("2016-05-21", "2016-05-30"),
        scrape_statcast_savant("2016-05-31", "2016-05-31"))

        I get nearly the same data. Note that variables pz and px are now plate_z and plate_x respectively.

  30. Patrick Mitchell

    hey Jim,
    extremely new into programming and I am working hard to learn this material. i am struggling on page 68,ch3 of version two. i keep getting the error message

    “Error in eval(lhs, parent, parent) : object ‘hof’ not found”

    when inputting the code:

    labels=c(“19th Century”,”Dead Ball”,”Lively Ball”,”Integration”,”Expansion”,”Free Agency”,”Long Ball”)))

    directly from the book. i have the hofbatting.csv file imported and the table is open. i have the Lahman, dbplyr, and ggplot2 packages installed. Not sure whats going wrong. Thank you for any help, Pat

    1. Hi Pat. It appears that you don’t have the object hof in your R workspace. You said that you imported it — maybe it is saved as hofbatting? I know it can be frustrating getting started, but as you get some progress, things will go better. Jim

  31. Jim, im back again! on page84 of version 2,
    I load the Lahman library, I type in the entire get_birthyear function straight from the book, and then I use that function to try to get the info for Ruth, Aaron, Bonds and A-Rod, but after inputting the bind_rows command I get a PlayerInfo table with no data available, which leads to a plot with no information. Am I missing something? Thank you for all of the help, Pat

    1. Patrick, by the way, all of the book code is found in the folder
      The bind_rows() function is from the dplyr package, so you need to load dplyr by the library(dplyr) command. Then that bind_rows() function should work — PlayerInfo is a data frame containing that birth info for the four players. Jim

  32. Great book. Having some issues with 3.7.1 in the 2nd edition. This is the code I’m typing then the error I’m getting below. Any advice? Thank you!

    get_birthyear <- function(Name) {Names <- unlist(strsplit(Name, " ")) Master %>% filter(nameFirst == Names [1], nameLast == Names [2]) %>% mutate(birthyear = ifelse(birthMonth >= 7, birthYear + 1, birthyear), Player = paste(nameFirst, nameLast)) %>% select(playerID, Player, birthyear)}

    Error: unexpected symbol in “get_birthyear <- function(Name) {Names <- unlist(strsplit(Name, " ")) Master"

    1. Scott, I see some typos in what you sent me. I would suggest going to where we show all of the R code for all of the chapters. Good luck. Jim
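
      The “unexpected symbol” error is what R reports when two statements share a line with no separator between them. A sketch of the function with the line breaks restored and birthYear capitalized consistently, run against a made-up Master table (the real table comes from the Lahman package; the rows and IDs below are invented for illustration):

      ```r
      library(dplyr)

      # Hypothetical two-row stand-in for the Lahman Master table
      Master <- data.frame(
        playerID   = c("exampal01", "samplbo01"),
        nameFirst  = c("Al", "Bo"),
        nameLast   = c("Example", "Sample"),
        birthYear  = c(1900, 1950),
        birthMonth = c(8, 3)
      )

      get_birthyear <- function(Name) {
        # This assignment sits on its own line, so it ends before the pipeline starts
        Names <- unlist(strsplit(Name, " "))
        Master %>%
          filter(nameFirst == Names[1], nameLast == Names[2]) %>%
          # Players born in July or later are assigned the following year
          mutate(birthyear = ifelse(birthMonth >= 7, birthYear + 1, birthYear),
                 Player = paste(nameFirst, nameLast)) %>%
          select(playerID, Player, birthyear)
      }

      al <- get_birthyear("Al Example")
      al
      ```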

  33. Jim, really enjoying the book and it is getting a lot easier for me to work through it now. Thank you for pointing out the chapter code in github. I have run into an issue in ch6. When copying and pasting the code from the book, everything works exactly as it should until i try to recreate figure 6.4 on pg154 of version 2. i get the error code “Error in FUN(X[[i]], …) : object ‘stat(level)’ not found”. originally i was able to see it, but there was an error that said “package: directlabels not found” so i went downloaded the package and got the error above. I unloaded the directlabels package but i still get that same error. Any suggestions? Thank you for all of the continued help. Pat

    1. I noticed that if I do not include the part of the code
      cabrera_plot %>% directlabels::direct.label(method = "bottom.pieces")
      I see the plot shown in figure 6.4 without the labels surrounding each circle (0.8, 0.6, 0.4, 0.2).

    2. Patrick, if you change stat(level) to ..level.. it should work (ggplot2 goes through changes). The directlabels package appears to work. Jim

      1. Jim, to confirm, should the code read like this?:

        cabrera_plot = 0, fit <= 1) +
        stat_contour(aes(z = fit, color = level),
        binwidth = 0.2) +
        scale_color_gradient(low = "white", high = crcblue)

        cabrera_plot %>%
        directlabels::direct.label(method = "bottom.pieces")


        I am getting same error message and do have directlabels installed and loaded.

      2. Rob, what type of error message are you getting? Jim

      3. Rob, I think you should use the argument

        color = ..level..


      4. Jim, that was it. I should have read your instruction more literally. Thanks.
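
        For readers on a recent ggplot2: since version 3.3.0 the documented way to refer to a computed variable is after_stat(level), which replaces both stat(level) and ..level... A sketch on a made-up surface (the grid and the fit values are invented and only stand in for the book's fitted model):

        ```r
        library(ggplot2)

        crcblue <- "#2905a1"

        # Hypothetical fitted surface standing in for the book's model predictions
        grid <- expand.grid(x = seq(-1, 1, length.out = 40),
                            y = seq(-1, 1, length.out = 40))
        grid$fit <- exp(-2 * (grid$x^2 + grid$y^2))

        # Color each contour line by the level computed by stat_contour()
        contour_plot <- ggplot(grid, aes(x = x, y = y)) +
          stat_contour(aes(z = fit, color = after_stat(level)), binwidth = 0.2) +
          scale_color_gradient(low = "white", high = crcblue)
        contour_plot
        ```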

  34. Jim, thank you for all of the help so far. I hope I am not abusing this site with all of my questions. I am getting the same issues as someone previous with section 7.2. The code ” files<-c("inning/inning_all.xml","inning/inning_hit.xml",
    connect=db$con,suffix=files) " returns the error: " Error in function (type, msg, asError = TRUE) :
    Could not resolve host: writefunction "
    Any suggestions?

    1. Patrick, I am not familiar with the data base work of Ben in that chapter — you could try asking Ben. It does seem that the pitchRx package has some issues — this data is better scraped from Baseball Savant using the baseballr package. Jim

  35. Thank you for the pointers. I have continued to enjoy working through the book and learning more and more with every chapter. I am getting an error now with 11.5.1. When putting in the code:

    chi_attendance %>%
    mutate(the_date = ymd(date),
    attendance = ifelse(attendance == 0, NA, attendance))

    I get the error:

    Error: Problem with `mutate()` input `the_date`.
    x cannot coerce type ‘closure’ to vector of type ‘character’
    ℹ Input `the_date` is `ymd(date)`

    Do you have any suggestions about this? I spoke with Ben and he said it seems R thinks I am using the data() function. do you agree, and if so is there a way around this? Thank you again, Pat

    1. Patrick, I’m not sure. I’m not particularly familiar with the code in this database chapter since Ben worked on it. I’d check to see if the variable date in the data frame chi_attendance is character-type. Perhaps you could try

      mutate(the_date = ymd(as.character(date)))

      in that code. Jim
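
      One possible reading of the “closure” message: if the data frame has no column called date, R falls back to a function of that name, and a function cannot be turned into a character vector. A base R sketch of both cases, using a made-up stand-in for chi_attendance (the column names and values are assumptions; names(chi_attendance) shows the real ones):

      ```r
      # With no variable or column called `date` in scope, the name resolves to base R's date() function
      msg <- tryCatch(as.character(date), error = function(e) conditionMessage(e))
      msg

      # Hypothetical stand-in for chi_attendance with the date stored as a yyyymmdd number
      chi_attendance <- data.frame(date = c(20160405, 20160406),
                                   attendance = c(0, 35000))

      # Convert through character so the digits are read as year, month, day
      chi_attendance$the_date <- as.Date(as.character(chi_attendance$date),
                                         format = "%Y%m%d")

      # Treat a recorded attendance of 0 as missing
      chi_attendance$attendance <- ifelse(chi_attendance$attendance == 0,
                                          NA, chi_attendance$attendance)
      chi_attendance
      ```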

  36. Is the statcast2017.csv from chapter 7 available anywhere or does it have to be scraped? I dont seem to see it on github.

    1. No, sorry, there are limits to the size of files that I can place on Github. But you can scrape it using the baseballr package.

      1. I forgot that you can obtain some of these extra datasets, including statcast2017.csv, from the ABWRdata package at

  37. Hey all,

    I am relatively new to R and I am struggling with inserting the column headers for the Retrosheet data.

    If anyone could provide further insight on how to handle this it would be much appreciated. I have downloaded the Retrosheet data from their website into a folder on my computer, however moving this process along has proven to be challenging for a beginner like myself. If someone could provide further guidance on how to get the column headers into R for the Retrosheet data it would be greatly appreciated.

    Thank you!

    1. Lloyd, if you navigate to our book’s Github site at , you will find a file fields.csv which provides the column headers of the Retrosheet play-by-play files. Also this folder contains several Retrosheet files all1998.csv and all2016.csv for the 1998 and 2016 seasons. Hope this helps. Jim
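
      A sketch of how the headers can be attached, using two tiny temporary files in place of fields.csv and a Retrosheet play-by-play file. The file contents are invented, and the only assumption about fields.csv is that it has a column named Header:

      ```r
      # Made-up stand-in for fields.csv: one column name per row
      fields_path <- tempfile(fileext = ".csv")
      writeLines(c("Header", "GAME_ID", "INN_CT", "EVENT_CD"), fields_path)

      # Made-up stand-in for a play-by-play file, which has no header row
      pbp_path <- tempfile(fileext = ".csv")
      writeLines(c("ANA201604040,1,2", "ANA201604040,1,23"), pbp_path)

      fields <- read.csv(fields_path, stringsAsFactors = FALSE)

      # header = FALSE keeps the first data row; col.names supplies the names
      pbp <- read.csv(pbp_path, header = FALSE,
                      col.names = fields$Header, stringsAsFactors = FALSE)
      pbp
      ```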

  38. Jim,

    Thank you for your guidance! I was able to upload the column headers for the play by play files relatively easily!

    I quickly realized that a different set of column headers would be necessary for the Game Log files. However, when I tried to replicate this process I was unable to achieve this.

    I denoted the … as those were my working directories within R. I continued to receive the error listed below. Let me know if you catch anything here.

    Thanks again,

    > GL2013 View(GL2013)
    > gl.headers names(GL2013) <- gl.headers[,"Header"]

    Error in `[.data.frame`(gl.headers, , "Header") :
    undefined columns selected

    1. Lloyd, the game log headers are found in the file game_log_headers.csv at

      Something like

      names(GL2013) <- names(game_log_headers)

      should work once you've imported that csv file.


      By the way, you are welcome to participate in my R workshop that is coming up a week from Friday — I'll mention the particulars early next week.

  39. Jim,

    Thanks for that additional note there, was able to resolve that quickly as well.

    Thanks again and I would love to attend that workshop. Sounds like a great opportunity.

    Let me know where I can seek additional details.


    1. Lloyd: I posted information about the meeting at
      There are two zoom rooms that you sign up for. Jim

      1. Jim,

        Thank you for your continued guidance.


  40. Mr. Albert,
    Good Evening. I can try it out, but for the MySQL chapter, since I upgraded to Mac OSX 11 Big Sur and the MySQL page has up to 10.15, what would you do aside from downgrading, waiting etc.? Thanks.

  41. Hi all,
    I’m new to R, like really new, and all my statistical knowledge comes via Stata so I’m trying to learn R via this book both to switch to the industry standard and learn R.
    I’m having a lot of issues with Chapter 7, however.
    I can’t acquire the pitch-level data.
    I’m either too dense to follow along with the GitHub suggestions or I’m just broken.

    db <- src_sqlite("data/pitchrx.sqlite", create = TRUE)
    Returns error: Error: Could not connect to database:
    unable to open database file
    In addition: Warning message:
    `src_sqlite()` is deprecated as of dplyr 1.0.0.
    Please use `tbl()` directly with a database connection

    my_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)
    files <- c("inning/inning_all.xml", "inning/inning_hit.xml",
    "miniscoreboard.xml", "players.xml")
    scrape(start = "2016-05-01", end = "2016-05-31",
    connect = my_db$con, suffix = files)
    returns: Error in function (type, msg, asError = TRUE) :

    I just…have no idea what's going on. Should I be using the mlbgameday package instead?
    Sorry if this has been answered–maybe I just don't know where/how to look
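
    A note on the first error above: “unable to open database file” is often a sign that the target folder (here data/) does not exist in the working directory, and the deprecation warning points toward opening the connection with DBI directly. A minimal sketch under those assumptions, writing to a temporary folder and using a made-up pitches table (it needs the DBI and RSQLite packages):

    ```r
    library(DBI)

    # Make sure the folder exists before SQLite tries to create the file in it
    db_dir <- file.path(tempdir(), "data")
    dir.create(db_dir, showWarnings = FALSE)

    # Open (or create) the SQLite database through DBI instead of src_sqlite()
    con <- dbConnect(RSQLite::SQLite(), file.path(db_dir, "pitchrx.sqlite"))

    # Hypothetical two-row table, only to show the connection works
    dbWriteTable(con, "pitches",
                 data.frame(px = c(-0.3, 0.5), pz = c(2.1, 3.0)),
                 overwrite = TRUE)
    tables <- dbListTables(con)
    n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM pitches")$n

    dbDisconnect(con)
    ```

    This does not address the scraping error itself, which other commenters in this thread also report.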

    1. Hi! I couldn’t solve that issue (can’t really use the PitchRx package at all). But as a workaround to continue working with Chapter 7 I scraped the data from Statcast using baseballr. With the following code

      my_pitches <- rbind(scrape_statcast_savant("2016-05-01", "2016-05-10"),
      scrape_statcast_savant("2016-05-11", "2016-05-20"),
      scrape_statcast_savant("2016-05-21", "2016-05-30"),
      scrape_statcast_savant("2016-05-31", "2016-05-31"))

      I get nearly the same data. Note that variables pz and px are now plate_z and plate_x respectively.

      1. I cannot seem to install baseballr. “Object not found”. RStudio tells me I have the most recent version. What else might it be? Absolutely loving this book.

      2. Rob, have you tried to reinstall baseballr from Bill Petti’s github site? Jim

      3. No, admittedly did not know that was the source. Would not know how either. It’s ok, I’ll move along. Thanks, Jim.

  42. I see a number of people having the same issue with pitchRx:

    I worry that something has changed on the data provider side.

  43. Anyone else have trouble locating “hofbatting.csv”? No matter which Lahman csv collection I pull down I do not see this file in the collection of comma delimited files. Thanks in advance.

    1. Welp, figured out the raw button in Github gave me access to the comma-delimited data. Copied that, pasted it into notepad, saved as hofbatting.csv and was able to read into R to create the hof file as directed on page 67.

      1. Rob, glad you figured it out. Actually you should be able to download all of the files on the Github site as a zip file — this might help. Jim

      2. Managed to figure that out too. Thanks.

  44. Morning, has anyone come across this issue when working on graphics? Up until this point I have had no issues duplicating the graphics in the book:

    Error in library(ggrepel) : there is no package called ‘ggrepel’



    1. Never mind. Mystery solved.

  45. Feel like I am hijacking this site (sorry). Currently on 6.3.3 and after entering the below code I get the following error message: “Error in is_reference(x, quote(expr = )) : object ‘crc_3’ not found”

    Missing package I presume?

    k_zone_plot %+% filter(ump_count_fits, fit 0.4) + geom_contour(aes(z = fit, color = count, linetype = count), binwidth = 0.1) + scale_color_manual(values = crc_3)

    1. Rob, there is a file global_config.R that contains many of those variables like crc_3. It is in the same folder where you found the chapter code. You should source("global_config.R") first. Jim

  46. Thank you for a great book! I am now in Chapter 6 (Second Edition). For me reading this through e-book includes also learning baseball terms and abbreviations, because in my home country Finland baseball is almost non-existent. Instead we have game called pesäpallo, which in many ways resembles its American ancestor, because Finnish Tahko Pihkala developed it from baseball.

    I am using the book to learn efficient use of R, and it is a great help. I have typed the code from the book into RStudio and normally got exactly the same or very similar values. But in Chapter 6 there are big differences. I have a question about them.

    A. This is part of the code for c22:
    c22 = grepl(paste0("^", b, b, s, s, "|", b, s, b, s,

    If you check for example c22[42] from pseq, it is “CBBBSFFB”, so on my understanding c22[42] should be FALSE, but with code like A in which there is only one “^”, c22[42] is TRUE.

    I was thinking, that maybe if you don’t write “^” after every “|”, it is not concatenated?

    B. Should it be:
    c22 = grepl(paste0(“^”, b, b, s, s, “|”,”^”, b, s, b, s,

    With code B the truth value changes: c22[42] is FALSE as it should be. If this is right, then “^” should be repeated after every “|”. I must say that concatenation in R is new to me, so thanks for help!
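The behavior described above can be reproduced on a tiny example. In a regular expression, `|` has the lowest precedence, so a leading `^` anchors only the first alternative. The sketch below uses simplified pitch classes and a made-up three-element `pseq` (both are assumptions for illustration, not the book's definitions) to test for a 1-1 count after two pitches.

```r
# Simplified pitch classes (assumptions, not the book's definitions).
b <- "B"       # a ball
s <- "[CSF]"   # a called strike, swinging strike, or foul

# Made-up sequences; "BBCB" starts 2-0 and never passes through 1-1.
pseq <- c("BCX", "CBX", "BBCB")

# Only the first alternative is anchored to the start of the string.
unanchored <- paste0("^", b, s, "|", s, b)
# The anchor is repeated after "|".
repeated <- paste0("^", b, s, "|^", s, b)
# Equivalent: group the alternatives and anchor the group once.
grouped <- paste0("^(", b, s, "|", s, b, ")")

grepl(unanchored, pseq)  # TRUE TRUE TRUE: "CB" is found later in "BBCB"
grepl(repeated, pseq)    # TRUE TRUE FALSE
grepl(grouped, pseq)     # TRUE TRUE FALSE
```

Grouping with parentheses avoids having to repeat "^" in front of every alternative.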

    1. Mika, thanks for your interest in the book. I think you’re the first person from Finland to comment. My coauthor Ben wrote Chapter 6 and he’s more familiar with regular expressions than I am. But I thought you might appreciate a different way of constructing the variables c11, c22, etc. I’ve posted a function setup_work() on my Github Gist site.
      In the meantime, I’ll check into your particular question. Best. Jim

      1. Thank you, Jim. With your code c22[42] is FALSE, as it should be. Using your code I get the same values for Figure 6.2 (page 147) that I got earlier using this with the repeated "^":

        c22 = grepl(paste0("^", b, b, s, s, "|", "^", b, s, b, s,

        For example the value 2,2 on figure 6.2 is -0.039 instead of -0.025. On page 149 when decomposing 2,2 to different orders 0-2, 1-1 and 2-0, the mean run value of 2-0 is now -0.042 instead of 0.006, which is slightly smaller than after 1-1 (-0.040) and 0-2 (-0.034).

    2. Hi Mika, thanks for sending this comment. I think you are right. The regular expression that we generated is not quite right, and this is a good test case. Adding the anchor character (“^”) would test the sequences from the beginning, which is better than what we have in the book.

      We’ll fix this for the 3rd edition!

      1. Dear Mr. Baumer,
        Thank you for the book and sorry for piggybacking, but what is the ETA for the 3rd edition, given that the 2nd edition was released relatively recently?

      2. Thank you, Ben! Your book is a great help in learning more about R. The amount of statistics in baseball is huge, and we like that. 🙂

  47. Thanks for checking that. I had not tried to reconcile the regular expression work with my code. My code runs slower, but to me it is easier to follow. Jim

    1. I have to say that I remembered the magnitude, but now I checked exact values. They are not exactly the same:

      With code in the book and repeated “^” I get in c22:
      0-2 (-0.0310), 1-1 (-0.0376) and 2-0 (-0.0423)

      With your code I get in c22:
      0-2 (-0.0343), 1-1 (-0.0397) and 2-0 (-0.0416)

      In the book, for example, 2-0 is 0.00606. In my last message I remembered getting -0.04 something, and your value has the same magnitude (which means that at 2-0 the mean run value is not bigger but slightly smaller than at 0-2 or 1-1). Thank you, I will use your code in calculations as I continue reading the book!

  48. Thank you for the great book! As someone who is about to graduate from college and is pursuing a career in sports analytics, this book has been amazing. I do have one problem that I was hoping someone could help me with. In chapter 7 section 4, I am having some trouble. The following is my code:

    # Modeling Called Strike Percentage
    strike_mod <- gam(type == "S" ~ s(plate_x, plate_z),
                      family = binomial, data = taken)

    # Compute fitted values
    hats <- strike_mod %>%
      augment(type.predict = "response")

    # Update k_zone_plot
    k_zone_plot %+% sample_n(hats, 50000) +
      geom_point(aes(color = .fitted), alpha = 0.1) +
      scale_color_gradient(low = "gray70", high = crcblue)

    but the error I keep getting is:

    Error in eval(predvars, data, env) : object ‘plate_x’ not found
    In addition: Warning messages:
    1: Tidiers for objects of class gam are not maintained by the broom team, and are only supported through the glmlm tidier method. Please be cautious in interpreting and reporting broom output.
    2: In predict.gam(x, newdata, type = type.predict) :
    not all required variables have been supplied in newdata!

    (I wasn’t able to use PitchRx, so I followed someone’s advice from above and scraped the data from Baseball Savant, which is why the variables are plate_x and plate_z.) Any guidance or advice would be greatly appreciated.

    1. Jacqueline:

      I agree there is some confusion since the PitchRx data isn’t available. Also the error message indicates some issue using the broom package. Starting with some Statcast data that I call statcast2020, here is some revised script for Sections 7.3 through 7.4.2 that I have posted at

      This should work for you, but let me know if you still have some issues.

      We’re glad you are finding the book helpful.
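Independently of the revised script mentioned above, one way to sidestep the broom warning is to compute the fitted probabilities with predict() directly. The sketch below is only an illustration under stated assumptions: the `taken` data frame is simulated here (it is not Statcast data), and only the column names plate_x, plate_z, and type come from the question.

```r
library(mgcv)

set.seed(1)
n <- 2000

# Simulated stand-in for `taken` (hypothetical values, not real pitches).
taken <- data.frame(
  plate_x = runif(n, -2, 2),
  plate_z = runif(n, 0.5, 4.5)
)
# Make called strikes more likely near the middle of the zone.
p_strike <- plogis(3 - 2 * taken$plate_x^2 - 2 * (taken$plate_z - 2.5)^2)
taken$type <- ifelse(runif(n) < p_strike, "S", "B")

# Same model form as in the question.
strike_mod <- gam(type == "S" ~ s(plate_x, plate_z),
                  family = binomial, data = taken)

# Fitted strike probabilities, attached to the original rows.
hats <- taken
hats$.fitted <- as.numeric(
  predict(strike_mod, newdata = taken, type = "response")
)
```

Because `hats` keeps plate_x and plate_z alongside `.fitted`, it can be passed to a plot that maps those columns to the x and y axes.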

