Attending the SABR Jack Graney Chapter Winter Meeting
I recently attended a SABR chapter meeting where one of the speakers was Chuck Smith, who pitched for the Florida Marlins during the 2000 and 2001 seasons. Although Smith had a short tenure in MLB, he had a long playing career with minor-league teams and with teams in South Korea and Taiwan. His talk was an interesting story about endurance and patience, and Smith was very grateful for his opportunities playing baseball. It seemed that baseball provided good preparation for his future career in government and business.
Smith made a couple of comments during his presentation that were particularly interesting to me. Since he played during the so-called Steroids Era, he was asked if any of the big sluggers (Bonds, Sosa, and McGwire) hit home runs against him. Smith said that he was pretty successful pitching against these big sluggers and that the less powerful batters were actually the ones who hit home runs against him. (His off-speed pitches were effective against the sluggers.)
That raises the question — was Smith telling the truth? Who hit home runs against Chuck Smith and how does this group compare to the group of hitters who did not hit home runs?
I’ll use this question to illustrate how the piping operator %>% is useful for implementing the step-by-step operations in data manipulation. This piping operator was introduced in the magrittr package and is currently incorporated in the tidyverse collection of packages. After I answer the question about Chuck Smith, I’ll show how this piping operator is helpful in communicating learning by Bayesian methods.
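As a quick illustration of the idea, here is a minimal sketch (my own toy example, not part of the Smith analysis) that assumes the magrittr package is installed. It computes the same quantity twice, once with nested function calls and once with a pipe.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested form: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: read from top to bottom, one step per line.
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

The pipe passes the result on its left as the first argument of the function on its right, so the steps appear in the order they are carried out.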
Who Hit Home Runs Against Chuck Smith?
First I collect the Retrosheet play-by-play data for the 2000 and 2001 seasons, stored in the variable pbp_2_seasons. Also, from the Master data frame in the Lahman package, I find Smith’s retroID, stored in the value CS$retroID.
To find the batters who hit a home run against Smith, (1) I use the filter operation to look at only the plays where a home run was hit and Smith was the pitcher, and (2) I collect (select) the BAT_ID values for those batters.
pbp_2_seasons %>%
  filter(EVENT_CD == 23, PIT_ID == CS$retroID) %>%
  select(BAT_ID) -> CS_batters
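Since the Retrosheet files are large, it can help to try the two steps on a tiny data frame first. The sketch below assumes dplyr is installed; toy_pbp, toy_CS and all of the IDs in them are invented for illustration and are not real Retrosheet records.

```r
library(dplyr)

# Hypothetical stand-in for the play-by-play data (invented IDs and events).
# EVENT_CD 23 marks a home run; 2 stands in for any non-home-run play.
toy_pbp <- data.frame(
  PIT_ID   = c("smitc001", "smitc001", "otherp01", "smitc001"),
  BAT_ID   = c("battA001", "battB001", "battA001", "battC001"),
  EVENT_CD = c(23, 2, 23, 23),
  stringsAsFactors = FALSE
)

# Plays the role of CS, the row holding the pitcher's retroID.
toy_CS <- data.frame(retroID = "smitc001", stringsAsFactors = FALSE)

# Same two verbs as above: keep the pitcher's home runs, then keep BAT_ID.
toy_pbp %>%
  filter(EVENT_CD == 23, PIT_ID == toy_CS$retroID) %>%
  select(BAT_ID) -> toy_CS_batters

toy_CS_batters$BAT_ID
```

Only the two invented batters who homered off the invented pitcher survive the filter; the home run hit off the other pitcher is dropped.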
Next, I want to create a data frame containing all batters who had at least 50 at-bats during the 2000-2001 seasons with an indicator if each batter hit a home run against Smith or not.
Here’s what I did starting with the Retrosheet dataset:
- I only consider plays where there was an official at-bat.
- I group the data by the batter.
- For each batter, I collect the number of at-bats and the count of home runs.
- I restrict attention to the batters with at least 50 AB.
- By use of the new variable Smith, I record if each batter hit a home run against Smith or not.
- I merge the data with the nameLast field of the Master data frame (this will be helpful for putting meaningful labels on the graph to follow).
These operations correspond, line by line, to the R code below, which ties them together with a series of pipes.
pbp_2_seasons %>%
  filter(AB_FL == TRUE) %>%
  group_by(BAT_ID) %>%
  summarize(HR = sum(EVENT_CD == 23), AB = n()) %>%
  filter(AB >= 50) %>%
  mutate(Smith = ifelse(BAT_ID %in% CS_batters$BAT_ID, "YES", "NO")) %>%
  inner_join(select(Master, retroID, nameLast), by = c("BAT_ID" = "retroID")) -> S50
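To see what each step of this pipe does, the same chain can be run on a small made-up data set where the answer is easy to check by eye. This is only a sketch, assuming dplyr is installed; toy_pbp, toy_CS_batters, toy_Master and their contents are hypothetical.

```r
library(dplyr)

# Hypothetical data: batter "a" has 60 at-bats, "b" has 55, "c" only 10.
toy_pbp <- data.frame(
  BAT_ID   = rep(c("a", "b", "c"), times = c(60, 55, 10)),
  EVENT_CD = 2,
  AB_FL    = TRUE,
  stringsAsFactors = FALSE
)
# Rows 1-3 belong to "a" and row 61 to "b": three home runs and one home run.
toy_pbp$EVENT_CD[c(1, 2, 3, 61)] <- 23

# Pretend only batter "a" homered off the pitcher of interest.
toy_CS_batters <- data.frame(BAT_ID = "a", stringsAsFactors = FALSE)
toy_Master <- data.frame(
  retroID  = c("a", "b", "c"),
  nameLast = c("Alpha", "Bravo", "Charlie"),
  stringsAsFactors = FALSE
)

toy_pbp %>%
  filter(AB_FL == TRUE) %>%
  group_by(BAT_ID) %>%
  summarize(HR = sum(EVENT_CD == 23), AB = n()) %>%
  filter(AB >= 50) %>%
  mutate(Smith = ifelse(BAT_ID %in% toy_CS_batters$BAT_ID, "YES", "NO")) %>%
  inner_join(select(toy_Master, retroID, nameLast),
             by = c("BAT_ID" = "retroID")) -> toy_S50

toy_S50
```

Batter "c" is dropped by the 50 at-bat cutoff, and the remaining two rows carry the home run count, the at-bat count, the Smith indicator and the last name.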
Below I construct parallel boxplots of the home run rates of batters who hit and did not hit home runs against Smith.
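A plot of this kind can be sketched by piping a data frame straight into ggplot2. The example below assumes dplyr and ggplot2 are installed and uses a small stand-in for S50 whose values are invented; the HR / AB rate and the axis labels are my assumptions about how such a plot might be set up, not the exact code behind the figure.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for S50 with invented values.
toy_S50 <- data.frame(
  nameLast = c("Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot"),
  HR       = c(30, 22, 41, 5, 12, 8),
  AB       = c(500, 450, 520, 300, 480, 410),
  Smith    = c("YES", "YES", "YES", "NO", "NO", "NO"),
  stringsAsFactors = FALSE
)

# Compute the home run rate, then pipe the result into ggplot().
# Note that layers are still added with +, not with the pipe.
p <- toy_S50 %>%
  mutate(HR_rate = HR / AB) %>%
  ggplot(aes(x = Smith, y = HR_rate)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Hit a home run against Smith?", y = "Home run rate (HR / AB)")

print(p)
```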
What do we see?
- Despite Smith’s statement, Barry Bonds did hit a home run against him, but not Sosa or McGwire.
- The players who hit home runs against Smith tended to have higher home run rates than those who didn’t. Actually that is to be expected: hitting a home run against a specific pitcher is some evidence that the batter is an above-average home run hitter.
- An additional check indicated that Chuck Smith’s rate of allowing home runs was actually lower than that of the average starter in those two seasons.
In the above work, I illustrated piping using the dplyr data manipulation verbs. But actually piping can be done using any functions of interest.
In the TeachBayes package, I have several functions helpful for communicating Bayes’ rule. The bayesian_crank function will compute posterior probabilities for a data frame with columns Prior and Likelihood, and the prior_post_plot function will graphically compare prior and posterior distributions.
Suppose we have a hitter with 30 hits in 100 at-bats for an observed AVG of .300. What have we learned about his true batting average P?
- I construct a discrete uniform prior over the set .10, .11, …, .40.
- I compute the likelihood, which is the binomial probability of 30 hits in 100 at-bats viewed as a function of P.
- Then I compute the posterior probabilities.
- I compare the prior and posterior probabilities graphically.
library(TeachBayes)
library(dplyr)
data.frame(p = seq(.10, .40, by = .01), Prior = rep(1/31, 31)) %>%
  mutate(Likelihood = dbinom(30, size = 100, prob = p)) %>%
  bayesian_crank() %>%
  prior_post_plot()
Although this batter had a .300 AVG, we see that his true hitting probability can be anywhere from .250 to .350.
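For readers who want to see the arithmetic behind the posterior, here is a base R sketch of the same grid calculation done by hand. It reflects my understanding of what a "Bayesian crank" does (multiply prior by likelihood, then rescale to sum to one); it is an illustration, not the TeachBayes source code, and the variable names are my own.

```r
# Grid of candidate batting averages and a discrete uniform prior.
p     <- seq(0.10, 0.40, by = 0.01)
prior <- rep(1 / length(p), length(p))

# Binomial likelihood of 30 hits in 100 at-bats at each value of p.
likelihood <- dbinom(30, size = 100, prob = p)

# Bayes' rule on a grid: multiply, then rescale so the result sums to 1.
product   <- prior * likelihood
posterior <- product / sum(product)

# Most probable value on the grid.
p[which.max(posterior)]

# Posterior probability that p lies between .250 and .350 (inclusive).
sum(posterior[p > 0.245 & p < 0.355])
```

The last line gives a rough numerical check on how much of the posterior actually falls in the .250 to .350 range.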
What is the advantage of piping data manipulation operations using %>%?
- I agree with the magrittr author that the piping R code is more intuitive to read and write.
- In particular, piping seems preferable to the use of multiple parentheses to perform the composition of multiple functions.
- Piping is useful for functions that accept a data frame as the first argument or for functions (such as lm) that can use a data = . argument.
- It is easy to modify the execution order of the multiple functions by simply cutting and pasting lines of code. Also it is easy to add or subtract functions in the pipe.
- The usefulness of piping goes beyond data manipulation: one can pipe into ggplot2 functions or pipe with functions that are user-defined.
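A couple of these points can be sketched in a few lines. The example below assumes magrittr is installed and uses the built-in mtcars data rather than baseball data; hr_rate is a hypothetical helper written only for this illustration.

```r
library(magrittr)

# Nested composition versus a pipe; both fit the same regression.
fit_nested <- coef(lm(mpg ~ wt, data = subset(mtcars, cyl == 4)))

fit_piped <- mtcars %>%
  subset(cyl == 4) %>%
  lm(mpg ~ wt, data = .) %>%   # lm takes the data through data = .
  coef()

all.equal(fit_nested, fit_piped)

# A user-defined function also works in a pipe, as long as its
# first argument is the incoming object.
hr_rate <- function(d, hr, ab) d[[hr]] / d[[ab]]

rates <- data.frame(HR = c(30, 5), AB = c(500, 250)) %>%
  hr_rate("HR", "AB")

rates
```

In the piped version, reordering or removing a step means moving or deleting a single line, which is harder to do safely inside nested parentheses.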