### Attending the SABR Jack Graney Chapter Winter Meeting

I recently attended a SABR chapter meeting where one of the speakers was Chuck Smith, who pitched for the Florida Marlins during the 2000 and 2001 seasons. Although Smith had a short tenure in MLB, he had a long playing career with minor-league teams and with teams in South Korea and Taiwan. His talk was an interesting story about endurance and patience, and Smith was very grateful for his opportunities playing baseball. It seemed that baseball provided good preparation for his later career in government and business.

Smith made a couple of comments during his presentation that were particularly interesting to me. Since he played during the so-called Steroids Era, he was asked if any of the big sluggers (Bonds, Sosa, and McGwire) hit home runs against him. Smith said that he was pretty successful pitching against these big sluggers and that it was actually the less powerful batters who hit home runs against him. (His off-speed pitches were effective against the sluggers.)

That raises the question — was Smith telling the truth? Who hit home runs against Chuck Smith and how does this group compare to the group of hitters who did not hit home runs?

I’ll use this question to illustrate how the piping operator `%>%` is useful for implementing the step-by-step operations in data manipulation. This piping operator was introduced in the `magrittr` package and is currently incorporated in the `tidyverse` collection of packages. After I answer the question about Chuck Smith, I’ll show how this piping operator is helpful in communicating learning by Bayesian methods.

### Who Hit Home Runs Against Chuck Smith?

First I collect the Retrosheet play-by-play data for the 2000 and 2001 seasons, stored in the variable `pbp_2_seasons`. Also, from the `Master` data frame in the Lahman package, I find Smith’s `retroID`, stored in the value `CS$retroID`.

To find the batters who hit a home run against Smith, (1) I use the `filter` operation to look at only the plays where a home run was hit and Smith was the pitcher, and (2) I collect (`select`) the `BAT_ID` values for those batters.

    pbp_2_seasons %>%
      filter(EVENT_CD == 23, PIT_ID == CS$retroID) %>%
      select(BAT_ID) -> CS_batters

Next, I want to create a data frame containing all batters who had at least 50 at-bats during the 2000-2001 seasons, with an indicator of whether each batter hit a home run against Smith.

Here’s what I did starting with the Retrosheet dataset:

- I only consider plays where there was an official at-bat.
- I group the data by the batter.
- For each batter, I collect the number of at-bats and the count of home runs.
- I restrict attention to the batters with at least 50 AB.
- By use of the new variable `Smith`, I record whether each batter hit a home run against Smith or not.
- I merge the data with the `nameLast` field of the `Master` data frame (this will be helpful for putting meaningful labels on the graph to follow).

These operations correspond, line by line, to the R code below that ties these operations together with a series of pipes.

    pbp_2_seasons %>%
      filter(AB_FL == TRUE) %>%
      group_by(BAT_ID) %>%
      summarize(HR = sum(EVENT_CD == 23), AB = n()) %>%
      filter(AB >= 50) %>%
      mutate(Smith = ifelse(BAT_ID %in% CS_batters$BAT_ID, "YES", "NO")) %>%
      inner_join(select(Master, retroID, nameLast),
                 by = c("BAT_ID" = "retroID")) -> S50
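For readers without the Retrosheet files at hand, here is a small sketch of the same two pipelines on made-up data. Everything prefixed with `toy_` is hypothetical (the player IDs, names, and plays are invented), and the at-bat cutoff is lowered to 1 so that the tiny data set is not filtered away. It is only meant to show how each verb changes the data, not to reproduce the real results.

```r
library(dplyr)

# Hypothetical play-by-play rows (invented, not real Retrosheet records).
# As in the post's code, EVENT_CD == 23 marks a home run and AB_FL flags
# an official at-bat; the other event codes are arbitrary non-HR values.
toy_pbp <- data.frame(
  BAT_ID   = c("aaaa001", "aaaa001", "aaaa001", "bbbb001", "bbbb001", "cccc001"),
  PIT_ID   = c("pppp001", "qqqq001", "pppp001", "pppp001", "qqqq001", "pppp001"),
  EVENT_CD = c(23, 2, 14, 2, 23, 20),
  AB_FL    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the two Master columns used in the join.
toy_master <- data.frame(
  retroID  = c("aaaa001", "bbbb001", "cccc001"),
  nameLast = c("Able", "Baker", "Clark"),
  stringsAsFactors = FALSE
)

toy_pitcher <- "pppp001"  # made-up ID playing the role of CS$retroID

# Step 1: batters who homered off the pitcher of interest.
toy_hr_batters <- toy_pbp %>%
  filter(EVENT_CD == 23, PIT_ID == toy_pitcher) %>%
  select(BAT_ID)

# Step 2: one row per batter with AB, HR, the YES/NO indicator, and a name.
toy_summary <- toy_pbp %>%
  filter(AB_FL == TRUE) %>%
  group_by(BAT_ID) %>%
  summarize(HR = sum(EVENT_CD == 23), AB = n()) %>%
  filter(AB >= 1) %>%   # the post uses AB >= 50 on the real data
  mutate(Smith = ifelse(BAT_ID %in% toy_hr_batters$BAT_ID, "YES", "NO")) %>%
  inner_join(toy_master, by = c("BAT_ID" = "retroID"))

print(toy_summary)
```

Note that batter `bbbb001` has a home run in the toy data but is still marked `"NO"`, because that home run came off a different pitcher.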

Below I construct parallel boxplots of the home run rates of batters who hit and did not hit home runs against Smith.
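The plotting code is not shown in this post, so the following is only a guess at how such a display could be built with `ggplot2`. It assumes a data frame shaped like `S50` (columns `HR`, `AB`, `Smith`, `nameLast`); the `toy_S50` values below are invented purely so the code runs.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for S50 (made-up names and counts).
toy_S50 <- data.frame(
  nameLast = c("Able", "Baker", "Clark", "Davis", "Evans", "Frank"),
  HR       = c(30, 12, 5, 22, 8, 2),
  AB       = c(500, 450, 300, 520, 410, 150),
  Smith    = c("YES", "NO", "NO", "YES", "NO", "NO"),
  stringsAsFactors = FALSE
)

# Home run rate is home runs per at-bat.
toy_rates <- toy_S50 %>%
  mutate(HR_rate = HR / AB)

# Pipe the data straight into ggplot(); layers are then added with `+`.
p <- toy_rates %>%
  ggplot(aes(x = Smith, y = HR_rate)) +
  geom_boxplot() +
  labs(x = "Hit a home run against Smith?",
       y = "Home run rate (HR / AB)")

print(p)
```

The toy numbers are only there to make the code run; the observations that follow are about the real 2000-2001 data.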

What do we see?

- Despite Smith’s statement, Barry Bonds did hit a home run against him, but not Sosa or McGwire.
- The players that hit home runs against Smith tended to have higher home run rates than those who didn’t hit home runs against Smith. Actually that is to be expected since if one did hit a home run against a specific pitcher, this is some evidence that the batter is above average in hitting home runs.
- I did an additional check that indicated that Chuck Smith’s rate of allowing home runs was actually lower than that of the average starter in those two seasons.

### Bayesian Piping

In the above work, I illustrated piping using the `dplyr` data manipulation verbs. But actually piping can be done using any functions of interest.
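As a quick illustration of that point, here is a minimal sketch that pipes a data frame through two small user-defined functions. The helpers `add_avg` and `top_hitter` and the toy data are hypothetical, written only for this example; the one requirement is that each function takes the data as its first argument and returns something the next function can accept.

```r
library(magrittr)

# Hypothetical helper: add a batting average column to a data frame.
add_avg <- function(d) {
  d$AVG <- d$H / d$AB
  d
}

# Hypothetical helper: keep the row with the highest batting average.
top_hitter <- function(d) {
  d[which.max(d$AVG), ]
}

# Made-up hitting lines for three players.
toy_hitters <- data.frame(
  Player = c("A", "B", "C"),
  H      = c(30, 48, 20),
  AB     = c(100, 150, 80),
  stringsAsFactors = FALSE
)

# Read left to right: start with the data, add AVG, then pick the leader.
best <- toy_hitters %>%
  add_avg() %>%
  top_hitter()

print(best)
```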

In my `TeachBayes` package, I have several functions helpful for communicating Bayes’ rule. The `bayesian_crank` function will compute posterior probabilities for a data frame with columns `P`, `Prior`, and `Likelihood`, and the `prior_post_plot` function will graphically compare prior and posterior distributions.

Suppose we have a hitter with 30 hits in 100 at-bats for an observed AVG of .300. What have we learned about his true batting average P?

- I construct a discrete uniform prior over the set .1, .11, …, .40.
- I compute the likelihood which is the binomial probability of 30 hits in 100 at-bats viewed as a function of P.
- Then I compute the posterior probabilities.
- Finally, I compare the prior and posterior probabilities graphically.

    library(TeachBayes)
    library(dplyr)

    data.frame(p = seq(.10, .40, by = .01),
               Prior = rep(1/31, 31)) %>%
      mutate(Likelihood = dbinom(30, size = 100, p)) %>%
      bayesian_crank() %>%
      prior_post_plot()

Although this batter had a .300 AVG, we see that his true hitting probability could plausibly be anywhere from about .250 to .350.
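To make the "crank" step less of a black box, here is a base R sketch of the same calculation done by hand. It assumes the posterior is obtained the way Bayes' rule prescribes on a discrete grid (multiply prior by likelihood, then divide by the sum of the products); the column names `Product` and `Posterior` and the object `grid` are my own choices for this sketch. The last lines add up the posterior probability that P falls between .250 and .350, which is one way to check how much of the posterior that range covers.

```r
# Grid of candidate values for the true batting average P.
grid <- data.frame(P = round(seq(0.10, 0.40, by = 0.01), 2))

# Discrete uniform prior: each of the 31 values is equally likely.
grid$Prior <- 1 / nrow(grid)

# Likelihood: binomial probability of 30 hits in 100 at-bats at each P.
grid$Likelihood <- dbinom(30, size = 100, prob = grid$P)

# Bayes' rule on a grid: posterior is proportional to prior times likelihood.
grid$Product   <- grid$Prior * grid$Likelihood
grid$Posterior <- grid$Product / sum(grid$Product)

# Posterior probability that P lies between .250 and .350 (small tolerance
# guards against floating-point error at the endpoints).
in_range   <- grid$P >= 0.25 - 1e-8 & grid$P <= 0.35 + 1e-8
prob_range <- sum(grid$Posterior[in_range])

print(grid[which.max(grid$Posterior), c("P", "Posterior")])
print(prob_range)
```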

### Why Pipe?

What is the advantage of piping data manipulation operations using `%>%`?

- I agree with the `magrittr` author that the piping R code is more intuitive to read and write.
- In particular, piping seems preferable to the use of multiple parentheses to perform the composition of multiple functions.
- Piping is useful for functions that accept a data frame as the first argument or for functions (such as `lm`) that can use a `data = .` argument.
- It is easy to modify the execution order of the multiple functions by simply cutting and pasting lines of code. Also it is easy to add or subtract functions in the pipe.
- The usefulness of piping goes beyond data manipulation: one can pipe into `ggplot2` functions or pipe with functions that are user-defined.
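Two of the points above can be seen in a few lines. This sketch uses a made-up numeric vector and R's built-in `mtcars` data (neither appears in the analysis above) to compare nested parentheses with a pipe, and to show the `data = .` placeholder with `lm`.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# The piped version is read left to right, in the order the steps happen.
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)

# lm() takes a formula first, so the dot tells the pipe where the data goes.
fit <- mtcars %>%
  lm(mpg ~ wt, data = .)

coef(fit)
```

Reordering or dropping a step in the piped version means moving or deleting one line, whereas the nested version requires rebalancing parentheses.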