In the last post, I showed how to read the 2013 Retrosheet play-by-play data into R and how to compute the run values of all plays using a function version of the R code from our book. Here we use these data to find the best clutch performers in the 2013 season.
We have a data frame d2013 containing all of the plays. We use the subset function to restrict attention to plays where there was a batting event (excluding events like attempted steals).
d2013 <- subset(d2013, BAT_EVENT_FL == TRUE)
In the function from the previous post, we added a new variable STATE, which gives the current runners on base and the number of outs. We define a new variable Scoring.Position, which is “yes” if there is a runner in scoring position (on second or third base) and “no” otherwise.
d2013$Scoring.Position <- with(d2013, ifelse(STATE=="010 0" | STATE=="010 1" | STATE=="010 2" | STATE=="011 0" | STATE=="011 1" | STATE=="011 2" | STATE=="110 0" | STATE=="110 1" | STATE=="110 2" | STATE=="101 0" | STATE=="101 1" | STATE=="101 2" | STATE=="001 0" | STATE=="001 1" | STATE=="001 2" | STATE=="111 0" | STATE=="111 1" | STATE=="111 2", "yes", "no") )
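The long chain of comparisons above can be written more compactly. The sketch below assumes STATE always has the form seen in the strings above (three 0/1 digits for first, second and third base, a space, then the number of outs); under that assumption a runner is in scoring position exactly when characters 2 and 3 are not "00". The variable names here are illustrative only, and the check runs over all 24 base-out states rather than the real data.

```r
# Build all 24 base-out states in the "123 outs" format assumed above
runners <- c("000", "100", "010", "001", "110", "101", "011", "111")
STATE <- as.vector(outer(runners, 0:2, paste))

# Short version: second or third base occupied means characters 2-3 are not "00"
sp_short <- ifelse(substr(STATE, 2, 3) != "00", "yes", "no")

# Long version: the explicit list of 18 scoring-position states
sp_states <- as.vector(outer(c("010", "011", "110", "101", "001", "111"), 0:2, paste))
sp_long <- ifelse(STATE %in% sp_states, "yes", "no")

# The two definitions agree on every possible state
identical(sp_short, sp_long)
```

If STATE is stored in a different format in your data, the substr positions would need to change accordingly.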
For each batter, we want to compute the number of plate appearances and the mean runs value for batting plays when runners are in scoring position, and for all other plays. This is conveniently done using the new dplyr package.
library(dplyr)
RUNS.VALUE <- summarise(group_by(d2013, BAT_ID, Scoring.Position), PA = n(), meanRUNS = mean(RUNS.VALUE))
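To see what this grouped summary produces, here is a small base R sketch of the same computation on made-up data. The batter codes and runs values are hypothetical, and aggregate plus merge stand in for the dplyr call only so the example runs without extra packages.

```r
# Hypothetical plays: two batters, a few plate appearances each
toy <- data.frame(
  BAT_ID = c("aaaa001", "aaaa001", "aaaa001", "bbbb002", "bbbb002", "bbbb002"),
  Scoring.Position = c("yes", "yes", "no", "yes", "no", "no"),
  RUNS.VALUE = c(0.5, -0.1, 0.2, 1.0, -0.3, 0.1)
)

# Count of plays per batter and situation
PA <- aggregate(RUNS.VALUE ~ BAT_ID + Scoring.Position, data = toy, FUN = length)
names(PA)[3] <- "PA"

# Mean runs value per batter and situation
meanRUNS <- aggregate(RUNS.VALUE ~ BAT_ID + Scoring.Position, data = toy, FUN = mean)
names(meanRUNS)[3] <- "meanRUNS"

# One row per (batter, situation) pair, as in the dplyr summary
summary_toy <- merge(PA, meanRUNS, by = c("BAT_ID", "Scoring.Position"))
summary_toy
```

Each batter gets up to two rows, one for each value of Scoring.Position, which is the shape the later subset and merge steps rely on.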
Next, we use subset and merge to create a new data frame RUNSsituation. A given row will contain the PA and mean runs value for a given batter in SP and non-SP situations. We only consider hitters who have at least 100 PAs in each situation.
RUNS.VALUE1 <- subset(RUNS.VALUE, PA >= 100)
RUNS.SP <- subset(RUNS.VALUE1, Scoring.Position=="yes")
RUNS.NSP <- subset(RUNS.VALUE1, Scoring.Position=="no")
RUNSsituation <- merge(RUNS.SP, RUNS.NSP, by="BAT_ID")
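The merge step does two things that are easy to miss: it keeps only batters who appear in both data frames, and it renames the shared columns with .x (first argument, here the SP data) and .y (second argument, the non-SP data) suffixes. A minimal sketch with hypothetical batter codes and values:

```r
# Hypothetical SP and non-SP summaries; only "aaaa001" appears in both
RUNS.SP <- data.frame(BAT_ID = c("aaaa001", "bbbb002"),
                      Scoring.Position = "yes",
                      PA = c(120, 150),
                      meanRUNS = c(0.05, 0.01))
RUNS.NSP <- data.frame(BAT_ID = c("aaaa001", "cccc003"),
                       Scoring.Position = "no",
                       PA = c(300, 280),
                       meanRUNS = c(0.02, -0.01))

# Default merge is an inner join on BAT_ID
both <- merge(RUNS.SP, RUNS.NSP, by = "BAT_ID")

# Shared non-key columns get .x (from RUNS.SP) and .y (from RUNS.NSP)
names(both)
both
```

So a batter with 100 or more PAs in only one of the two situations drops out of RUNSsituation entirely.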
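We compute the Mean runs value and the Difference, the difference between the mean runs values in scoring position and non-scoring position situations. After the merge, the .x columns come from the scoring position data and the .y columns from the non-scoring position data.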
RUNSsituation$Mean <- with(RUNSsituation, (PA.x * meanRUNS.x + PA.y * meanRUNS.y) / (PA.x + PA.y))
RUNSsituation$Difference <- with(RUNSsituation, meanRUNS.x - meanRUNS.y)
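The Mean formula is a PA-weighted average of the two situational means, so it equals the batter's overall mean runs value across both situations. A quick worked example for one hypothetical batter (the numbers are invented for illustration):

```r
# Hypothetical batter: 120 PA with runners in scoring position, 300 PA without
PA.x <- 120; meanRUNS.x <- 0.05   # scoring position
PA.y <- 300; meanRUNS.y <- 0.02   # non-scoring position

# Same formulas as above, applied to scalars
Mean <- (PA.x * meanRUNS.x + PA.y * meanRUNS.y) / (PA.x + PA.y)
Difference <- meanRUNS.x - meanRUNS.y

# Cross-check: Mean is just a weighted mean with PA as the weights
Mean.check <- weighted.mean(c(meanRUNS.x, meanRUNS.y), w = c(PA.x, PA.y))

c(Mean = Mean, Mean.check = Mean.check, Difference = Difference)
```

Here the total runs value is 120 × 0.05 + 300 × 0.02 = 12 over 420 PA, so Mean is 12/420, and Difference is 0.05 − 0.02 = 0.03.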
The ggplot2 package is used to plot the difference (which we call Clutch) against the mean (which we call Performance). I plot abbreviated player codes so we can easily identify hitters.
library(ggplot2)
ggplot(RUNSsituation, aes(Mean, Difference, label=substr(BAT_ID, 1, 4))) +
  geom_text(color="blue") +
  geom_hline(yintercept=0, color="red") +
  geom_vline(xintercept=0, color="red") +
  xlab("PERFORMANCE") + ylab("CLUTCH")
From the plot we see that Miguel Cabrera and Chris Davis had the highest mean performances, and Freddie Freeman and Allen Craig had the best clutch performances using our definition of clutch. B.J. Upton was one of the weakest performers (from a runs value perspective) and was also one of the worst clutch performers using this measure. What is interesting is that there is a pretty strong positive relationship between performance and clutch, so the best clutch performers tend to be the better hitters. In our search for clutch players, then, we may need to adjust for level of performance.
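One possible way to make such an adjustment, offered only as a sketch and not as the method used in this post, is to regress Difference on Mean and treat the residual as a performance-adjusted clutch measure. The data below are simulated purely to make the example runnable; with the real data one would fit the same model to RUNSsituation. The column name Clutch.Adjusted is hypothetical.

```r
# Simulated stand-in for RUNSsituation (not real player data)
set.seed(1)
toy <- data.frame(Mean = rnorm(50, mean = 0, sd = 0.02))
toy$Difference <- 0.5 * toy$Mean + rnorm(50, mean = 0, sd = 0.02)

# Linear fit of clutch (Difference) on performance (Mean)
fit <- lm(Difference ~ Mean, data = toy)

# Residual: how much more (or less) clutch a hitter is than
# the fitted line predicts for his level of performance
toy$Clutch.Adjusted <- resid(fit)

# By construction the adjusted measure is uncorrelated with Mean
cor(toy$Clutch.Adjusted, toy$Mean)
```

A straight-line fit is only one choice; whether it is adequate for the real data would need to be checked against the scatterplot.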