Monthly Archives: May, 2014

Team Clutch Hitting

Suppose we’re interested in exploring clutch hitting in baseball. Essentially, scoring runs is a two-stage process: a team puts runners on base and then advances them home. Teams are especially interested in scoring runners who are in scoring position. We'll use Retrosheet data and R to explore the relationship between runners in scoring position and runs scored. Specifically, is there a single number that we can use to summarize this relationship?

We begin by reading in the Retrosheet play-by-play data for the 2013 season. I had earlier created this workspace by downloading all of the Retrosheet play-by-play files. (See the earlier post that describes the process of downloading the Retrosheet files into R.)

load("pbp2013.Rdata")

The variable HALF.INNING is a unique identifier for the game and half-inning. Using the summarize function in the dplyr package, I create a new data frame S with two variables: RSP, the number of runners in scoring position, and RUNS, the number of these runners who eventually score. (The variables BASE2_RUN_ID, BASE3_RUN_ID, RUN2_DEST_ID, and RUN3_DEST_ID are helpful here.)

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
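# One row per batting team and half-inning. The "- 1" presumably drops the
# empty ID recorded when a base is unoccupied; destinations >= 4 count as runs.
S <- summarize(group_by(d2013, BAT_TEAM, HALF.INNING),
    RSP = length(unique(c(as.character(BASE2_RUN_ID), as.character(BASE3_RUN_ID)))) - 1,
    RUNS = sum(RUN2_DEST_ID >= 4) + sum(RUN3_DEST_ID >= 4))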
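
To see what the RSP expression is counting, here is a small sketch on a made-up half-inning. The runner IDs and the toy data frame are hypothetical, and the sketch assumes that an unoccupied base is recorded as an empty string, which is what the subtraction of 1 suggests.

```r
# Hypothetical half-inning with four plays; "" is assumed to mean the base is empty.
toy <- data.frame(
  BASE2_RUN_ID = c("", "smitj001", "", "jonea002"),
  BASE3_RUN_ID = c("", "", "smitj001", ""),
  stringsAsFactors = FALSE
)

# Pool the IDs seen on second and third, keeping each distinct value once.
ids <- unique(c(toy$BASE2_RUN_ID, toy$BASE3_RUN_ID))

# Subtract 1 to drop the empty ID, leaving the distinct runners in scoring position.
rsp <- length(ids) - 1
rsp
```

The runner who moves from second to third is counted once, not twice, because unique() collapses the repeated ID.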

Using the cut function I create a categorical variable Cat.RUNS which classifies the runs scored into the classes “0 Runs”, “1 Run”, etc. We use the subset function to only consider the situations when at least one runner is in scoring position.

S$Cat.RUNS <- cut(S$RUNS, breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5, 1000), labels = c("0 Runs", 
    "1 Run", "2 Runs", "3 Runs", "4+ Runs"))
S.RSP <- subset(S, RSP >= 1)
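
As a quick illustration of how those breaks behave, the sketch below applies the same cut call to a handful of invented run totals. The vector runs is hypothetical; the breaks and labels are the ones used above.

```r
# Hypothetical run totals for seven half-innings.
runs <- c(0, 1, 2, 3, 4, 7, 0)

# Same breaks and labels as above: each of the first four intervals captures
# one integer, and the last interval collects everything from 4 runs upward.
cat_runs <- cut(runs, breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5, 1000),
                labels = c("0 Runs", "1 Run", "2 Runs", "3 Runs", "4+ Runs"))
table(cat_runs)
```

Placing the breaks at half-integers avoids any ambiguity about which interval an integer run total falls into.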

The table function displays the counts for all combinations of RSP and Cat.RUNS.

TB <- with(S.RSP, table(RSP, Cat.RUNS))
TB
##    Cat.RUNS
## RSP 0 Runs 1 Run 2 Runs 3 Runs 4+ Runs
##   1   8817  1908      0      0       0
##   2   1662  2527    566      0       0
##   3     80   723    928    199       0
##   4      4    47    344    378      64
##   5      0     1     31    123     144
##   6      0     0      2      5     115
##   7      0     0      0      0      54
##   8      0     0      0      0      13
##   9      0     0      0      0       2

We see, for example, there were 1908 half-innings where there was exactly one runner in scoring position and that runner scored.

The prop.table function with argument 1 gives the row proportions of the table.

TB <- with(S.RSP, table(RSP, Cat.RUNS))
P <- prop.table(TB, 1)
round(P, 3)
##    Cat.RUNS
## RSP 0 Runs 1 Run 2 Runs 3 Runs 4+ Runs
##   1  0.822 0.178  0.000  0.000   0.000
##   2  0.350 0.531  0.119  0.000   0.000
##   3  0.041 0.375  0.481  0.103   0.000
##   4  0.005 0.056  0.411  0.452   0.076
##   5  0.000 0.003  0.104  0.411   0.482
##   6  0.000 0.000  0.016  0.041   0.943
##   7  0.000 0.000  0.000  0.000   1.000
##   8  0.000 0.000  0.000  0.000   1.000
##   9  0.000 0.000  0.000  0.000   1.000

For example, when there is one runner in scoring position (RSP=1), this runner will score with probability 0.178. When there are two runners in scoring position, both will score with probability 0.119, etc.
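
For readers without the play-by-play file, the row proportions can be checked directly from the printed counts. This is only a sketch: the matrix TB3 is a hypothetical name, and it holds just the first three rows of the count table, typed in by hand.

```r
# Counts copied from the first three rows of the table printed above.
TB3 <- matrix(c(8817, 1908,   0,   0, 0,
                1662, 2527, 566,   0, 0,
                  80,  723, 928, 199, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(RSP = as.character(1:3),
                              Cat.RUNS = c("0 Runs", "1 Run", "2 Runs",
                                           "3 Runs", "4+ Runs")))

# margin = 1 divides each count by its row total, so every row sums to 1.
P3 <- prop.table(TB3, 1)
round(P3, 3)

# The same calculation by hand for RSP = 1: 1908 of 10725 half-innings.
1908 / (8817 + 1908)
```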

I'm interested in exploring the relationship between the number of runners in scoring position and the chance the team scores at least 1 run, the chance the team scores 2 or more runs, and the chance the team scores 3 or more runs. I create a new data frame with three variables Runners.SP, Probability, and Type.

P1plus <- 1 - P[1:5, "0 Runs"]
P2plus <- 1 - P[1:5, "0 Runs"] - P[1:5, "1 Run"]
P3plus <- 1 - P[1:5, "0 Runs"] - P[1:5, "1 Run"] - P[1:5, "2 Runs"]
d1 <- data.frame(Runners.SP = 1:5, Probability = P1plus, Type = "1+ Runs")
d2 <- data.frame(Runners.SP = 1:5, Probability = P2plus, Type = "2+ Runs")
d3 <- data.frame(Runners.SP = 1:5, Probability = P3plus, Type = "3+ Runs")
d <- rbind(d1, d2, d3)
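
The "at least k runs" probabilities are just complements of cumulative sums. As a small worked example, the sketch below uses the RSP = 3 row of the proportion table; the names p and at_least are hypothetical.

```r
# Row proportions for RSP = 3, copied from the table printed earlier.
p <- c("0 Runs" = 0.041, "1 Run" = 0.375, "2 Runs" = 0.481,
       "3 Runs" = 0.103, "4+ Runs" = 0.000)

# P(at least k runs) = 1 - P(fewer than k runs).
# cumsum(p)[1:3] gives P(0 runs), P(at most 1 run), P(at most 2 runs).
at_least <- 1 - cumsum(p)[1:3]
names(at_least) <- c("1+ Runs", "2+ Runs", "3+ Runs")
at_least
```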

I use the ggplot2 package to draw line plots of P(scoring 1+ runs), P(scoring 2+ runs), and P(scoring 3+ runs) against the number of runners in scoring position.

library(ggplot2)
ggplot(d, aes(Runners.SP, Probability, color = Type)) + geom_line(size = 2) + 
    theme(text = element_text(size = rel(5))) + theme(legend.text = element_text(size = rel(4))) + 
    theme(legend.title = element_blank()) + labs(title = "Clutch Hitting - Probability Scale") + 
    theme(plot.title = element_text(size = rel(6), color = "red"))

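[Figure: "Clutch Hitting - Probability Scale". Probability of scoring 1+, 2+, and 3+ runs against the number of runners in scoring position.]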

This graph is tough to interpret because the curves are not linear. The vertical scale (probability) has to fall between 0 and 1, and that constraint gives the curves their S shape.

There is a nice way to improve this graph: we reexpress each probability on the logit scale. We create a new variable Logit = log(prob / (1 - prob)), which transforms P(scoring 1+ runs), P(scoring 2+ runs), and P(scoring 3+ runs). (It takes some practice to get used to thinking in logits. A probability less than 0.5 is reexpressed as a negative logit, and a probability larger than 0.5 as a positive logit. Probabilities fall between 0 and 1, while logits can take any value from -infinity to +infinity.)

d$Logit <- with(d, log(Probability/(1 - Probability)))
d$Logit <- ifelse(is.infinite(d$Logit), NaN, d$Logit)
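
To get a feel for the transformation, here is a short sketch that applies the logit to a few probabilities taken from the table above. The helper logit is a hypothetical name; qlogis and plogis are the corresponding functions in base R's stats package.

```r
# The logit written out, matching the definition in the text.
logit <- function(p) log(p / (1 - p))

p <- c(0.178, 0.5, 0.65, 0.959)
logit(p)   # negative below 0.5, zero at 0.5, positive above 0.5

# qlogis() computes the same quantity, and plogis() inverts it.
all.equal(logit(p), qlogis(p))
all.equal(plogis(logit(p)), p)

# Probabilities of exactly 0 or 1 map to -Inf and Inf, which is why
# infinite logits are replaced with NaN before plotting.
logit(c(0, 1))
```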

Look what happens when we redraw the graph with the logit (instead of the probability) on the vertical scale.

ggplot(d, aes(Runners.SP, Logit, color = Type)) + geom_line(size = 2) + scale_y_continuous(limits = c(-3, 
    6)) + theme(text = element_text(size = rel(5))) + theme(legend.text = element_text(size = rel(4))) + 
    theme(legend.title = element_blank()) + labs(title = "Clutch Hitting - Logit Scale") + 
    theme(plot.title = element_text(size = rel(6), color = "red"))

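[Figure: "Clutch Hitting - Logit Scale". Logit of the probability of scoring 1+, 2+, and 3+ runs against the number of runners in scoring position.]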

This is much easier to interpret, since we now see three roughly parallel lines. Each additional runner in scoring position increases the logit of the probability of scoring 1+ runs by about 2.3. Since the lines have about the same slope, the logit of the probability of scoring 2 or more runs also increases by about 2.3 for each additional runner in scoring position, and a similar statement holds for the chance of scoring 3 or more runs.
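
One way to check the slope for the 1+ runs line is to recompute the logits from the printed counts and fit a straight line. This is a sketch only: the variable names are hypothetical, and ordinary least squares is my choice for illustration, not necessarily how the figure of 2.3 was read off the graph.

```r
# Half-innings with 0 runs and with 1+ runs for RSP = 1..4,
# tallied from the count table printed earlier.
rsp      <- 1:4
zero     <- c(8817, 1662, 80, 4)
one_plus <- c(1908, 2527 + 566, 723 + 928 + 199, 47 + 344 + 378 + 64)

# Logit of P(1+ runs); the odds p / (1 - p) reduce to one_plus / zero.
logit1 <- log(one_plus / zero)

# Successive differences: the change in logit per extra runner.
diff(logit1)

# A least-squares line through the four points gives one summary slope.
slope <- unname(coef(lm(logit1 ~ rsp))["rsp"])
slope
```

The fitted slope and the successive differences should all land in the neighborhood of the 2.3 quoted above. RSP = 5 is left out because every such half-inning produced at least one run, so its logit is infinite.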

In effect, we have reduced a discussion of clutch hitting to a single slope that relates the number of runners in scoring position to the logit of the probability of scoring x+ runs where x can be 1, 2, or 3. In a later post, we'll use this idea to compare the clutch hitting abilities of the 30 teams.