Suppose we’re interested in exploring clutch hitting in baseball. Essentially, scoring runs is a two-stage process: a team puts runners on base and then advances them home. Teams are especially interested in scoring runners who are in scoring position (on second or third base). We'll use Retrosheet data and R to explore the relationship between runners in scoring position and runs scored. Specifically, is there a single number that we can use to summarize this relationship?

We begin by reading in the Retrosheet play-by-play data for the 2013 season. I had earlier created this workspace file by downloading all of the Retrosheet play-by-play files. (See the earlier post, which describes the process of downloading the Retrosheet files into R.)

load("pbp2013.Rdata")

The variable HALF.INNING is a unique identifier for the game and half inning. Using the `summarize` function in the dplyr package, I create a new data frame `S` with two variables: RSP, the number of runners in scoring position, and RUNS, the number of these runners who eventually score. (The variables BASE2_RUN_ID, BASE3_RUN_ID, RUN2_DEST_ID, and RUN3_DEST_ID are helpful here.)

library(dplyr)

##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

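# One row per batting team and half-inning.
S <- summarize(group_by(d2013, BAT_TEAM, HALF.INNING),
               # Distinct runner ids seen on 2nd or 3rd base; the "- 1" drops the
               # blank id that appears when a base is empty.
               RSP = length(unique(c(as.character(BASE2_RUN_ID),
                                     as.character(BASE3_RUN_ID)))) - 1,
               # Runners on 2nd or 3rd whose destination code is 4 or more, i.e. they scored.
               RUNS = sum(RUN2_DEST_ID >= 4) + sum(RUN3_DEST_ID >= 4))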
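To see why the RSP expression subtracts 1, here is a small sketch on a made-up half-inning. The runner ids and the vectors `base2` and `base3` are hypothetical, and the sketch assumes that an empty base is recorded as an empty string.

```r
# Hypothetical half-inning with four plays; "" means the base is empty.
# The runner ids are invented for illustration only.
base2 <- c("", "smitj001", "smitj001", "jonea002")
base3 <- c("", "", "", "smitj001")

# Pool the ids seen on second and third base, then keep the distinct values.
ids <- unique(c(base2, base3))

# The empty string counts as one distinct value, so subtract 1
# to get the number of distinct runners in scoring position.
rsp <- length(ids) - 1
rsp
```

Note that a runner who moves from second to third (like the first runner here) is counted only once, which is the point of taking unique ids across both bases.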

Using the `cut` function I create a categorical variable Cat.RUNS which classifies the runs scored into the classes “0 Runs”, “1 Run”, etc. We use the `subset` function to only consider the situations when at least one runner is in scoring position.

S$Cat.RUNS <- cut(S$RUNS,
                  breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5, 1000),
                  labels = c("0 Runs", "1 Run", "2 Runs", "3 Runs", "4+ Runs"))
S.RSP <- subset(S, RSP >= 1)
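As a quick illustration of how these breaks behave, the sketch below applies the same `cut` call to a few toy run totals. The vector `runs` is made up and is not taken from the Retrosheet data.

```r
# Toy run totals, invented for illustration.
runs <- c(0, 1, 2, 3, 4, 7)

# Same breaks and labels as above. With the default right = TRUE, each
# interval is (lower, upper], so every half-integer interval holds exactly
# one integer, and the last interval collects everything from 4 runs up.
cat_runs <- cut(runs,
                breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5, 1000),
                labels = c("0 Runs", "1 Run", "2 Runs", "3 Runs", "4+ Runs"))
as.character(cat_runs)
```

Both 4 and 7 land in the "4+ Runs" class, while 0 through 3 each get their own class.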

The `table` function displays the counts for all combinations of RSP and Cat.RUNS.

TB <- with(S.RSP, table(RSP, Cat.RUNS))
TB

##    Cat.RUNS
## RSP 0 Runs 1 Run 2 Runs 3 Runs 4+ Runs
##   1   8817  1908      0      0       0
##   2   1662  2527    566      0       0
##   3     80   723    928    199       0
##   4      4    47    344    378      64
##   5      0     1     31    123     144
##   6      0     0      2      5     115
##   7      0     0      0      0      54
##   8      0     0      0      0      13
##   9      0     0      0      0       2

We see, for example, there were 1908 half-innings where there was exactly one runner in scoring position and that runner scored.

The `prop.table` function with argument 1 gives the row proportions of the table.

TB <- with(S.RSP, table(RSP, Cat.RUNS))
P <- prop.table(TB, 1)
round(P, 3)

##    Cat.RUNS
## RSP 0 Runs 1 Run 2 Runs 3 Runs 4+ Runs
##   1  0.822 0.178  0.000  0.000   0.000
##   2  0.350 0.531  0.119  0.000   0.000
##   3  0.041 0.375  0.481  0.103   0.000
##   4  0.005 0.056  0.411  0.452   0.076
##   5  0.000 0.003  0.104  0.411   0.482
##   6  0.000 0.000  0.016  0.041   0.943
##   7  0.000 0.000  0.000  0.000   1.000
##   8  0.000 0.000  0.000  0.000   1.000
##   9  0.000 0.000  0.000  0.000   1.000

For example, when there is one runner in scoring position (RSP=1), this runner will score with probability 0.178. When there are two runners in scoring position, both will score with probability 0.119, etc.
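For readers who want to check these proportions without the full data set, the following sketch rebuilds the first two rows of the count table by hand and applies `prop.table` to them. The object names `counts`, `by_row`, and `by_hand` are my own.

```r
# Counts copied from the RSP = 1 and RSP = 2 rows of the count table above.
counts <- matrix(c(8817, 1908,   0,
                   1662, 2527, 566),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(RSP = c("1", "2"),
                                 Cat.RUNS = c("0 Runs", "1 Run", "2 Runs")))

# margin = 1 divides each count by its row total.
by_row <- prop.table(counts, 1)

# The same calculation done by hand.
by_hand <- counts / rowSums(counts)

round(by_row, 3)
all.equal(by_row, by_hand)
```

Each row of `by_row` sums to 1, so every entry is a conditional proportion given the number of runners in scoring position.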

I'm interested in exploring the relationship between the number of runners in scoring position and the chance the team scores at least 1 run, the chance the team scores 2 or more runs, and the chance the team scores 3 or more runs. I create a new data frame with three variables Runners.SP, Probability, and Type.

# Probabilities of scoring at least 1, 2, and 3 runs for RSP = 1, ..., 5.
P1plus <- 1 - P[1:5, "0 Runs"]
P2plus <- 1 - P[1:5, "0 Runs"] - P[1:5, "1 Run"]
P3plus <- 1 - P[1:5, "0 Runs"] - P[1:5, "1 Run"] - P[1:5, "2 Runs"]
# One data frame per curve, stacked into long format for plotting.
d1 <- data.frame(Runners.SP = 1:5, Probability = P1plus, Type = "1+ Runs")
d2 <- data.frame(Runners.SP = 1:5, Probability = P2plus, Type = "2+ Runs")
d3 <- data.frame(Runners.SP = 1:5, Probability = P3plus, Type = "3+ Runs")
d <- rbind(d1, d2, d3)

I use the `ggplot2` package to draw line plots of P(scoring 1+ runs), P(scoring 2+ runs), and P(scoring 3+ runs) against the number of runners in scoring position.

library(ggplot2)
ggplot(d, aes(Runners.SP, Probability, color = Type)) +
  geom_line(size = 2) +
  theme(text = element_text(size = rel(5))) +
  theme(legend.text = element_text(size = rel(4))) +
  theme(legend.title = element_blank()) +
  labs(title = "Clutch Hitting - Probability Scale") +
  theme(plot.title = element_text(size = rel(6), color = "red"))

This graph is tough to interpret because the curves are not linear: the vertical scale (probability) has to fall between 0 and 1, and that forces the curves into an S shape.

There is a nice way to improve this graph: we reexpress each probability on the logit scale. We create a new variable Logit = log(prob / (1 - prob)), which is used to transform P(scoring 1+ runs), P(scoring 2+ runs), and P(scoring 3+ runs). (It takes some practice to get used to thinking in logits. A probability less than 0.5 is reexpressed as a negative logit, and a probability larger than 0.5 as a positive logit. Probabilities fall between 0 and 1, while logits can take any value from -infinity to +infinity.)

d$Logit <- with(d, log(Probability/(1 - Probability)))

d$Logit <- ifelse(is.infinite(d$Logit), NaN, d$Logit)
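To get a feel for the transformation, here is a minimal sketch that applies it to a few probabilities. The helper `logit` is my own name for the formula above; `qlogis` and `plogis` are the base R functions for the logit and its inverse.

```r
# Logit written out explicitly; qlogis() in base R computes the same thing.
logit <- function(p) log(p / (1 - p))

# 0.178 and 0.822 are the two proportions in the RSP = 1 row.
p <- c(0.178, 0.5, 0.822)

logit(p)                        # negative below 0.5, zero at 0.5, positive above
all.equal(logit(p), qlogis(p))  # the two versions agree

# plogis() is the inverse: it maps a logit back to a probability.
plogis(logit(p))

# Probabilities of exactly 0 or 1 map to -Inf and Inf, which is why
# the infinite values are replaced before plotting.
logit(c(0, 1))
```

Because 0.178 and 0.822 sum to 1, their logits are equal in size and opposite in sign, which shows the symmetry of the logit scale around a probability of 0.5.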

Look what happens when we redraw the graph with the logit (instead of the probability) on the vertical scale.

ggplot(d, aes(Runners.SP, Logit, color = Type)) +
  geom_line(size = 2) +
  scale_y_continuous(limits = c(-3, 6)) +
  theme(text = element_text(size = rel(5))) +
  theme(legend.text = element_text(size = rel(4))) +
  theme(legend.title = element_blank()) +
  labs(title = "Clutch Hitting - Logit Scale") +
  theme(plot.title = element_text(size = rel(6), color = "red"))

This is much easier to interpret since we see three roughly parallel lines. Having one more runner in scoring position increases the logit of the probability of scoring 1+ runs by about 2.3. Also, since the lines have about the same slope, the logit of the chance of scoring 2 or more runs increases by about 2.3 for each additional runner in scoring position. A similar statement can be made about the chance of scoring 3 or more runs.
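One way to check the "about 2.3" figure is to recompute the logits from the counts printed earlier and fit a least-squares line to each curve. This is only a sketch: the counts are copied from the count table (rows RSP = 1 to 5), the names `counts`, `at_least`, and `slope` are my own, and a simple `lm` fit is an assumption about how one might summarize the slope, not necessarily how the figure in the text was obtained.

```r
# Counts for RSP = 1..5, copied from the count table
# (columns: 0, 1, 2, 3, and 4+ runs).
counts <- matrix(c(8817, 1908,   0,   0,   0,
                   1662, 2527, 566,   0,   0,
                     80,  723, 928, 199,   0,
                      4,   47, 344, 378,  64,
                      0,    1,  31, 123, 144),
                 nrow = 5, byrow = TRUE)
rsp <- 1:5
total <- rowSums(counts)

# P(at least k runs) for each row, computed from the counts directly
# so that no rounding error from the printed proportions creeps in.
at_least <- function(k) rowSums(counts[, (k + 1):5, drop = FALSE]) / total

# Least-squares slope of logit(P) against RSP, skipping rows where the
# probability is exactly 0 or 1 (their logits are infinite).
slope <- function(k) {
  p <- at_least(k)
  ok <- p > 0 & p < 1
  x <- rsp[ok]
  y <- qlogis(p[ok])
  unname(coef(lm(y ~ x))["x"])
}

slopes <- sapply(1:3, slope)
round(slopes, 2)

# A logit change of 2.3 multiplies the odds by exp(2.3), which is close to 10.
exp(2.3)
```

The three fitted slopes should land in the general vicinity of 2.3, though they need not agree exactly since the lines are only approximately parallel. Read on the odds scale, a slope of that size means each additional runner in scoring position multiplies the odds of scoring x+ runs by roughly 10.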

In effect, we have reduced a discussion of clutch hitting to a single slope that relates the number of runners in scoring position to the logit of the probability of scoring x+ runs where x can be 1, 2, or 3. In a later post, we'll use this idea to compare the clutch hitting abilities of the 30 teams.

This might be a stupid question, but I am not seeing the field BAT_TEAM in my 2013 event data. Could this be a version issue or a derived field I need to compute? I tried searching for “Retrosheet BAT_TEAM” in Google but that was not helpful.

BAT_TEAM is a variable I defined (sorry it was not clear). In Retrosheet, you have the home team id from the GAME_ID and the away team id from AWAY_TEAM_ID. The Retrosheet variable BAT_HOME_ID tells you which half of the inning you are in. From these three variables, I used an ifelse statement to define BAT_TEAM, the id of the team that is batting.

Thanks! I found it harder than I expected to create BAT_TEAM as a factor, so in case anyone is interested here is what I did:

d2013$BAT_TEAM <- as.factor(with(d2013, ifelse(BAT_HOME_ID == 1, substr(GAME_ID,1,3), as.character(AWAY_TEAM_ID))))

BTW, I saw some small differences in the Cat.RUNS tables for my run. Not enough to appreciably change the plots, but odd.