In Chapter 7 of Analyzing Baseball with R, we look at ball and strike effects. Specifically, we look at pitch sequences that pass through different pitch counts and see the effect on expected run values. I’m currently revising my Teaching Statistics Using Baseball book. One of my chapters is on modeling baseball by a Markov Chain and I thought that the pitch count sequence would be a good illustration of a Markov Chain that I could add as an exercise. Here I illustrate how one can compute the pitch count transitions using Retrosheet play-by-play data.

Here’s an outline of my work with some interesting graphs. I am assuming that one has the Retrosheet play-by-play data for a specific season; here I have the 2014 data stored in the data frame `pbp.14`.

The variable `PITCH_SEQ_TX` has the pitch sequence. I use the `gsub` function to remove all non-pitches from the sequence.

    pbp.14$pseq <- gsub("[.>123N+*]", "", pbp.14$PITCH_SEQ_TX)

I recode to either balls (b) or strikes (s).

    pbp.14$pseq <- gsub("[BIPV]", "b", pbp.14$pseq)
    pbp.14$pseq <- gsub("[CFKLMOQRST]", "s", pbp.14$pseq)
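To make the two cleaning steps concrete, here is a small sketch that runs them on a few made-up pitch sequences. The strings in `raw` are hypothetical and only use characters that appear in the patterns above; they are not taken from `pbp.14`.

```r
# Hypothetical pitch sequences in Retrosheet notation (not real 2014 data)
raw <- c("CBFX", "B1BBCB", "C*BS>S")

# Step 1: drop the characters that do not record a pitch
pseq <- gsub("[.>123N+*]", "", raw)

# Step 2: recode ball-type pitches to "b" and strike-type pitches to "s"
pseq <- gsub("[BIPV]", "b", pseq)
pseq <- gsub("[CFKLMOQRST]", "s", pseq)

pseq
```

The in-play marker `X` is in neither pattern, so it survives the recoding and still marks the end of the PA.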

I wrote a function `one.string` to extract all of the pitch counts from a single character string of balls and strikes. It returns the count before and after each pitch; the end of the PA is coded by “X”.

    one.string <- function(ex){
      # replace s and b with X for strikeouts and walks
      ex <- gsub("s$", "X", ex)
      ex <- gsub("b$", "X", ex)
      # create a vector of individual outcomes
      ex.v <- unlist(strsplit(ex, ""))
      # remove last X from vector
      ex.v <- ex.v[-length(ex.v)]
      # compute cumulative total of balls and strikes
      n.balls <- cumsum(ex.v == "b")
      n.strikes <- pmin(cumsum(ex.v == "s"), 2)
      # create pitch count variable
      S <- paste(n.balls, n.strikes, sep = "-")
      # add a beginning and end outcome
      S <- c("0-0", S, "X")
      # before and after counts
      b.count <- S[1:(length(S) - 1)]
      e.count <- S[-1]
      list(b.count, e.count)
    }

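I use the `sapply` function to apply this function to all pitch sequence strings. Since `one.string` returns a list of length two, the result `S` is a two-row matrix of lists: row 1 holds the before counts and row 2 holds the after counts.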

    S <- sapply(pbp.14$pseq, one.string)
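As a quick check of what `one.string` and the `sapply` call produce, here is a sketch on two made-up recoded sequences. The function body is a copy of the one in the post, repeated only so the snippet runs by itself; the names `toy`, `one`, `S.toy`, `before`, and `after` are mine.

```r
# Copy of one.string from the post, repeated so this snippet runs by itself
one.string <- function(ex){
  ex <- gsub("s$", "X", ex)
  ex <- gsub("b$", "X", ex)
  ex.v <- unlist(strsplit(ex, ""))
  ex.v <- ex.v[-length(ex.v)]
  n.balls <- cumsum(ex.v == "b")
  n.strikes <- pmin(cumsum(ex.v == "s"), 2)
  S <- paste(n.balls, n.strikes, sep = "-")
  S <- c("0-0", S, "X")
  b.count <- S[1:(length(S) - 1)]
  e.count <- S[-1]
  list(b.count, e.count)
}

# Hypothetical recoded sequences (not taken from pbp.14):
# "sbsX" ends with a ball in play; "ssss" is two strikes, a foul, then a strikeout
toy <- c("sbsX", "ssss")

# A single plate appearance
one <- one.string("ssss")
one[[1]]  # counts before each pitch
one[[2]]  # counts after each pitch

# All plate appearances at once: a 2 x 2 matrix of lists
S.toy <- sapply(toy, one.string)
dim(S.toy)

# Stack the before and after counts into two parallel vectors
before <- unname(unlist(S.toy[1, ]))
after <- unname(unlist(S.toy[2, ]))
data.frame(before, after)
```

For the sequence `"ssss"`, the before counts are 0-0, 0-1, 0-2, 0-2 and the after counts are 0-1, 0-2, 0-2, X. The repeated 0-2 is the foul ball with two strikes, which `pmin` keeps from becoming a third strike.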

Finally, I use the `table` function to tabulate the transitions in pitch counts. Using this data, we can represent the pitch sequence as a Markov Chain with absorbing state “end of the PA”. The matrix `P`, computed below, gives the transition matrix. For example, the matrix value `P["0-1", "1-1"]` gives the probability of moving from a 0-1 count to a 1-1 count.

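    # tabulate transitions: rows are counts before the pitch, columns are counts after
    TR <- table(unlist(S[1, ]), unlist(S[2, ]))
    # row proportions for the 12 starting counts (column 12 of TR is dropped)
    P <- prop.table(TR[1:12, -12], 1)
    # add the absorbing row for X and an empty first column for 0-0
    P <- rbind(P, c(rep(0, 11), 1))
    P <- cbind(rep(0, 13), P)
    dimnames(P)[[1]][13] <- "X"
    dimnames(P)[[2]][1] <- "0-0"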
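The positional indices `1:12` and `-12` depend on which counts happen to show up in the data. As an alternative sketch, not the approach used above, one can fix the set of states by name before tabulating. The vectors `b` and `e` below are hypothetical stand-ins for `unlist(S[1, ])` and `unlist(S[2, ])`, and `counts` and `states` are names I introduce here.

```r
# Hypothetical before/after counts, standing in for unlist(S[1, ]) and unlist(S[2, ])
b <- c("0-0", "0-1", "1-1", "1-2", "0-0", "0-1", "0-2", "0-2", "0-0")
e <- c("0-1", "1-1", "1-2", "X",   "0-1", "0-2", "0-2", "X",   "X")

# The 12 pitch counts plus the absorbing state X, in a fixed order
counts <- paste(rep(0:3, each = 3), rep(0:2, times = 4), sep = "-")
states <- c(counts, "X")

# Fixing the factor levels gives a 13 x 13 table even when a state is unobserved
TR <- table(factor(b, levels = states), factor(e, levels = states))

# Row proportions for the 12 counts, selected by name rather than position.
# Counts never seen in this toy data give NaN rows; with a full season of
# play-by-play data every count should occur.
P <- prop.table(TR[counts, ], 1)

# Absorbing state: once the PA has ended, it stays ended
P <- rbind(P, X = c(rep(0, 12), 1))

round(P[c("0-0", "0-2", "X"), c("0-1", "0-2", "1-0", "X")], 2)
```

Selecting rows and columns by name means an unexpected label in the data cannot shift the matrix, at the cost of spelling out the states up front.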

Here’s the first row of this transition matrix. From a 0-0 count, we’ll move to 0-1, 1-0, or “in-play” (the X state) with probabilities .49, .39, and .12.

    round(P[1, ], 2)
     0-0  0-1  0-2  1-0  1-1  1-2  2-0  2-1  2-2  3-0  3-1  3-2    X
    0.00 0.49 0.00 0.39 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.12

Here are a few interesting graphs from this transition matrix. The first shows the probability of adding a strike to the count for each of the initial counts. We see the probability of adding a strike is relatively low for a 0-1 count, but relatively high for a 3-0 count.

The second shows the probability of adding a ball to the count. Here we see that it is more likely to add a ball on a 0-2 count, but less likely to add a ball to 2-1 and 2-2 counts.

Last, we show the probability of keeping the two-strike count (with a foul ball). As you might expect, the probability of keeping the count is highest for 3-2, followed by 2-2, 1-2, and 0-2.
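The quantities in these three graphs can be read directly off the transition matrix. Here is one way to do it, as a sketch that assumes `P` has the count labels and `X` as its row and column names. The function name `count.probs` and its column names are my own, and this is not necessarily how the graphs were produced.

```r
# Sketch: pull the graphed probabilities out of a transition matrix P
# whose row and column names are counts such as "0-1", plus "X".
count.probs <- function(P){
  counts <- setdiff(rownames(P), "X")

  # split each "balls-strikes" label into its two numbers
  parts <- do.call(rbind, strsplit(counts, "-"))
  balls <- as.integer(parts[, 1])
  strikes <- as.integer(parts[, 2])

  # look up P[from, to] pair by pair; NA when the target is not a count
  lookup <- function(from, to){
    mapply(function(f, t) if (t %in% colnames(P)) P[f, t] else NA_real_,
           from, to, USE.NAMES = FALSE)
  }

  data.frame(
    count = counts,
    # move from b-s to b-(s+1); NA for two-strike counts
    add.strike = lookup(counts, paste(balls, strikes + 1, sep = "-")),
    # move from b-s to (b+1)-s; NA for three-ball counts
    add.ball = lookup(counts, paste(balls + 1, strikes, sep = "-")),
    # stay at the same count, which only makes sense with two strikes
    stay = ifelse(strikes == 2, lookup(counts, counts), NA_real_)
  )
}
```

Calling `count.probs(P)` on the matrix computed earlier returns one row per count, and each column can then be plotted against `count`.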

This is interesting stuff, especially when one explores how these pitch count transitions depend on other variables such as home/away and umpire. (For example, it is more likely to move from 0-0 to 0-1 when the batter is from the visiting team.)

I’ll leave that analysis to the interested reader.

All of the code for this example can be found on my gist site.