There is a lot of interest nowadays in the number of pitches in a plate appearance (PA). We seem to like hitters like Bobby Abreu and Jayson Werth, who are very patient and tend to have long PA's, and we're critical of hitters like Salvador Perez (the person who made the last out in the 2014 World Series), who tend to be impatient and have short PA's.
That raises the question: how much benefit does a hitter have in lengthening the PA?
Here are some initial observations:

Some batters like to swing at the first pitch, thinking it will be a good pitch to hit.

As the pitch count progresses, the advantage shifts toward the pitcher or the batter, and this affects the location and choice of pitch.

When the batter sees a lot of pitches, then this experience might lead to a successful PA. (I'm thinking of the hitter who fouls off a lot of pitches.)
Anyway, here's an initial analysis of the number-of-pitches effect using 2013 play-by-play data. A common way to measure the goodness of a PA is the run value, which we use here. (By the way, Chapter 7 of Analyzing Baseball Data with R looks more carefully at ball and strike effects.)
First, I load an R workspace containing the 2013 play-by-play data with the run value variable included. (In an earlier post, I give functions for downloading Retrosheet data and creating this run value variable.)
load("~/Desktop/PGP Folder/pbp2013.Rdata")
We wish to restrict attention to only batting plays.
d2013 <- subset(d2013, BAT_EVENT_FL==TRUE)
From the pitch variable PITCH_SEQ_TX, we remove pickoff throws, etc., so that the remaining characters in the string correspond only to pitches.
d2013$pseq <- gsub("[.>123N+*]", "", d2013$PITCH_SEQ_TX)
Using the mutate function in the dplyr package, we create the variable N.Pitches, the number of pitches in the PA.
library(dplyr)
d2013 <- mutate(d2013, N.Pitches=nchar(pseq))
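To see what the cleaning step does, here is a minimal sketch on three made-up pitch sequences (these strings are invented for illustration and are not taken from the 2013 data). It applies the same character class used above and then counts the remaining characters.

```r
# Invented pitch sequences in Retrosheet-style notation (illustrative only)
seqs <- c("CBFX", "B1B>S.FX", "*BCC+1S")

# Drop the non-pitch characters: . > 1 2 3 N + *
cleaned <- gsub("[.>123N+*]", "", seqs)

# Each remaining character is one pitch
n_pitches <- nchar(cleaned)

print(data.frame(seqs, cleaned, n_pitches))
```

Inside the square brackets, characters such as `.`, `+`, and `*` are treated literally, so no escaping is needed.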
To begin, we construct a histogram of the number of pitches across all PA's.
plot(table(d2013$N.Pitches), xlab="NUMBER OF PITCHES", ylab="COUNT")
Since the number of pitches larger than 10 is pretty small, we collapse the values > 10 to 10.
d2013 <- mutate(d2013, N.Pitches.C = ifelse(N.Pitches > 10, 10, N.Pitches))
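As a small aside, the same collapsing can be written with base R's `pmin`, which some readers may find easier to read. The vector `n` below is a made-up example, not the real pitch counts.

```r
# Made-up pitch counts (illustrative only)
n <- c(1, 4, 10, 11, 14)

# The approach used in the post
capped_ifelse <- ifelse(n > 10, 10, n)

# An equivalent alternative: take the elementwise minimum with 10
capped_pmin <- pmin(n, 10)

print(cbind(n, capped_ifelse, capped_pmin))
```

Both versions cap the values at 10 and leave smaller values unchanged.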
We find the mean run value for different number of pitches and graph with a reference line of zero.
S <- summarize(group_by(d2013, N.Pitches.C),
               N=length(RUNS.VALUE), Mean=mean(RUNS.VALUE))
library(ggplot2)
ggplot(S, aes(N.Pitches.C, Mean)) +
  geom_point(size=4) +
  geom_hline(yintercept=0, size=2, color="red") +
  labs(title="Run Value of Number of Pitches")
This seems to indicate three things: there is a large advantage to the batter if he puts the first or second pitch in play; the pitcher has a big advantage in plate appearances that last 3, 4, or 5 pitches; and the batter's advantage increases past 6 pitches.
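For readers who want to check what the grouped summary computes, here is a rough base R sketch on a tiny invented data frame. The data frame `toy` and its values are hypothetical; only the column names mirror the ones used in this post.

```r
# Tiny invented data set (values are made up, not real run values)
toy <- data.frame(
  N.Pitches.C = c(1, 1, 3, 3, 3, 7),
  RUNS.VALUE  = c(0.4, -0.2, -0.3, -0.3, 0.0, 0.5)
)

# Count and mean of RUNS.VALUE within each pitch count
S_toy <- data.frame(
  N.Pitches.C = sort(unique(toy$N.Pitches.C)),
  N    = as.vector(table(toy$N.Pitches.C)),
  Mean = as.vector(tapply(toy$RUNS.VALUE, toy$N.Pitches.C, mean))
)

print(S_toy)
```

Keeping the count N next to each mean is useful, since the means for the longest PA's rest on relatively few observations.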
But this graph is a bit misleading since we have not considered the effects of strikeouts or walks.
Below we remove the PA's that result in strikeouts.
d2013.no.SO <- filter(d2013, EVENT_CD != 3)
S1 <- summarize(group_by(d2013.no.SO, N.Pitches.C),
                N=length(RUNS.VALUE), Mean=mean(RUNS.VALUE))
ggplot(S1, aes(N.Pitches.C, Mean)) +
  geom_point(size=4) +
  geom_hline(yintercept=0, size=2, color="red") +
  geom_smooth(size=2, color="green") +
  labs(title="Run Value, PA's Excluding SO")
So if we remove SO, then all PA lengths are favorable to the hitter. But what happens if we remove walks?
d2013.no.SO.BB <- filter(d2013, EVENT_CD != 3, EVENT_CD != 14, EVENT_CD != 15)
S2 <- summarize(group_by(d2013.no.SO.BB, N.Pitches.C),
                N=length(RUNS.VALUE), Mean=mean(RUNS.VALUE))
ggplot(S2, aes(N.Pitches.C, Mean)) +
  geom_point(size=4) +
  geom_hline(yintercept=0, size=2, color="red") +
  geom_smooth(size=2, color="green") +
  labs(title="Run Values, Balls Put in Play")
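The filter above drops event codes 3 (strikeout), 14 (walk), and 15 (intentional walk). A minimal sketch of the same idea with `%in%` is shown below on a made-up vector of event codes. One caveat, as I understand the Retrosheet event codes: other non-contact events such as hit by pitch (code 16) are not removed by this filter, so "balls put in play" is an approximation.

```r
# Made-up event codes (illustrative only)
toy <- data.frame(EVENT_CD = c(2, 3, 14, 15, 16, 20, 23))

# Codes to drop: strikeout, walk, intentional walk
drop_codes <- c(3, 14, 15)

# Keep every row whose event code is not in the drop list
kept <- toy[!(toy$EVENT_CD %in% drop_codes), , drop = FALSE]

print(kept$EVENT_CD)
```

Writing the codes as a single vector makes it easy to add or remove events from the exclusion list.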
Now we see that for PA's ignoring SO and BB (that is, for balls put in play), the effect of the number of pitches is more subtle. It is advantageous to the batter to prolong the PA. The best average Run Values (for balls put in play) are for PA's that last 8 and 9 pitches.
Let's put the three situations on the same graph.
ALL <- data.frame(N.Pitches=S$N.Pitches.C, Run.Value=S$Mean, Type="All PA")
ALL <- rbind(ALL, data.frame(N.Pitches=S1$N.Pitches.C, Run.Value=S1$Mean, Type="NO SO"))
ALL <- rbind(ALL, data.frame(N.Pitches=S2$N.Pitches.C, Run.Value=S2$Mean, Type="In Play"))
ggplot(ALL, aes(N.Pitches, Run.Value, color=Type)) +
  geom_smooth(size=2) +
  geom_point(size=4) +
  geom_hline(yintercept=0, size=2, color="red") +
  labs(title="Run Values: ALL PA, No SO, and InPlay")
It is clear from this graph that the big effects for the number of pitches in the PA are primarily due to strikeouts and walks. But these graphs confirm to some extent what we initially thought. There is a small advantage to putting the ball in play on the first pitch, and there are small advantages to lengthening the PA.
There is certainly more to be done here. For example, it would be interesting to look at the distribution of number of pitches for specific batters. Likewise, it may be advantageous to put the first ball in play for specific pitchers. Also, I wonder about how this distribution has changed across seasons. I’ll let the interested reader do his or her analysis.
Interesting data. But it seems plausible, if not likely, that the hitters who manage to extend PAs to 8 or more pitches are above-average when it comes to balls in play. It’s even possible that the pitchers are below average. So before concluding that there is any advantage in lengthening the PA, you would need to control for pitcher and hitter quality in this pool.
It also seems likely that the first pitch result could be a function of hitter quality.
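One simple way to explore this concern is to subtract each batter's own mean run value before averaging by pitch count. The sketch below is only an illustration of that idea on invented numbers: the data frame `toy`, its values, and the use of a `BAT_ID` column as the batter identifier are all assumptions, and this is not an analysis from the post.

```r
# Invented data: batter "a" is a better hitter and also sees more long PA's
toy <- data.frame(
  BAT_ID      = c("a", "a", "a", "b", "b", "b"),
  N.Pitches.C = c(2, 8, 8, 2, 2, 8),
  RUNS.VALUE  = c(0.3, 0.5, 0.4, -0.2, -0.1, 0.0)
)

# Raw mean run value by pitch count (what the graphs above show)
raw_means <- as.vector(tapply(toy$RUNS.VALUE, toy$N.Pitches.C, mean))

# Center each run value on that batter's own average
toy$Adj <- toy$RUNS.VALUE - ave(toy$RUNS.VALUE, toy$BAT_ID)

# Mean of the batter-adjusted values by pitch count
adj_means <- as.vector(tapply(toy$Adj, toy$N.Pitches.C, mean))

print(data.frame(N.Pitches.C = c(2, 8), raw_means, adj_means))
```

In this toy example the gap between short and long PA's shrinks after centering, because part of the raw gap comes from which batter tends to have the long PA's. A fuller treatment would also adjust for the pitcher, for example with a regression or a mixed model.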