If you have been watching the National League and American League Championship Series the last few days, you have noticed that these games have tended to be low-scoring affairs. Since I have been thinking about ball-strike counts recently, I wondered if these recent games have had unusual distributions of ball-strike counts. Specifically, we know that counts can either favor the pitcher (think 0-2) or the batter (think 3-1). One explanation for low-scoring games is that batters tend to face more pitcher counts and fewer batter counts. Is there any evidence that this conjecture is true?
Some Baseline Data
To get a handle on a typical distribution of ball-strike counts, I collected PitchFX data from all games in April 2016 — a graph of the proportions in different counts is shown below.
Honestly, this graph doesn’t tell us much. Obviously we have a lot of 0-0 counts, followed by 0-1, 1-0, 1-1, 1-2, etc. Some counts are rare, such as 3-0 and 3-1.
Working with Proportions
We are interested in seeing how the distribution of ball-strike counts in the 2016 playoffs differs from the typical distribution shown above. The problem is that we will be comparing some large proportions (for 0-1 and 1-0 counts) between the 2016 playoffs and the 2016 regular season, and also comparing some small proportions (for 3-0 and 3-1 counts) between the two groups. There is a variation issue: small fractions tend to be less variable than large fractions. (For example, the proportions of voters for Donald Trump across different states tend to be more variable, that is, show more spread, than the proportions of voters for Gary Johnson across different states.) We need to put proportions on a better scale so that the variability in small proportions and the variability in large proportions are similar. From this comparison perspective, it is better to reexpress a proportion as a logit:
logit = log(proportion) – log(1 – proportion)
So when I compute proportions, say the proportion of counts that are 0-2 in two different scenarios, I will convert each proportion to a logit, and then make any comparisons on the logit scale.
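To make the reexpression concrete, here is a minimal Python sketch of the logit formula above. The function name and the four proportions are illustrative choices, not values from the data. It compares the same raw gap of 0.01 at a small proportion and at a large one.

```python
import math

def logit(p):
    """Return log(p) - log(1 - p) for a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must be strictly between 0 and 1")
    return math.log(p) - math.log(1.0 - p)

# Illustrative proportions (assumed for this example, not taken from the data):
# the same raw gap of 0.01, once near zero and once near one quarter.
small_gap = logit(0.03) - logit(0.02)
large_gap = logit(0.26) - logit(0.25)

print(f"logit gap for 0.02 vs 0.03: {small_gap:.3f}")
print(f"logit gap for 0.25 vs 0.26: {large_gap:.3f}")
```

On the logit scale the gap between the two small proportions comes out several times larger than the gap between the two large ones, so a change in a rare count such as 3-0 is not swamped by changes in a common count such as 0-0.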
Obtaining Data from the 2016 Playoffs
Okay, so the plan is to collect the proportions of each possible ball-strike count, convert all of the proportions to logits, and then compare logits for the 2016 Playoffs (that is, the recent 7 games in the NL and AL championship series) with the baseline logits for the 2016 season.
Here is an example of the calculations for AL Game 2. For each ball-strike count, I have (1) the number of times the ball-strike count occurred (Count), (2) the proportion in this game (Proportion), and (3) the overall proportion based on the April 2016 data (Overall). The logit and overall_logit columns convert the two proportions to logits, and the Difference column contains the difference logit – overall_logit.
   PitchCount Count Proportion      Game Overall     Type  logit overall_logit Difference
1         c00    62      0.235 AL_Game_2   0.260  Neutral -1.181        -1.047     -0.135
2         c01    28      0.106 AL_Game_2   0.126 Pitchers -2.132        -1.935     -0.196
3         c02    13      0.049 AL_Game_2   0.063 Pitchers -2.961        -2.701     -0.260
4         c10    26      0.098 AL_Game_2   0.103  Hitters -2.214        -2.163     -0.051
5         c11    26      0.098 AL_Game_2   0.102  Neutral -2.214        -2.180     -0.034
6         c12    33      0.125 AL_Game_2   0.092 Pitchers -1.946        -2.288      0.343
7         c20     8      0.030 AL_Game_2   0.036  Hitters -3.466        -3.276     -0.189
8         c21    13      0.049 AL_Game_2   0.054  Neutral -2.961        -2.865     -0.096
9         c22    28      0.106 AL_Game_2   0.079 Pitchers -2.132        -2.451      0.319
10        c30     3      0.011 AL_Game_2   0.012  Hitters -4.466        -4.373     -0.093
11        c31     7      0.027 AL_Game_2   0.023  Hitters -3.603        -3.757      0.154
12        c32    17      0.064 AL_Game_2   0.049  Neutral -2.676        -2.959      0.283
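The columns of this table can be reproduced from the counts alone. The following Python sketch is one way to do it, not necessarily how the original calculation was done. It uses the Count and Overall values printed above. Because the Overall proportions are rounded to three decimals, the computed overall_logit and Difference values may differ from the table in the third decimal place.

```python
import math

# Number of times each ball-strike count occurred in AL Game 2 (Count column).
counts = {
    "c00": 62, "c01": 28, "c02": 13, "c10": 26, "c11": 26, "c12": 33,
    "c20": 8, "c21": 13, "c22": 28, "c30": 3, "c31": 7, "c32": 17,
}

# April 2016 baseline proportions as printed in the Overall column (rounded).
overall = {
    "c00": 0.260, "c01": 0.126, "c02": 0.063, "c10": 0.103, "c11": 0.102,
    "c12": 0.092, "c20": 0.036, "c21": 0.054, "c22": 0.079, "c30": 0.012,
    "c31": 0.023, "c32": 0.049,
}

def logit(p):
    return math.log(p) - math.log(1.0 - p)

total = sum(counts.values())

# For each count: (proportion, logit, overall_logit, difference).
rows = {}
for key, n in counts.items():
    prop = n / total
    rows[key] = (prop, logit(prop), logit(overall[key]),
                 logit(prop) - logit(overall[key]))

for key, (prop, lg, base_lg, diff) in rows.items():
    print(f"{key} {prop:.3f} {lg:7.3f} {base_lg:7.3f} {diff:7.3f}")
```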
The graph below summarizes the results, broken down by the 7 playoff games in the championship series. The vertical line at zero represents no change in logit from the regular season to the 2016 playoffs. Points to the right of the line represent higher proportions than baseline in the playoffs, and points to the left represent lower proportions than baseline in the playoffs. I have colored the points by the type of count: hitter’s count, pitcher’s count, or neutral.
It is hard to take in all of these pitch counts at once, and some of the comparisons may not be meaningful because of small sample sizes. To make the comparison clearer, it may be better to aggregate the data by the type of count; here is the graph below.
- Let’s take a game which was clearly pitching dominated, say AL Game 2, which was won by Cleveland 2-1 (there were only 7 hits in the game). Looking at the graph (top right), there was a high (relative to normal) fraction of pitcher counts and low fractions of neutral and hitter counts. So this makes sense.
- Let’s move to NL Game 2, which was also pitching dominated (won by the Dodgers 1-0 with a total of 5 hits). Interestingly, in this game (middle left) there was a high proportion of hitter counts and a low proportion of pitcher counts. I can check my work, but maybe this means that the hitters were poor at taking advantage of hitter’s counts in this particular game.
- In the most recent AL Game 4 (won by Toronto), there was a high number of hitter counts; maybe Toronto was good at taking advantage of these favorable counts, as they did score some runs.
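The aggregation by type of count can be sketched in Python for AL Game 2, using the Count, Overall, and Type columns from the table. This assumes that aggregating means summing the counts (and the baseline proportions) within each type before taking logits; the original analysis may have aggregated differently.

```python
import math
from collections import defaultdict

# (ball-strike count, game count, rounded baseline proportion, type)
# copied from the AL Game 2 table.
rows = [
    ("c00", 62, 0.260, "Neutral"), ("c01", 28, 0.126, "Pitchers"),
    ("c02", 13, 0.063, "Pitchers"), ("c10", 26, 0.103, "Hitters"),
    ("c11", 26, 0.102, "Neutral"), ("c12", 33, 0.092, "Pitchers"),
    ("c20", 8, 0.036, "Hitters"), ("c21", 13, 0.054, "Neutral"),
    ("c22", 28, 0.079, "Pitchers"), ("c30", 3, 0.012, "Hitters"),
    ("c31", 7, 0.023, "Hitters"), ("c32", 17, 0.049, "Neutral"),
]

def logit(p):
    return math.log(p) - math.log(1.0 - p)

total = sum(n for _, n, _, _ in rows)

# Assumption: pool counts and baseline proportions within each type.
game_count = defaultdict(int)
baseline_prop = defaultdict(float)
for _, n, overall, kind in rows:
    game_count[kind] += n
    baseline_prop[kind] += overall

# Difference in logits (game minus baseline) for each type of count.
diff_by_type = {
    kind: logit(game_count[kind] / total) - logit(baseline_prop[kind])
    for kind in game_count
}

for kind, diff in sorted(diff_by_type.items()):
    print(f"{kind:8s} {diff:+.3f}")
```

Under this assumption the pitcher-count difference is positive and the hitter and neutral differences are negative, which agrees with the reading of AL Game 2 in the first bullet.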
This post was motivated by several thoughts: (1) it seemed like an interesting question to look at, and (2) I can use this example to describe the benefit of reexpressing proportions as logits when making comparisons. (I am currently teaching an exploratory data analysis course where reexpression is one of the major themes.)
From a hitter’s perspective, there are two issues. First, he would like the pitch count in his favor, since he may then get pitches that are easier to hit. Second, he needs to take advantage of the favorable pitch counts. We’ve explored only the first of these issues here. A more complete study would look at both the pitch count distribution and how the hitters take advantage (or don’t take advantage) of these pitch-count situations.