There seemed to be some interest in the recent graphs I presented that show the run values of PA’s that pass through particular pitch counts. I’ll explain the R work that I did to create these graphs and then present an enhanced version of these plots that shows the percentage of time a PA will be in different pitch counts. (Thanks to Trevor who suggested this enhancement to the graph.)
- I start with the Retrosheet play-by-play dataset for the entire 2015 season and add a variable RUNS.VALUE that gives the run value for each PA. (See this post for a description of R code to compute these runs expectancies.) I only consider batting events, where BAT_EVENT_FL == TRUE.
- I create a character variable pseq (from the Retrosheet variable PITCH_SEQ_TX) that shows the ball-strike sequence of the PA.
- Similar to what Max illustrated in Chapter 7 in our book, I created variables c01, c10, c02, c11, c20, c12, c21, c22, c31, c32 that indicate (1 or 0) if the PA had that particular pitch count. (I wrote a function that creates these variables for a single string and then used rbind to do this for all rows of the data frame. This is very slow; I’m sure there is a quicker way, such as the clever use of regular expressions.)
- Now that I have the Retrosheet data frame with these extra variables, I wrote a couple of functions. One function, count_plot, constructs a basic version of this graph for a specific player. A second function, count_plot_e, constructs the enhanced graph where the line weights indicate the percentage of PA’s in different counts.
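A quicker alternative to growing a data frame with rbind might look like the sketch below. It is not the function used for this post: counts_visited is a hypothetical helper, the grouping of Retrosheet pitch codes into balls and strikes is an assumption, and c30 is included only because the output table further down reports a 3-0 row. The helper walks through one pitch sequence, tracks balls and strikes, and flags every count the PA passes through; vapply then fills the whole indicator matrix in one pass.

```r
# Assumed grouping of Retrosheet pitch codes; all other characters
# (balls in play, pickoff throws, ".", ">", "+", "*", ...) are skipped.
ball_codes   <- c("B", "I", "P", "V")
strike_codes <- c("C", "F", "K", "L", "M", "O", "Q", "R", "S", "T")

# Every count other than 0-0, named like the indicator variables in the post.
count_names <- c("c01", "c10", "c02", "c11", "c20", "c12",
                 "c21", "c22", "c30", "c31", "c32")

# Hypothetical helper: 1/0 flags for the counts one pitch sequence visits.
counts_visited <- function(pseq) {
  visited <- setNames(integer(length(count_names)), count_names)
  balls <- 0L
  strikes <- 0L
  for (p in strsplit(pseq, "")[[1]]) {
    if (p %in% ball_codes) {
      # Ball four ends the PA, so the count never moves past three balls.
      balls <- min(balls + 1L, 3L)
    } else if (p %in% strike_codes) {
      # With two strikes, a strike is either a foul (count unchanged)
      # or strike three (PA over), so strikes are capped at two.
      strikes <- min(strikes + 1L, 2L)
    } else {
      next
    }
    visited[paste0("c", balls, strikes)] <- 1L
  }
  visited
}

# Toy pitch sequences for illustration only, not real 2015 data.
toy <- data.frame(PITCH_SEQ_TX = c("CBFX", "BBBB", "CSS", "B1BC>X"),
                  stringsAsFactors = FALSE)

# One row per PA, one column per count.
flags <- t(vapply(toy$PITCH_SEQ_TX, counts_visited,
                  integer(length(count_names)), USE.NAMES = FALSE))
colnames(flags) <- count_names
toy <- cbind(toy, flags)
```

Because the loop skips every character that is not a ball or strike code, this sketch reads PITCH_SEQ_TX directly and does not need the cleaned pseq string.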
The Two Plotting Functions
The two plotting functions can be found at my gist site. Each function takes two arguments: the Retrosheet play-by-play data frame with the extra variables, and the player’s name in quotes. By default, type = "p" assumes the player is a pitcher; use type = "b" to graph a batter. Each function returns a data frame that displays the mean run value for each count and the percentage of PA’s that passed through that count.
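To make the returned summary concrete, here is a minimal sketch of how the mean run value and the percentage of PA’s could be computed for each count. The helper summarize_counts is hypothetical and is not the code in the gist. It assumes the data frame has already been restricted to one player, since the name column the gist functions filter on is not shown here.

```r
# Hypothetical helper (not the gist's count_plot): for each 1/0 count column,
# report the mean RUNS.VALUE of the PAs that reached the count and the
# percentage of all PAs that reached it.
summarize_counts <- function(d, count_vars) {
  data.frame(
    Count = sub("^c([0-9])([0-9])$", "\\1-\\2", count_vars),
    Runs  = vapply(count_vars,
                   function(v) mean(d$RUNS.VALUE[d[[v]] == 1]),
                   numeric(1)),
    P     = vapply(count_vars,
                   function(v) 100 * mean(d[[v]] == 1),
                   numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data for illustration only, not real 2015 values.
toy_pa <- data.frame(RUNS.VALUE = c(0.5, -0.2, -0.3, 0.1),
                     c01 = c(1, 0, 1, 0),
                     c10 = c(0, 1, 0, 1))
S_toy <- summarize_counts(toy_pa, c("c01", "c10"))
```

The 0-0 row of the real output needs no indicator column: every PA starts there, so its run value is the overall mean and its percentage is 100.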
Here is the basic plot for Kershaw.
    library(devtools)
    source_gist("d1c3e86ec09eb4895befd814de2699b5")
    (S <- count_plot(d2015, "Clayton Kershaw"))

       Count        Runs N.Pitches          P balls strikes
       (chr)       (dbl)     (dbl)      (dbl) (dbl)   (dbl)
    1    0-0 -0.05717463         0 100.000000     0       0
    2    0-1 -0.09550390         1  54.382022     0       1
    3    0-2 -0.12880384         2  26.067416     0       2
    4    1-0 -0.03725227         1  32.134831     1       0
    5    1-1 -0.05994306         2  39.550562     1       1
    6    1-2 -0.11271738         3  34.044944     1       2
    7    2-0  0.00222083         2   8.988764     2       0
    8    2-1 -0.02141827         3  16.629213     2       1
    9    2-2 -0.11576039         4  23.707865     2       2
    10   3-0  0.10855735         3   2.471910     3       0
    11   3-1  0.14949875         4   5.955056     3       1
    12   3-2  0.03399738         5   8.988764     3       2
Here is the enhanced plot where the line weight is proportional to the percentage of time that the PA is in that particular count.
    S <- count_plot_e(d2015, "Clayton Kershaw")
This graph clearly shows that Kershaw is generally in a pitcher’s count: the lines to the 3-0 and 3-1 counts are relatively thin, since only about 2.5% and 6.0% of his PA’s reach those counts. Part of Kershaw’s pitching success comes from staying ahead in the count.
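The line-weight idea can be sketched in base R graphics. This is not the code behind count_plot_e: the axis choice (pitch number against mean run value), the rule that each line's width reflects the percentage of PA’s reaching the count it leads to, and the scaling constants are all assumptions made for illustration. The data are the Kershaw values from the table above.

```r
# Kershaw's summary values, copied from the output table in this post.
S <- data.frame(
  Count   = c("0-0", "0-1", "0-2", "1-0", "1-1", "1-2",
              "2-0", "2-1", "2-2", "3-0", "3-1", "3-2"),
  Runs    = c(-0.05717463, -0.09550390, -0.12880384, -0.03725227,
              -0.05994306, -0.11271738, 0.00222083, -0.02141827,
              -0.11576039, 0.10855735, 0.14949875, 0.03399738),
  P       = c(100, 54.382022, 26.067416, 32.134831, 39.550562, 34.044944,
              8.988764, 16.629213, 23.707865, 2.471910, 5.955056, 8.988764),
  balls   = rep(0:3, each = 3),
  strikes = rep(0:2, times = 4),
  stringsAsFactors = FALSE
)
S$N.Pitches <- S$balls + S$strikes

# One edge from each count to the counts reachable with one more ball or strike.
edges <- do.call(rbind, lapply(seq_len(nrow(S)), function(i) {
  to <- which((S$balls == S$balls[i] + 1 & S$strikes == S$strikes[i]) |
              (S$balls == S$balls[i] & S$strikes == S$strikes[i] + 1))
  if (length(to) == 0) return(NULL)  # 3-2 has no later count
  data.frame(x0 = S$N.Pitches[i], y0 = S$Runs[i],
             x1 = S$N.Pitches[to], y1 = S$Runs[to],
             to = S$Count[to], P = S$P[to],
             stringsAsFactors = FALSE)
}))

# Assumed scaling: width grows linearly with the destination count's percentage.
edges$lwd <- 0.5 + 8 * edges$P / 100

plot(S$N.Pitches, S$Runs, type = "n",
     xlab = "Pitch number", ylab = "Mean run value")
segments(edges$x0, edges$y0, edges$x1, edges$y1, lwd = edges$lwd)
text(S$N.Pitches, S$Runs, labels = S$Count)
```

With this scaling, the single line into 3-0 is the thinnest on the plot, which matches the reading of the enhanced graph given above.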