Graphing Pitch Count Effects – Part III

There seemed to be some interest in the recent graphs I presented that show the run values of PAs that pass through particular pitch counts. I’ll explain the R work that I did to create these graphs and then present an enhanced version of these plots that shows the percentage of PAs that pass through each pitch count. (Thanks to Trevor, who suggested this enhancement to the graph.)

R work

  • I start with the Retrosheet play-by-play dataset for the entire 2015 season and add a variable RUNS.VALUE that gives the run value of each PA. (See this post for a description of R code to compute these run expectancies.) I only consider batting events, that is, rows where BAT_EVENT_FL == TRUE.
  • I create a character variable pseq (from the Retrosheet variable PITCH_SEQ_TX) that shows the ball-strike sequence of the PA.
  • Similar to what Max illustrated in Chapter 7 of our book, I created variables c01, c10, c02, c11, c20, c12, c21, c30, c22, c31, c32 that indicate (1 or 0) whether the PA passed through that particular pitch count. (I wrote a function that creates these variables for a single string and then applied it with rbind to all rows of the data frame. This is very slow; I’m sure there is a quicker way, such as the clever use of regular expressions.)
  • With these extra variables added to the Retrosheet data frame, I wrote two functions. The first, count_plot, constructs a basic version of this graph for a specific player. The second, count_plot_e, constructs the enhanced graph where the line weights indicate the percentage of PAs in different counts.
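
A minimal sketch of one quicker alternative to the row-by-row rbind approach: walk each pitch string once, track balls and strikes, and collect the indicators with vapply. This is a plain state walk rather than a regular-expression solution, and it is not the code behind the graphs. The helper names counts_visited and count_indicators are hypothetical, and the grouping of Retrosheet pitch codes into balls, strikes and fouls is an assumption that should be checked against the Retrosheet documentation.

```r
# Sketch only: helper names and pitch-code groupings are assumptions.
ball_codes   <- c("B", "I", "P", "V")                      # add a ball
strike_codes <- c("C", "S", "K", "M", "Q", "L", "O", "T")  # always add a strike
foul_codes   <- c("F", "R")                                # add a strike only below two strikes
all_counts   <- c("0-1", "1-0", "0-2", "1-1", "2-0", "1-2",
                  "2-1", "3-0", "2-2", "3-1", "3-2")

# Return the ball-strike counts that one pitch sequence passes through.
counts_visited <- function(pseq) {
  balls <- 0
  strikes <- 0
  visited <- "0-0"
  for (p in strsplit(pseq, "")[[1]]) {
    if (p %in% ball_codes) {
      balls <- balls + 1
    } else if (p %in% strike_codes || (p %in% foul_codes && strikes < 2)) {
      strikes <- strikes + 1
    }
    if (balls > 3 || strikes > 2) break  # a walk or strikeout ends the PA
    visited <- c(visited, paste(balls, strikes, sep = "-"))
  }
  unique(visited)  # pitches that leave the count unchanged add nothing new
}

# One row per PA, one 0/1 column per count (c01, c10, ..., c32).
count_indicators <- function(pseq) {
  m <- vapply(pseq,
              function(s) as.integer(all_counts %in% counts_visited(s)),
              integer(length(all_counts)),
              USE.NAMES = FALSE)
  out <- t(m)
  colnames(out) <- paste0("c", sub("-", "", all_counts))
  out
}

# Remove characters assumed to be non-pitch markers before walking the string.
pseq <- gsub("[.>123N+*]", "", c("BCFFBX", "CSS", "BBBB", "1B.CX"))
count_indicators(pseq)
```

The resulting matrix can be attached to the play-by-play data frame with cbind. Building it in one vapply call avoids growing a data frame one row at a time.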

The Two Plotting Functions

The two plotting functions can be found at my gist site. Each function takes two required arguments: the Retrosheet play-by-play data frame with the extra variables, and the player’s name in quotes. By default, type = "p" assumes the player is a pitcher; to graph a batter, use type = "b". The output is a data frame that gives the mean run value for each count and the percentage of PAs that passed through that count.

Here is the basic plot for Kershaw.

(S <- count_plot(d2015, "Clayton Kershaw"))

   Count        Runs N.Pitches          P balls strikes
   (chr)       (dbl)     (dbl)      (dbl) (dbl)   (dbl)
1    0-0 -0.05717463         0 100.000000     0       0
2    0-1 -0.09550390         1  54.382022     0       1
3    0-2 -0.12880384         2  26.067416     0       2
4    1-0 -0.03725227         1  32.134831     1       0
5    1-1 -0.05994306         2  39.550562     1       1
6    1-2 -0.11271738         3  34.044944     1       2
7    2-0  0.00222083         2   8.988764     2       0
8    2-1 -0.02141827         3  16.629213     2       1
9    2-2 -0.11576039         4  23.707865     2       2
10   3-0  0.10855735         3   2.471910     3       0
11   3-1  0.14949875         4   5.955056     3       1
12   3-2  0.03399738         5   8.988764     3       2
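
A table with these columns could be computed along the following lines. This is a minimal sketch, not the gist code: count_summary is a hypothetical name, and the data frame d is assumed to be already filtered to one player's PAs, with a RUNS.VALUE column and the 0/1 indicator columns c01 through c32.

```r
# Sketch only: count_summary is a hypothetical helper, not the gist function.
# Assumes `d` has one row per PA with RUNS.VALUE and 0/1 columns c01, ..., c32.
count_summary <- function(d) {
  counts <- c("0-1", "0-2", "1-0", "1-1", "1-2", "2-0",
              "2-1", "2-2", "3-0", "3-1", "3-2")
  vars <- paste0("c", sub("-", "", counts))
  balls <- as.numeric(substr(counts, 1, 1))
  strikes <- as.numeric(substr(counts, 3, 3))
  data.frame(
    Count = c("0-0", counts),
    # every PA starts at 0-0, so that row is the overall mean run value
    Runs = c(mean(d$RUNS.VALUE),
             vapply(vars, function(v) mean(d$RUNS.VALUE[d[[v]] == 1]), numeric(1))),
    N.Pitches = c(0, balls + strikes),
    # percentage of PAs that passed through each count
    P = c(100, vapply(vars, function(v) 100 * mean(d[[v]] == 1), numeric(1))),
    balls = c(0, balls),
    strikes = c(0, strikes),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data (made up for illustration): four PAs that each see one pitch.
toy <- data.frame(RUNS.VALUE = c(-0.25, 0.50, -0.25, 0.25))
toy[paste0("c", c("01", "02", "10", "11", "12", "20",
                  "21", "22", "30", "31", "32"))] <- 0
toy$c01 <- c(1, 0, 1, 0)
toy$c10 <- c(0, 1, 0, 1)
count_summary(toy)
```

Counts that no PA reaches get a Runs value of NaN in this sketch, since the mean of an empty set is undefined.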


Here is the enhanced plot where the line weight is proportional to the percentage of time that the PA is in that particular count.

S <- count_plot_e(d2015, "Clayton Kershaw")


This graph shows clearly that Kershaw is generally in a pitcher’s count: the lines to the 3-0 and 3-1 counts are relatively thin, since only about 2.5% and 6% of his PAs reach those counts. Kershaw’s pitching success is partially due to the fact that he is generally ahead in the count.
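
The line-weight idea can be sketched in base R graphics using the Kershaw values from the table above. This is not the gist's count_plot_e: the helper count_edges is hypothetical, the axis labels are placeholders, and tying each segment's weight to the percentage for the count at its end is an assumption about how the weights are chosen.

```r
# Values copied from the Kershaw table in this post (rounded).
S <- data.frame(
  balls   = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  strikes = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2),
  Runs    = c(-0.057, -0.096, -0.129, -0.037, -0.060, -0.113,
               0.002, -0.021, -0.116,  0.109,  0.149,  0.034),
  P       = c(100, 54.38, 26.07, 32.13, 39.55, 34.04,
              8.99, 16.63, 23.71, 2.47, 5.96, 8.99)
)
S$N.Pitches <- S$balls + S$strikes

# Hypothetical helper: one row per transition between adjacent counts.
count_edges <- function(S, max_lwd = 8) {
  step <- function(db, ds) {
    to <- match(paste(S$balls + db, S$strikes + ds),
                paste(S$balls, S$strikes))
    keep <- !is.na(to)  # drop steps that would end the PA
    data.frame(from = which(keep), to = to[keep])
  }
  e <- rbind(step(1, 0), step(0, 1))  # one more ball, one more strike
  # Assumption: weight reflects how often PAs reach the segment's end count.
  e$lwd <- max_lwd * S$P[e$to] / 100
  e
}

edges <- count_edges(S)
plot(S$N.Pitches, S$Runs, type = "n",
     xlab = "Pitch Number", ylab = "Runs Value")
segments(S$N.Pitches[edges$from], S$Runs[edges$from],
         S$N.Pitches[edges$to], S$Runs[edges$to], lwd = edges$lwd)
text(S$N.Pitches, S$Runs, paste(S$balls, S$strikes, sep = "-"))
abline(h = 0, lty = 2)  # reference line at zero runs
```

With this scaling the segment into 3-0 is far thinner than the segment into 0-1, which is the pattern described above.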

2 responses

  1. I really like the enhanced plot you created and have been thinking about how to apply something similar to my role in healthcare. I did take this code and graph and added an animation loop (using the animation package and HTML output) around it that pulls in all of Kershaw’s seasonal graphs since his debut in the majors. It is really interesting to see over time how this polygon has changed in shape and position relative to the 0.0 runs value line. And it would be interesting to apply to other pitchers/hitters over time to see the career trajectory over development, peak and decline.

  2. I modified the code to create these graphs for a team by season and it overlays 2 images – one for offense and the 2nd for defense/pitching. Applying it to the Phillies over the 2000-2015 time frame is interesting. You really see the separation in the 2010 & 2011 seasons with the offense polygon shifting higher than the defense polygon as they had the best record in the majors both years. And that separation is more noticeable in certain counts. Sadly, the opposite effect is observed in the past 2 seasons.
