This is the last part of our Welcome to R series. Part 4 of this series focused on the use of the data.table package for selecting rows and columns of data frames. Part 3 illustrated the use of some base R graphics functions like hist(), plot(), etc. Here I will introduce the ggplot2 package, which is a system for developing high-quality graphics. It is especially nice when you want to make comparisons. I’ll introduce ggplot2 within a baseball context, specifically comparing the pitches of two starting pitchers in the 2017 season.
Comparing Two 2017 Phillies Starters
If you look over the 2017 Phillies pitching statistics on Baseball-Reference, you will notice two 24-year-old pitchers, Aaron Nola and Nick Pivetta, who had very different pitching performances this particular season. Nola had a 12-11 record with a 3.54 ERA, and Pivetta was 8-10 with a 6.02 ERA. Let’s use Statcast data to look more carefully at the two pitchers’ stats at the pitch level, perhaps gaining some insight into why Nola was so much better than Pivetta during the 2017 season.
We are going to use the same Statcast data that we used in Part 4 of this series. Using the fread() function from the data.table package, I read the data from my GitHub site and save it in the data table (also a data frame) sc.
library(ggplot2)
library(data.table)
sc <- fread("https://raw.githubusercontent.com/bayesball/ABWRdata/master/data/statcast2017.txt")
The Statcast ids for these two pitchers are 601713 and 605400. (I found these numbers by googling the player names together with statcast.) Using data.table syntax, I create a new data.table called two_pitchers containing all of the pitches for these two pitchers. It is convenient to create a new variable Pitcher containing the pitcher names. Also I collect the following variables:
- pitch_type — code for the type of pitch
- plate_x, plate_z — location of the pitch as it crosses the zone
- type — indicates if the pitch is a ball, a strike, or in-play
- woba_value — for balls put in play, this gives the woba weight for that event (0 for out, 0.9 for a single, 1.25 for a double, and so on)
- release_speed — speed of the pitch (in mph) when released
- pfx_x, pfx_z — the horizontal and vertical movement (in inches) of the pitch
two_pitchers <- sc[pitcher %in% c(601713, 605400)]
two_pitchers <- two_pitchers[, Pitcher := ifelse(pitcher == 601713,
                                                 "Nick Pivetta", "Aaron Nola")]
two_pitchers <- two_pitchers[, .(Pitcher, pitch_type, plate_x, plate_z, type,
                                 woba_value, release_speed, pfx_x, pfx_z)]
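To see what each of these three data.table steps does, here is a minimal sketch on a tiny made-up table. The rows in toy are invented for illustration and are not Statcast data, and the id 999999 is a placeholder standing in for some other pitcher.

```r
library(data.table)

# Invented rows for illustration only (not real Statcast data)
toy <- data.table(
  pitcher = c(601713, 605400, 999999, 605400),
  pitch_type = c("FF", "CU", "SL", "FF"),
  release_speed = c(94.1, 77.8, 85.0, 92.3)
)

# Step 1: keep only the rows for the two pitchers of interest
two <- toy[pitcher %in% c(601713, 605400)]

# Step 2: add a name column; := modifies the table in place
two[, Pitcher := ifelse(pitcher == 601713, "Nick Pivetta", "Aaron Nola")]

# Step 3: keep only the columns needed for the plots
two <- two[, .(Pitcher, pitch_type, release_speed)]
print(two)
```

Because := works by reference, the second step does not strictly need the result assigned back to a name, although assigning it (as in the code above this sketch) is harmless.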
Let’s illustrate using ggplot2 to graph from a small dataset. To understand what pitches the two pitchers throw, I construct a frequency table of pitch_type for each pitcher.
(S <- two_pitchers[, .N, by = .(Pitcher, pitch_type)])
        Pitcher pitch_type   N
1: Nick Pivetta         FF 209
2: Nick Pivetta         SL  37
3:   Aaron Nola         CU 110
4:   Aaron Nola         FF 126
5:   Aaron Nola         CH  59
6:   Aaron Nola         FT  91
7: Nick Pivetta         FT  43
8: Nick Pivetta         CU  79
9: Nick Pivetta         CH  15
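Since the two pitchers have slightly different pitch totals in this table, it can help to look at shares as well as raw counts. The sketch below rebuilds S from the counts printed above so that it runs on its own; the prop column is a name I am introducing for illustration.

```r
library(data.table)

# Counts copied from the frequency table S printed above
S <- data.table(
  Pitcher = c("Nick Pivetta", "Nick Pivetta", "Aaron Nola", "Aaron Nola",
              "Aaron Nola", "Aaron Nola", "Nick Pivetta", "Nick Pivetta",
              "Nick Pivetta"),
  pitch_type = c("FF", "SL", "CU", "FF", "CH", "FT", "FT", "CU", "CH"),
  N = c(209, 37, 110, 126, 59, 91, 43, 79, 15)
)

# Share of each pitch type within each pitcher's total
S[, prop := round(N / sum(N), 3), by = Pitcher]

# Show each pitcher's pitches from most to least frequent
print(S[order(Pitcher, -prop)])
```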
Starting with ggplot2
To use ggplot2, you need to start with a data frame like S. Next, you have to assign aesthetics or roles to particular variables that you wish to graph. Suppose we want to construct a bar graph of these counts for each pitcher. The x variable (the one assigned to the horizontal axis) will be Pitcher and the y variable (the one assigned to the vertical axis) will be N. I want separate bars for each type of pitch, so I will let the fill variable (that is, the color that fills the bar) be pitch_type. The first step of a ggplot2 graph is to type a call of the form ggplot(df, aes(...)), where df is a data frame containing the data, and aes() contains all of the aesthetics or roles assigned to the different variables.
Last, we add a geometric object or geom that actually plots the bar or point or something else. Here we use the geom_bar() function, which plots bars, with two arguments: stat = "identity" indicates that the height of each bar should be the count (N) in the data frame, and position = "dodge" says that the bars for the different pitch types should be placed side by side rather than stacked.
Okay, here is the syntax and corresponding graph.
ggplot(S, aes(x = Pitcher, y = N, fill = pitch_type)) +
  geom_bar(stat = "identity", position = "dodge")
What is interesting is that Nola uses four pitches (changeup, curve ball, four-seamer and two-seamer) with similar frequencies. In contrast, Pivetta is primarily a four-seam pitcher with some curve balls, two-seamers, changeups and sliders.
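As a side note, ggplot2 also provides geom_col(), which takes bar heights directly from the y variable and so plays the same role as geom_bar(stat = "identity"). The sketch below rebuilds S from the counts in the frequency table so that it runs on its own; the axis and legend labels are my own additions.

```r
library(ggplot2)

# Counts copied from the frequency table S shown earlier
S <- data.frame(
  Pitcher = c("Nick Pivetta", "Nick Pivetta", "Aaron Nola", "Aaron Nola",
              "Aaron Nola", "Aaron Nola", "Nick Pivetta", "Nick Pivetta",
              "Nick Pivetta"),
  pitch_type = c("FF", "SL", "CU", "FF", "CH", "FT", "FT", "CU", "CH"),
  N = c(209, 37, 110, 126, 59, 91, 43, 79, 15)
)

# geom_col() uses the y values as bar heights, so no stat argument is needed
p <- ggplot(S, aes(x = Pitcher, y = N, fill = pitch_type)) +
  geom_col(position = "dodge") +
  labs(x = "Pitcher", y = "Number of pitches", fill = "Pitch type")
p
```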
I can choose to use different geometric objects. For example, I can choose to plot points instead of bars using the geom_point() function. I prefer bars in this particular case, but it might be good to plot points sometimes.
ggplot(S, aes(x = Pitcher, y = N, color = pitch_type)) +
  geom_point()
Suppose we wish to compare the release speeds of pitches of Nola and Pivetta. Now, since different pitch types have different speeds, it makes sense to compare within pitch type — for example, compare the speeds of four-seamers, compare the speeds of curve balls, etc. In the following ggplot2 code …
- I want to exclude sliders since only Pivetta throws a slider.
- I want to plot the release_speed as the y-variable and the Pitcher on the x-variable.
- I plot points that are jittered to minimize overlapping.
- I want to construct different plots for each pitch type — these different subplots are called facets — and I create these individual plots by use of the facet_wrap() function.
ggplot(two_pitchers[pitch_type != "SL"], aes(Pitcher, release_speed)) +
  geom_jitter() +
  facet_wrap(~ pitch_type)
This graph is interesting with some takeaways. Pivetta throws his four-seamer (FF) faster than Nola on average. Nola has a slower curve ball than Pivetta. Also the speeds of Nola’s curve balls have smaller variation than those of Pivetta — this tells me that Nola has more control over his curve ball.
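Visual impressions like these can be checked against group summaries. Here is a minimal sketch of one way to do that with data.table: speed_summary() is a hypothetical helper of my own, and the toy rows are invented for illustration, so the printed numbers say nothing about Nola or Pivetta. Applying the same function to two_pitchers would give the means and standard deviations behind the graph.

```r
library(data.table)

# Hypothetical helper: count, mean, and standard deviation of release speed
# for each pitcher and pitch type, leaving out sliders as in the plot
speed_summary <- function(dt) {
  dt[pitch_type != "SL",
     .(n = .N,
       mean_speed = mean(release_speed),
       sd_speed = sd(release_speed)),
     by = .(Pitcher, pitch_type)]
}

# Invented rows for illustration only (not real Statcast data)
toy <- data.table(
  Pitcher = rep(c("A", "B"), each = 4),
  pitch_type = rep(c("FF", "FF", "CU", "CU"), 2),
  release_speed = c(92, 94, 77, 79, 95, 97, 78, 84)
)

res <- speed_summary(toy)
print(res)
```

A smaller sd_speed for one pitcher within a pitch type would correspond to the tighter cluster of points seen in that facet.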
We decided to plot points by use of the geom_jitter() function. We can instead use the geom_violin() function to construct violin plots for the groups of speed values. Basically, the violin plot is a density plot and its mirror image. The reader might prefer these graphs but I admit they might look funny at first.
ggplot(two_pitchers[pitch_type != "SL"], aes(Pitcher, release_speed)) +
  geom_violin() +
  facet_wrap(~ pitch_type)
Different pitches vary not only by release speed but in terms of horizontal and vertical movement which are measured by the pfx_x and pfx_z variables. In the below graph, the x variable is pfx_x, the y variable is pfx_z and I am using different facets for different pitch types. Also I represent the color of the point by the Pitcher variable — note that I use color = Pitcher which is another aesthetic that is passed along in the aes() function.
ggplot(two_pitchers[pitch_type != "SL"], aes(pfx_x, pfx_z, color = Pitcher)) +
  geom_point() +
  facet_wrap(~ pitch_type)
The size of the movement is different between the two pitchers. For example, look at the CU (curve ball) graph: Nola’s curve balls have much larger horizontal movement than Pivetta’s. Pivetta’s four-seamers (look at the FF graph) tend to have more vertical movement than Nola’s.
We can also use a scatterplot with assigned x and y variables to show pitch locations. Below I graph the pitch location (x variable is plate_x, y variable is plate_z) where the color of the point is Pitcher. Note that I’ve added the plate zone to the graph. I have a function add_zone() that adds a ggplot2 layer (the black square) to the graph. (One nice feature of ggplot2 is its layered approach to graphing.)
library(CalledStrike)
ggplot(two_pitchers[pitch_type != "SL"], aes(plate_x, plate_z)) +
  geom_point(aes(color = Pitcher)) +
  facet_wrap(~ pitch_type) +
  add_zone('black')
It may be hard to distinguish the red and green points above, so I’ll try another graph to compare the pitch locations. Let’s focus on four-seamers (FF) and curve balls (CU) and I use different facets for each combination of pitch_type and Pitcher.
ggplot(two_pitchers[pitch_type %in% c("FF", "CU")], aes(plate_x, plate_z)) +
  geom_point() +
  facet_grid(pitch_type ~ Pitcher) +
  add_zone('red')
Remember my comment about the consistency in the speed of Nola’s curve balls? Nola also seems to have better location on his curve balls. Nola’s curve balls tend to be low in the zone, while Pivetta’s curve balls tend to be all over the place.
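The add_zone() function used in these plots comes from the CalledStrike package. If that package is not installed, a comparable layer can be sketched with annotate(). In the sketch below, zone_layer() is a hypothetical stand-in of my own, the zone boundaries are assumed approximate values in feet rather than the ones CalledStrike uses, and the pitch locations are invented.

```r
library(ggplot2)

# Hypothetical stand-in for add_zone(); the boundaries are assumed
# approximate values in feet, not taken from the CalledStrike package
zone_layer <- function(color = "black") {
  annotate("rect",
           xmin = -0.85, xmax = 0.85, ymin = 1.5, ymax = 3.5,
           fill = NA, color = color)
}

# Invented pitch locations for illustration only
toy <- data.frame(
  plate_x = c(-1.2, -0.3, 0.1, 0.6, 1.4),
  plate_z = c(1.0, 2.2, 2.9, 3.3, 4.1),
  Pitcher = c("A", "B", "A", "B", "A")
)

# The zone is just one more layer added with +
p <- ggplot(toy, aes(plate_x, plate_z)) +
  geom_point(aes(color = Pitcher)) +
  zone_layer("black") +
  coord_fixed()
p
```

Because annotate() does not depend on the plot data, the same rectangle is drawn in every panel when the plot is faceted.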
Balls put in play
Next let’s focus on pitches put into play, which are indicated by type == "X" in our Statcast dataset. We’ll focus on four-seam fastballs and curve balls. For each pitcher and each pitch type, I plot the locations of the pitches where the color of the plotted point corresponds to the woba value of the pitch. In the code below, note that I set the aesthetic color to the character version of the woba_value. There are five possible values of woba_value (0, 0.9, 1.25, 1.6, and 2), corresponding respectively to out, single, double, triple and home run, and each value has a distinct color. For example, note that the two curve balls thrown by Pivetta in the middle of the plate were hit for home runs. This would suggest that the middle of the plate is not a desirable location for Pivetta’s curve ball.
ggplot(two_pitchers[pitch_type %in% c("FF", "CU") & type == "X"],
       aes(plate_x, plate_z)) +
  geom_point(aes(color = as.character(woba_value))) +
  facet_grid(pitch_type ~ Pitcher) +
  add_zone('black')
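One possible refinement is to show event names in the legend instead of the raw woba weights. The sketch below does this with factor(), using the five in-play values listed above; woba_label() is a hypothetical helper of my own and the plotted points are invented for illustration.

```r
library(ggplot2)

# Hypothetical helper: map the five in-play woba weights to event names
woba_label <- function(w) {
  factor(w,
         levels = c(0, 0.9, 1.25, 1.6, 2),
         labels = c("Out", "Single", "Double", "Triple", "Home run"))
}

# Invented balls in play for illustration only
toy <- data.frame(
  plate_x = c(-0.5, 0.0, 0.4, 0.2, -0.1),
  plate_z = c(2.0, 2.5, 3.0, 1.8, 2.6),
  woba_value = c(0, 0.9, 2, 1.25, 1.6)
)

# The legend now lists outcomes in order from out to home run
p <- ggplot(toy, aes(plate_x, plate_z)) +
  geom_point(aes(color = woba_label(woba_value))) +
  labs(color = "Outcome")
p
```

Any woba_value outside the five listed levels would become NA under this mapping, so it only suits data already filtered to balls in play.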
- The purpose of this Welcome to R segment was to introduce ggplot2 and show how it can be used to produce interesting plots. I especially like its ability to easily produce facets since this is a good method for making graphical comparisons. Also the default parameters of ggplot2 are attractive and they don’t require much extra modification.
- Our comparison of Nola and Pivetta was interesting. Comparing the two pitchers, Nola displays a wider selection of pitch types than Pivetta. Also, comparing curve balls, Nola appears to have a high breaking pitch and he seems very consistent in his delivery, both in speed and location.
- If you want to learn more about ggplot2, a good reference is the Data Visualization chapter of R for Data Science.
- All of this code is available as a Markdown document. To get familiar with ggplot2, try the examples and graphically compare two other pitchers for the 2017 season that you care about.