Creating HexBin Plots

Kirk Goldsberry has attracted a lot of attention with his “geographic” shot charts for NBA players. These are examples of “hexbin” plots. Luckily, the hexbin package for R makes it quick to create similar plots. Here, we’ll show how to create a few hexbin plots using the MLBAM data.

Carlos Gomez has been in the news recently – let’s focus on him. We’ll start by loading the openWAR data for 2013, and locating Gomez’s MLBAM player ID. [Of course, you can also do this with a web query.]

require(openWAR)
data(MLBAM2013)
playerNames <- unique(MLBAM2013$batterName)
playerNames[grep("Gomez", playerNames)]
##  Gomez, C Gomez, J
## 1280 Levels: Altuve Andrus Ankiel Barnes, B Bedard Beltre, A ... Kiermaier
gomezId = unique(subset(MLBAM2013, batterName == "Gomez, C")$batterId)
gomez = subset(MLBAM2013, playerId.CF == gomezId)

From this subset, we can compute how many balls Gomez caught while playing CF in 2013. In the MLBAM data, the fielderId field contains the ID of the player who first fielded the ball.

require(mosaic)
head(sort(tally(~event, data=subset(gomez, fielderId == playerId.CF)), decreasing=TRUE))
##
##             Flyout            Lineout            Sac Fly
##                287                 90                 13
## Caught Stealing 2B   Defensive Indiff   Defensive Switch
##                  0                  0                  0

This confirms that Gomez caught each of these 390 balls (287 flyouts, 90 lineouts, and 13 sacrifice flies). We can also calculate statistics for the plays on which Gomez was in CF, including the Brewers’ groundout-to-airout ratio (GO/AO) and their opponents’ batting average on balls in play (BABIP).

require(plyr)
ddply(gomez, ~playerId.CF, summarise, N = length(playerId.CF), G = length(unique(gameId)), BIP = sum(isBIP), PO = sum(fielderId == playerId.CF, na.rm=TRUE), "GO/AO" = length(grep("Ground", event)) / length(grep("Fly", event)), BABIP = sum(isHit) / sum(isAB) )
##   playerId.CF    N   G  BIP  PO GO/AO  BABIP
## 1      460576 5281 145 3358 390 1.561 0.2575
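
The PO and GO/AO columns above rely on two small idioms: summing a logical comparison with na.rm=TRUE, and counting grep() matches. Here is a minimal sketch of both on a toy data frame. Every ID and event in it is made up for illustration, and it assumes that fielderId is missing (NA) on plays where no fielder touched the ball, which is what the na.rm=TRUE in the call above suggests.

```r
# Toy play-by-play rows; all IDs and events here are made up for illustration.
plays <- data.frame(
  playerId.CF = c(1, 1, 1, 1, 1),
  fielderId   = c(1, 7, NA, 1, NA),
  event       = c("Flyout", "Groundout", "Strikeout", "Lineout", "Walk"),
  stringsAsFactors = FALSE
)

# Comparing against NA yields NA, so sum() needs na.rm = TRUE to skip those rows.
caught <- plays$fielderId == plays$playerId.CF
po <- sum(caught, na.rm = TRUE)

# grep() returns the indices of matching events; length() turns them into a count.
n_ground <- length(grep("Ground", plays$event))
n_fly <- length(grep("Fly", plays$event))

print(c(PO = po, ground = n_ground, fly = n_fly))
```

Note that the pattern "Fly" matches events such as Flyout and Sac Fly but not Lineout, so the air-out count in this ratio depends on how the events are named.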

As we saw before, we can use the plot() method to visualize where Gomez’s catches were on the field.

plot(subset(gomez, fielderId == playerId.CF))

While this plot has the advantage of showing us the individual balls that Gomez caught, it can sometimes be hard to visually aggregate these data. A hexbin plot will do that for us.

Let’s try a simple hexbinplot().

require(hexbin)
hexbinplot(our.y ~ our.x, data=subset(gomez, fielderId == playerId.CF))

This plot, while a technically accurate representation of the data, is nearly meaningless because the data is not presented with any context. This is a common problem in statistics – let’s see if we can solve it.

In this case, the lines that illustrate the baseball diamond in the previous plot would really help us to understand the locations of these hexbins. Luckily, this generic baseball diamond is drawn by the panel.baseball() function in openWAR. If you are familiar with lattice graphics in R, panel.baseball() works like any other panel function – it simply adds this baseball layout to your plot.

What’s great about this is that you can use panel.baseball() to overlay this field onto any lattice plot, and hexbinplot() happens to produce a lattice plot. So, for example, we can put the baseball diamond onto the hexbin plot quite easily.

hexbinplot(our.y ~ our.x, data=subset(gomez, fielderId == playerId.CF)
, panel = function(x,y,...) {
panel.baseball()
panel.hexbinplot(x,y,...)
}
)
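
If you don’t have openWAR installed, the same layering pattern can be tried with only lattice and hexbin. The sketch below uses synthetic points in place of the MLBAM data and panel.abline() as a stand-in for panel.baseball(); the data frame pts and its values are assumptions made purely for illustration.

```r
library(lattice)
library(hexbin)

# Synthetic points standing in for batted-ball locations (not real data).
set.seed(1)
pts <- data.frame(x = rnorm(500, sd = 50), y = rnorm(500, mean = 300, sd = 40))

# The panel function receives the panel's x and y and draws its layers in order:
# first the reference lines, then the hexagons on top of them.
p <- hexbinplot(y ~ x, data = pts, xbins = 10,
  panel = function(x, y, ...) {
    panel.abline(h = 300, v = 0, lty = 2)  # stand-in for panel.baseball()
    panel.hexbinplot(x, y, ...)
  }
)
print(p)
```

Passing ... through to panel.hexbinplot() matters: that is how settings such as xbins and the color options reach the function that actually draws the hexagons.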

That worked, but it didn’t help much: the margins are not wide enough, the axes are unlabeled, and the hexbins are too small. We can also add color and change the number of colors used. A few tweaks will improve things considerably.
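
Before the full call, it may help to look at the color settings on their own. As I read the hexbin documentation, colramp should be a function that takes an integer n and returns n colors, and colorcut is a vector of boundaries on [0, 1] that splits the scaled counts into color classes. The sketch below uses only base R; the names cuts, n_classes, and pal are introduced here for illustration.

```r
# colramp is expected to be a function of n that returns n colors.
my.colors <- function(n) {
  rev(heat.colors(n))
}

# Ten cut points on [0, 1] define nine color classes.
cuts <- seq(0, 1, length = 10)
n_classes <- length(cuts) - 1

# heat.colors() runs from red to pale yellow, so the reversed ramp ends in red.
pal <- my.colors(n_classes)
print(pal)
```

Reversing the ramp is a design choice: assuming the lowest class gets the first color, sparse hexagons come out pale and the busiest ones red.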

my.colors <- function (n) {
rev(heat.colors(n))
}
hexbinplot(our.y ~ our.x, data=subset(gomez, fielderId == playerId.CF), xbins = 10
, panel = function(x,y, ...) {
panel.baseball()
panel.hexbinplot(x,y,  ...)
}
, xlim = c(-350, 350), ylim = c(-20, 525)
, xlab = "Horizontal Distance from Home Plate (ft.)"
, ylab = "Vertical Distance from Home Plate (ft.)"
, colramp = my.colors, colorcut = seq(0, 1, length = 10)
)

Of course, we’re interested in how Gomez compares to all centerfielders.

hexbinplot(our.y ~ our.x, data=subset(MLBAM2013, fielderId == playerId.CF), xbins = 50
, panel = function(x,y, ...) {
panel.baseball()
panel.hexbinplot(x,y,  ...)
}
, xlim = c(-350, 350), ylim = c(-20, 525)
, xlab = "Horizontal Distance from Home Plate (ft.)"
, ylab = "Vertical Distance from Home Plate (ft.)"
, colramp = my.colors, colorcut = seq(0, 1, length = 10)
)

It might be more instructive to compare him to a handful of other centerfielders.

key = unique(subset(MLBAM2013, batterName %in% c("Trout", "Upton, B", "Gomez, C", "Ellsbury"), select=c("batterId", "batterName")))
comp = subset(MLBAM2013, playerId.CF %in% key$batterId & fielderId == playerId.CF)

hexbinplot(our.y ~ our.x | as.factor(playerId.CF), data=comp, xbins = 10
, panel = function(x,y, ...) {
panel.baseball()
panel.hexbinplot(x,y,  ...)
}
, xlim = c(-350, 350), ylim = c(-20, 525)
, xlab = "Horizontal Distance from Home Plate (ft.)"
, ylab = "Vertical Distance from Home Plate (ft.)"
, colramp = my.colors, colorcut = seq(0, 1, length = 10)
, strip = strip.custom(factor.levels = as.character(key$batterName[order(key$batterId)]))  # names sorted by ID to match the panel order
)
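
One thing to check in a conditioned plot like this is that the strip labels line up with the panels. Conditioning on as.factor(playerId.CF) orders the panels by the sorted factor levels, so the labels passed to strip.custom() need to be in that same order. Here is a small sketch of the check, using a hypothetical lookup table with made-up IDs and names.

```r
# Hypothetical lookup table in order of first appearance (IDs and names are made up).
key <- data.frame(
  batterId   = c(300, 100, 200),
  batterName = c("Player C", "Player A", "Player B"),
  stringsAsFactors = FALSE
)

# Conditioning on as.factor(id) orders panels by sorted level, not by appearance.
panel_ids <- levels(as.factor(c(300, 300, 100, 200, 100)))

# Reordering the names by ID lines them up with the panels.
labels <- as.character(key$batterName[order(key$batterId)])

print(data.frame(panel = panel_ids, label = labels))
```

This assumes the IDs are numeric and that the lookup table has exactly one row per player.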

Note that we have not controlled for playing time here. Unlike Goldsberry’s charts, where the variable being plotted is a percentage, what we are plotting here is just a count. In effect, the hexbin plot is a binned, two-dimensional, color-coded histogram.
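
To make the histogram analogy concrete, the sketch below bins two synthetic sets of points with hexbin() and inspects the per-hexagon counts directly. It then divides by each total, which is one simple way (not Goldsberry’s method) to put players with different playing time on a comparable scale. The data and the names a, b, share_a, and share_b are made up for illustration.

```r
library(hexbin)

# Two synthetic "players" with different amounts of playing time (not real data).
set.seed(42)
a <- data.frame(x = rnorm(400, sd = 40), y = rnorm(400, mean = 300, sd = 30))
b <- data.frame(x = rnorm(100, sd = 40), y = rnorm(100, mean = 300, sd = 30))

# Shared bounds so that both players are binned on the same hexagon grid.
xb <- range(c(a$x, b$x))
yb <- range(c(a$y, b$y))
hb_a <- hexbin(a$x, a$y, xbins = 10, xbnds = xb, ybnds = yb)
hb_b <- hexbin(b$x, b$y, xbins = 10, xbnds = xb, ybnds = yb)

# Each occupied hexagon stores a count; the counts add up to the number of points.
print(c(a = sum(hb_a@count), b = sum(hb_b@count)))

# Dividing by the total converts raw counts into shares of each player's catches.
share_a <- hb_a@count / sum(hb_a@count)
share_b <- hb_b@count / sum(hb_b@count)
```

Raw counts favor whoever played more, while shares show where each player’s catches concentrate regardless of volume.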
