Measuring Strike Zone Contour Areas

We’ve talked a good bit about Pitch f/x data on this site, and particularly with strike zones. Aaron has helped out with defining the way we think about the zone, and Carson has provided great tools in R through pitchRx that allow us to do this work smoothly.

Today, I’m going to take things a step further, and talk about measuring the area of contours that we create from a Generalized Additive Model (GAM) of the strike zone. We’ll think about this from a few perspectives: 1) What is the area of the zone for a given strike probability contour, 2) How much of that contour area resides within the rulebook strike zone (a specificity-like measure), and 3) How much of the rulebook strike zone is filled with our strike zone contour (a sensitivity-like measure). This uses some modified code from this working paper, with the same types of GAM models, and we’ll reproduce a figure similar to Figure 13 (on page 57).

In going through this exercise, we’ll need to use four new packages to our library: mgcv, parallel, PBSmapping, and  splancs. For now, go and get the data set from my website. This data is an abridged version of all called pitches for Joe West from 2008 through 2013 from my personal Pitch f/x database (leaving out pitch-outs and intentional balls). Once you’ve downloaded it on your machine, go ahead and load it into R, then load the libraries as shown below:

###load libraries
library(mgcv)
library(parallel)
library(splancs)
library(PBSmapping)

###Load Data
setwd("e:/Dropbox/Baseball With R")
called <- read.csv(file="joeWest.csv", h=T)
head(called)

  first last year inning bottomInn p_throws stand     px    pz calledStrike pitch_type
1   Joe West 2008      8         0        R     R  0.901 2.977            0         SL
2   Joe West 2008      8         0        R     R  0.392 2.149            1         FA
3   Joe West 2008      8         0        R     L -1.291 3.342            0         FA
4   Joe West 2008      8         0        R     L -0.003 1.886            1         FA
5   Joe West 2008      8         0        R     L -1.750 3.215            0         FA
6   Joe West 2008      6         1        L     R -0.767 2.208            1         FA

nrow(called)
[1] 33296
nrow(subset(called, called$stand=="R"))
[1] 18621
nrow(subset(called, called$stand=="L"))
[1] 14675

You should have 33,296 total called pitches for Joe West in the data set, with about 4,000 more pitches to right handed batters than left handed batters. Make sure that your output looks like above. Notice that I’ve also provided the pitch_type, inning, year, pitcher handedness (p_throws), and whether the home team was batting (bottomInn). Our variable of interest here will be the px (horizontal) and pz (vertical) location of each pitch, as well as the calledStrike variable that takes the value 1 if the pitch was called a strike, and 0 if the pitch was a ball. The px and pz measures are in feet. px measures the distance from the center of the plate from the perspective of the catcher/umpire. Those taking negative values are on the inner half of the plate to right handed batters, while those taking positive values are on the outer half of the plate to a right handed batter. pz measures the height of the pitch above the ground. Sometimes this can take a negative value if the ball was in the dirt. There are 544 pitches with missing locational coordinates, but R will ignore these in our models automatically.

Alright, now we’re ready to go. PBSmapping and splancs are going to allow us to measure areas of contours easily, while parallel allows us to estimate our GAMs from mgcv a bit more quickly on systems with multiple cores.

Next, we want to go ahead and tell R that we want to use all but one of our cores/threads on our computer to estimate our models. So, if you have a 4-core computer, you’ll spread the estimation across 3 of them. If you have 6 cores, then you’ll use 5 of them, and so on. By dividing up the estimation into chunks, we’ll be able to get on with our visuals and measurements, rather than waiting for a long time for the models to run. Note that I’m running an Xeon W3670 with 6 cores @ 3.20GHz with 32 GB of RAM. I recommend at least 8 GB of RAM on your machine, though you might be able to get away with 6, as the function we’ll use to estimate our models is relatively efficient with memory. With these specifications, my computer ran the two GAM models fit below, and created the predicted values for each, in just under 50 seconds.

Ok, now let’s fire up our models. First, I generate a matrix of data for our predict function so that we can map a predicted two-dimensional strike zone. Then I’ll separately fit the model for right handed batters using mgcv, followed by left handed batters. We’ll use the output from the predicted strike probabilities at the given matrix coordinates for our zone area calculations.

###indicate number of cores used for parallel processing
if (detectCores()>1) {
cl <- makeCluster(detectCores()-1)
} else cl <- NULL
cl

###generate new data sequence for visualization
myx <- matrix(data=seq(from=-1.5, to=1.5, length=100), nrow=100, ncol=100)    
myz <- t(matrix(data=seq(from=1,to=4, length=100), nrow=100, ncol=100))    
fitdata <- data.frame(px=as.vector(myx), pz=as.vector(myz))

###set the random seed for cross-validation
set.seed(72684)

###fit RHB strike zone surface
fit.R <- bam(calledStrike ~ s(px, pz, k=50), method="GCV.Cp", data=subset(called, called$stand=="R"), family=binomial(link="logit"), cluster=cl)
mypredict.R <- predict(fit.R, fitdata, type="response")
mypredict.R <- matrix(mypredict.R, nrow=c(100,100))

###fit LHB strike zone surface
fit.L <- bam(calledStrike ~ s(px, pz, k=50), method="GCV.Cp", data=subset(called, called$stand=="L"), family=binomial(link="logit"), cluster=cl)
mypredict.L <- predict(fit.L, fitdata, type="response")
mypredict.L <- matrix(mypredict.L, nrow=c(100,100))

Now that we have our models, let’s go ahead and draw the 50% contours—or, the locations at which Joe West is equally likely to call a pitch a strike or a ball—onto a plot to make sure they look reasonable. Any pitch inside this contour has at least a 50% probability of being called a strike. Notice there are some strange outcomes on the LHB zone, and this could be a sample size issue. But I’m comfortable with going forward here.

###Draw 50% Strike Zone Contours
png(file="JoeWestContourPlot.png", width=828, height=450)
par(mfrow=c(1,2))
###Right Handed Batters
contour(x=seq(from=-1.5, to=1.5, length=100), y=seq(from=1, to=4, length=100),
    z=mypredict.R, lwd=3, lty="solid", axes=T, levels=.5, labels="",
    labcex=1.3, col="darkred", xlab="Horizontal Location (ft., Umpire's View)",
    ylab="Vertical Location (ft.)", main="RHB 50% Strike Zone Contour")
rect(-0.83, 1.52, 0.83, 3.42, border="black", lty="dotted")
text(0, 2.6, "> 50% Strike Calls", col="black", cex=1.2)
text(0.75, 1.1, "< 50% Strike Calls", col="black", cex=1.2)
###Left Handed Batters
contour(x=seq(from=-1.5, to=1.5, length=100), y=seq(from=1, to=4, length=100),
    z=mypredict.L, lwd=3, lty="solid", axes=T, levels=.5, labels="",
    labcex=1.3, col="darkred", xlab="Horizontal Location (ft., Umpire's View)",
    ylab="Vertical Location (ft.)", main="LHB 50% Strike Zone Contour")
rect(-0.83, 1.52, 0.83, 3.42, border="black", lty="dotted")
text(0, 2.6, "> 50% Strike Calls", col="black", cex=1.2)
text(0.75, 1.1, "< 50% Strike Calls", col="black", cex=1.2)
dev.off()

JoeWestContourPlot

Next, we’ll use the splancs package—and specifically the areapl function—to measure the area inside the contours. More specifically, we’ll actually measure the area inside the contours not just for the 50% probability line, but intervals from 10% strike probability through 90% strike probability. While I won’t do much with these here, hopefully we can come back to this in another post.

###RHB Zone Area
clines.R <- contourLines(x=seq(from=-1.5, to=1.5, length=100), y=seq(from=1, to=4, length=100), mypredict.R)
areas.R <- c(area.R.10 <- round(with(clines.R[[1]], areapl(cbind(12*x,12*y))), 3),
    area.R.20 <- round(with(clines.R[[2]], areapl(cbind(12*x,12*y))), 3),
    area.R.30 <- round(with(clines.R[[3]], areapl(cbind(12*x,12*y))), 3),
    area.R.40 <- round(with(clines.R[[4]], areapl(cbind(12*x,12*y))), 3),
    area.R.50 <- round(with(clines.R[[5]], areapl(cbind(12*x,12*y))), 3),
    area.R.60 <- round(with(clines.R[[6]], areapl(cbind(12*x,12*y))), 3),
    area.R.70 <- round(with(clines.R[[7]], areapl(cbind(12*x,12*y))), 3),
    area.R.80 <- round(with(clines.R[[8]], areapl(cbind(12*x,12*y))), 3),
    area.R.90 <- round(with(clines.R[[9]], areapl(cbind(12*x,12*y))), 3))

###LHB Zone Area
clines.L <- contourLines(x=seq(from=-1.5, to=1.5, length=100), y=seq(from=1, to=4, length=100), mypredict.L)
areas.L <- c(area.L.10 <- round(with(clines.L[[1]], areapl(cbind(12*x,12*y))), 3),
    area.L.20 <- round(with(clines.L[[2]], areapl(cbind(12*x,12*y))), 3),
    area.L.30 <- round(with(clines.L[[3]], areapl(cbind(12*x,12*y))), 3),
    area.L.40 <- round(with(clines.L[[4]], areapl(cbind(12*x,12*y))), 3),
    area.L.50 <- round(with(clines.L[[5]], areapl(cbind(12*x,12*y))), 3),
    area.L.60 <- round(with(clines.L[[6]], areapl(cbind(12*x,12*y))), 3),
    area.L.70 <- round(with(clines.L[[7]], areapl(cbind(12*x,12*y))), 3),
    area.L.80 <- round(with(clines.L[[8]], areapl(cbind(12*x,12*y))), 3),
    area.L.90 <- round(with(clines.L[[9]], areapl(cbind(12*x,12*y))), 3))

###combine together for analysis
zoneAreas <- data.frame(cbind(areas.R, areas.L))
zoneAreas$prob <- c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9)
zoneAreas

  areas.R areas.L prob
1 645.723 623.354  0.1
2 570.560 546.404  0.2
3 521.785 497.718  0.3
4 482.284 459.095  0.4
5 446.378 424.566  0.5
6 410.759 390.815  0.6
7 372.335 354.908  0.7
8 325.962 312.230  0.8
9 258.251 250.408  0.9

As you can see, the area of the strike zone contours decreases as the probability of a given pitch being called a strike increases. This makes sense, as we would expect a much smaller area where strikes are called at least 90% of the time (right down the middle) relative to when strikes are called at least 10% of the time (often well outside the rulebook strike zone). In other words, our 90% contour is nested inside the 10% contour.

Next, we’ll use the joinPolys function from PBSmapping to identify the overlap between our empirically estimated strike zone and the MLB rulebook strike zone. I define the rulebook strike zone using anthropometric data and average batter height, and the diameter of the ball on either side of the edge of the plate (see my paper linked earlier for a more specific description). This rulebook zone is approximately 454 square inches.

###Exhibit zone overlap
poly.R <- data.frame(clines.R[[5]])
poly.R <- poly.R[,2:3]
poly.R$PID <- 1
poly.R$POS <- seq(from=1, to=nrow(poly.R), by=1)
colnames(poly.R) <- c("X", "Y", "PID", "POS")

poly.L <- data.frame(clines.L[[5]])
poly.L <- poly.L[,2:3]
poly.L$PID <- 1
poly.L$POS <- seq(from=1, to=nrow(poly.L), by=1)
colnames(poly.L) <- c("X", "Y", "PID", "POS")

###define rulebook average strike zone
zBot <- 1.52
zTop <- 3.42
zW <- 0.83

ruleZone <- data.frame(PID=rep(1, 4), POS=1:4, X=c(-zW,-zW,zW,zW), Y=c(zBot,zTop,zTop,zBot))
area.Zone <- ((12*.83)-(12*(-.83)))*((3.42*12)-(1.52*12))
area.Zone
[1] 454.176

###Join Our Polygons for the contour and the rule zone (from PBSmapping)
overlap.R <- joinPolys(poly.R, ruleZone)
overlap.L <- joinPolys(poly.L, ruleZone)

Now we can plot the overlap using the base graphics in R and the polygon function with our overlap calculations from above. I fill in only the area that overlaps between the empirically derived zone and the rulebook strike zone as we define it here. As you can see in the plot below, R does a nice job of crisply filling in the overlapping area of the two polygons we created.

###Contour Overlap Plot
png(file="JoeWestOverlapPlot.png", height=450, width=828)
par(mfrow=c(1,2))
plot(1,1,ylim=c(1,4),xlim=c(-1.5,1.5), t="n", xlab="Horizontal Location (Umpire's View)", ylab="Vertical Location", main="Surface Area Within Rulebook Zone (RHB)")
polygon(poly.R$X, poly.R$Y, border=2, lwd=2)
polygon(ruleZone$X, ruleZone$Y, lty="dashed")
polygon(overlap.R$X, overlap.R$Y, col=rgb(0,0,1,0.2), border=F)
plot(1,1,ylim=c(1,4),xlim=c(-1.5,1.5), t="n", xlab="Horizontal Location (Umpire's View)", ylab="Vertical Location", main="Surface Area Within Rulebook Zone (LHB)")
polygon(poly.L$X, poly.L$Y, border=2, lwd=2)
polygon(ruleZone$X, ruleZone$Y, lty="dashed")
polygon(overlap.L$X, overlap.L$Y, col=rgb(0,0,1,0.2), border=F)
dev.off()

JoeWestOverlapPlot

Finally, let’s identify the specific area of our empirically derived strike zone that lies inside the rulebook zone, the area of the empirically derived strike zone that lies outside of the rulebook strike zone, and the area inside the rulebook strike zone that is more likely to be called a ball than a strike.

###Calculate the overlap between the different zone boundaries and the rulebook zone for RHB & LHB
#Right Handed Batters
poly.R.50 <- data.frame(clines.R[[5]])
poly.R.50 <- poly.R.50[,2:3]
poly.R.50$PID <- 1
poly.R.50$POS <- seq(from=1, to=nrow(poly.R.50), by=1)
colnames(poly.R.50) <- c("X", "Y", "PID", "POS")
overlap.R.50 <- joinPolys(poly.R.50, ruleZone)
strikeAreaInside.R.50 <- round(with(overlap.R.50, areapl(cbind(12*X,12*Y))), 2)
strikeAreaOutside.R.50 <- area.R.50 - strikeAreaInside.R.50
ballAreaInside.R.50 <- area.Zone - strikeAreaInside.R.50
ZoneOverlap <- data.frame("R", "50%", strikeAreaInside.R.50, strikeAreaOutside.R.50, ballAreaInside.R.50)
colnames(ZoneOverlap) <- c("Handedness", "Boundary", "Strike_Area_Inside_Zone", "Strike_Area_Outside_Zone", "Ball_Area_Inside_Zone")

#Left Handed Batters
poly.L.50 <- data.frame(clines.L[[5]])
poly.L.50 <- poly.L.50[,2:3]
poly.L.50$PID <- 1
poly.L.50$POS <- seq(from=1, to=nrow(poly.L.50), by=1)
colnames(poly.L.50) <- c("X", "Y", "PID", "POS")
overlap.L.50 <- joinPolys(poly.L.50, ruleZone)
strikeAreaInside.L.50 <- round(with(overlap.L.50, areapl(cbind(12*X,12*Y))), 2)
strikeAreaOutside.L.50 <- area.L.50 - strikeAreaInside.L.50
ballAreaInside.L.50 <- area.Zone - strikeAreaInside.L.50
lefties <- data.frame("L", "50%", strikeAreaInside.L.50, strikeAreaOutside.L.50, ballAreaInside.L.50)
colnames(lefties) <- c("Handedness", "Boundary", "Strike_Area_Inside_Zone", "Strike_Area_Outside_Zone", "Ball_Area_Inside_Zone")
ZoneOverlap <- data.frame(rbind(ZoneOverlap, lefties))
ZoneOverlap
  Handedness Boundary Strike_Area_Inside_Zone Strike_Area_Outside_Zone Ball_Area_Inside_Zone
1          R      50%                  381.33                   65.048                72.846
2          L      50%                  357.42                   67.146                96.756

And there you have it. We can see that our 50% boundary is mostly within the rulebook strike zone. There is about 65-67 square inches of the boundary area outside the rulebook zone. And a somewhat significant amount of area within the rulebook zone that is more likely to be called a ball than a strike. One of the interesting things to do next is look at the distance between boundaries and whether they’re getting closer over time. In the paper I link above, I do this and we see boundaries getting closer and closer together, meaning the strike zone boundary for umpires is becoming more crisp than it used to be (less fuzzy ball-strike area at the edges). There is plenty of room for improvement, but maybe in my next post we can look at areas between each of the contour boundaries. We can also toy a bit with putting some confidence intervals on our contours, which I’ve ignored here.

Follow

Get every new post delivered to your Inbox.

Join 52 other followers