# Previously in Exploring Baseball Data with R

In my last post we examined differences in the width of the MLB strike zone as measured by the PITCHf/x data, compared to the rule book definition of the strike zone width. In this post, we’ll continue examining methods for properly parameterizing the MLB strike zone, this time, by addressing the upper and lower boundaries. I should mention that, in order to set the stage for this post, I calculated umpire accuracy rates by aggregating the mean for each batter’s upper and lower strike zone boundaries. In this post, we’ll look at a more accurate method for accounting for these boundaries.

Again, we’re primarily interested in understanding umpires’ patterns of decision making. When measuring anything, we must ensure all parameters are properly operationalized. We’ve already seen that when we account for the radius of a regulation sized baseball by adding 1.57 inches to each side of the plate, umpires’ accuracy rates improve from approximately 86.41% to about 92.42%. This difference of 6% matters–especially if you’re at all tired of hearing people complain about how “bad” umpires are.

Unlike the width, the upper and lower boundaries of the strike zone are dynamic. Let’s refer back to Rule 2.00 of the Official Baseball Rules (MLB, 2014):

The STRIKE ZONE is that area over home plate the upper limit of which is a horizontal
line at the midpoint between the top of the shoulders and the top of the uniform
pants, and the lower level is a line at the hollow beneath the kneecap. The Strike Zone shall be determined from the batterâs stance as the batter is prepared to swing at a pitched ball (p. 21).

# Data and R Library Setup

For this post, I’ll load the following R packages as well as the PITCHf/x data from 2014. I’ve already scraped and cleaned the 2014 PITCHf/x data using Carson’s `pitchRx` package. If you want to get up and running with your own PITCHf/x database, check out Carson’s earlier posts.

```# Load package libraries
library(ggplot2)
library(dplyr)

# Read in 2014 PITCHf/x data
```

# Accounting for the Upper and Lower Boundaries

PITCHf/x data include variables for the upper (`sz_top`) and lower boundaries (`sz_bot`) of player’s strike zones. Several issues are known about how these parameters are determined. Chief among them, there appears to be a great deal of variability in player’s upper and lower boundaries over the course of the season. It’s hard to imagine players modify their batting stance over the course of the season to account for this variability. Instead, this is likely due to the fact that each stadium’s PITCHf/x operator, to the best of my knowledge, sets these parameters manually. Brian has also addressed this issue before. To illustrate, let’s take a look at a density plot for Matt Carpenter’s upper and lower boundaries from 2014. To do this, we can use `dplyr` to extract all pitches Carpenter saw in 2014.

```# Filter all Matt Carpenter 2014 at bats
carp <- pfx_14 %>%
#select(sz_top, sz_bot, player_sz_top, player_sz_bot) %>%
filter(batter_name == "Matt Carpenter")
```

Now that we have a our new `carp` data frame, let’s perform a simple Shapiro-Wilk test of normality for Carpenter’s `sz_top` and `sz_bot` values. If the distribution is normal, we should see \$p\$ values >.05.

```# Shapiro-Wilk test of normality for
# Matt Carpenter's 2014 sz_top/sz_bot
shapiro.test(carp\$sz_top)
```
```##
## Shapiro-Wilk normality test
##
## data: carp\$sz_top
## W = 0.9414, p-value < 2.2e-16
```
```shapiro.test(carp\$sz_bot)
```
```##
## Shapiro-Wilk normality test
##
## data: carp\$sz_bot
## W = 0.9221, p-value < 2.2e-16
```

We can see that neither of Matt Carpenter’s upper and lower boundary distributions are normal. This is problematic when aggregating data over the course of multiple seasons. It may also help to visualize these distributions. Let’s look at Carpenter’s upper

```# Density plot of Matt Carpenter's 2014 sz_top
ggplot(data = carp, aes(x = sz_top)) +
geom_density(fill = "gray35") +
scale_x_continuous(breaks = seq(0, 6, 0.10),
name = "Top of Strike Zone (sz_top)") +
scale_y_continuous(limits = c(0, 8),
breaks = seq(0, 8, 1), name = "Density") +
theme_bw()
```

```# Density plot of Matt Carpenter's 2014 sz_bot
ggplot(data = carp, aes(x = sz_bot)) +
geom_density(fill = "gray35") +
scale_x_continuous(breaks = seq(0, 2, 0.10),
name = "Bottom of Strike Zone (sz_bot)") +
scale_y_continuous(limits = c(0, 16),
breaks = seq(0, 16, 2), name = "Density") +
theme_bw()
```

# What About Batter’s Height?

Perhaps a better way to account for the upper and lower boundaries is to consider batters’ heights. Although still imperfect, I think this may help provide some consistency when properly parameterizing the strike zone. Moreover, accounting for batters’ heights may also be more in-line with Rule 2.00. The PITCHf/x data conveniently includes a variable for the height of all batters (see `b_height`). One minor issue is that batters’ heights are stored as character vectors. To correct for this, I’ve converted the PITCHf/x label to inches. From here, we can employ an anthropometric strategy. Briefly, anthropometry refers to the comprehensive measurement of human individuals.

Turns out the U.S. Army has been collecting a wealth of anthropometric data on all enlistees since 1989 in a study called the U.S. Army Anthropometry Survey (ANSUR). In the most recent data available from this study there were 166 different measurements taken from 1,774 men. Here’s the final report. A new wave of data were collected in 2010, but these data are not currently available to the public. Based on a conversation I had recently with Matt Reed, head of the Biosciences Group of the University of Michigan Transportation Research Institute, the 1989 data do not appear to differ all that greatly from the 1989 wave.

Several measurements from the ANSUR data are of interest here: 1. STATURE: Individual’s height (all ANSUR data are in millimeters) 2. WAIST_HT_NATURAL: length from ground to midpoint between shoulders and waist, and 3. PATELLA.MID_HT: length from ground to knee. The latter two are essentially the proper upper and lower boundaries, according to Rule 2.00, of the strike zone. See figure below.

Using `dplyr`, I extracted the STATURE, WAIST_HT_NATURAL, and PATELLA.MID_HT variables from the ANSUR data and converted the values from millimeters to inches. From there I joined the `ansur_hts` data frame below to my `pfx_14` data frame. This is all pre-loaded in the `pfx_14` data frame we read in above. In case you’re interested, here’s how I did it.

```# Read in raw ANSUR data

# Select a few variables from ANSUR data
ansur <- ansur %>%
select(STATURE, WAIST_HT_NATURAL, PATELLA.MID_HT)

# Convert ANSUR dimensions from millimeters to inches
ansur <- compute(ansur*0.0393701)

# Round values
ansur\$STATURE <- round(ansur\$STATURE)

# Convert ANSUR data from inches to feet
ansur_hts <- ansur %>%
select(b_height = STATURE, WAIST_HT_NATURAL, PATELLA.MID_HT) %>%
group_by(b_height) %>%
summarize(ht_top = mean(WAIST_HT_NATURAL)/12, ht_bott = mean(PATELLA.MID_HT)/12)
```

# Updated Umpire Accuracy

By comparing the umpire’s original decision (`call`) to the pitch’s location (`zone`), I added a dichotomous variable, `u_test_ht`, to my `pfx_14` data frame, which represents the accuracy of the umpire’s decision (i.e., 1 = correct, 0 = incorrect), accounting for both the radius of the baseball and our new `ht_top` and `ht_bott` variables from above.

```# Create *u_test_ht* variable for umpire's decision [1 = correct]
# given ANSUR data/batter height transformation
# Includes ball radius = ((1.57*2 + 17) / 12) / 2 = 0.8391667
pfx_14\$u_test_ht <- with(pfx_14,
ifelse(call == "Ball" & px < -0.8391667 | px > 0.8391667 |
pz < ht_bott | pz > ht_top, 1,
ifelse(call == "Called Strike" & pz >= ht_bott & pz <= ht_top &
px >= -0.8391667 & px <= 0.8391667, 1,
ifelse(call == "Ball" & pz >= ht_bott & pz <= ht_top &
px > -0.8391667 & px < 0.8391667, 0,
ifelse(call == "Called Strike" & px < -0.8391667 | px > 0.8391667 |
pz < ht_bott | pz > ht_top, 0, 99)))))
```

Now, let’s examine umpire accuracy over the course of the entire 2014 season while accounting for the ANSUR dimensions.

```# Calculate mean and se umpire accuracy rates
(pfx_accuracy <- pfx_14 %>%
group_by(umpire) %>%
summarize(accuracy = mean(u_test_ht),
se = sd(u_test_ht) / sqrt(length(u_test_ht)))
)
```

The above output includes each individual umpire’s mean accuracy for all of the ball/strike decisions he made in 2014. Over the course of the season, umpires were approximately 90.46% (SD = 0.12). Interesting! Before, when we used each batter’s mean upper and lower boundaries, the umpire accuracy rate was 92.42% (SD = 0.013). Although this may not see like much of a difference, I think these values are perhaps more reliable.

```# Calculate mean *accuracy*
with(pfx_accuracy, mean(accuracy))

# Calculate sd *accuracy*
with(pfx_accuracy, sd(accuracy))
```

It might help to visualize these accuracy rates in a dotplot. I’ve set one up below and added the SE to each side of the accuracy points. This plot uses our `pfx_accuracy` data frame from above.

```# Sort *pfx_accuracy* in descending order
sort_pfx_accuracy <- pfx_accuracy[order(-pfx_accuracy\$accuracy), ]

### --- Build dotplot --- ###
ggplot(data = sort_pfx_accuracy,
aes(x = accuracy, y = sort(umpire, decreasing = TRUE))) +
geom_vline(aes(xintercept = mean(accuracy)),
color = "red", linetype = 2, size = 0.35) +
geom_segment(aes(x = accuracy - se, xend = accuracy + se,
y = sort(umpire, decreasing = TRUE),
yend = sort(umpire, decreasing = TRUE)),
color = "gray30", size = 0.25) +
geom_line(aes(group = 1), color = "gray30") +
geom_point(color = "gray10") +
scale_x_continuous(limits = c(0.78, 0.95), breaks = seq(0.78, 0.95, 0.010),
name = "nUmpire Decision AccuracynWidth of Strike Zone = 20.14 inches") +
scale_y_discrete(name = "", labels = rev(sort_pfx_accuracy\$umpire)) +
theme_bw() +
theme(axis.title.x = element_text(size = 10),
axis.text.x = element_text(size = 8)) +
theme(axis.title.y = element_text(size = 10),
axis.text.y = element_text(size = 6))
```

Using our new upper and lower boundary parameters, umpire accuracy appears to vary from approximately 82.17% to about 93.08%. Again, the data include both full-time and AAA call-up umpires who may have either worked spring training or various regular season games. I realize this may be somewhat problematic, but since this series of posts are somewhat exploratory in nature, I’ll keep the MiLB guys in. In my next post, we’ll use some multilevel modeling strategies to try and account for any umpire-level effects.

All data files and R code used are available in a GitHub Repository. As always, feedback is welcome!