Exploring Hit by Pitch Rates

As a Phillies fan, I have found this a tough season, but I enjoyed watching Chase Utley in the 2014 All-Star Game. Utley had two notable plate appearances: he hit a towering double in his first PA and was hit by a pitch in his second PA. If you know anything about Utley, you know that he tends to get hit by pitches, so this event during the All-Star Game seemed in character.

Anyway, this observation motivated a few questions. First, how has the tendency to be hit by a pitch changed over seasons? Second, who are the HBP stars, that is, players that tend to have high HBP rates?

I use the Lahman season data. Below, the variable “location” is the name of the folder where I have downloaded the Lahman text files. I read the batting data into the data frame Batting and the player biographical data into the data frame Master.

location <- "/Users/albert/Desktop/lahman-csv_2013-12-10/"
Batting <- read.csv(paste(location, "Batting.csv", sep=""))
Master <- read.csv(paste(location, "Master.csv", sep=""))

Using the merge function, I add the players’ first and last names from the Master file to the Batting data frame.

Batting <- merge(Batting, 
                 Master[, c("playerID", "nameFirst", "nameLast")],
                 by="playerID")

The function hbp.work will summarize the HBP data for all hitters in a particular season. Here are the things that the function will do. (I illustrate functions using the dplyr package.)

  1. Use the filter function to focus on the batting data for a particular season.
  2. Use the mutate function to create a plate appearance variable.
  3. Use the filter function again to limit work to players with at least one plate appearance.
  4. Use a function from the LearnBayes package to smooth the HBP rates. The mutate function is used to define the HBP (smoothed) estimates. (I use this same method to smooth batting averages in a previous post.)
  5. These HBP estimates are transformed to logits by the mutate function.
  6. Use the standard EDA rule for flagging outliers to identify players (among those with at least 300 PA) who have unusually high HBP rates.

library(dplyr)
library(LearnBayes)

hbp.work <- function(season){
  # fit a beta-binomial exchangeable model by Laplace approximation;
  # the mode is on the (logit eta, log K) scale, so transform back
  fit.model <- function(d){
     fit <- laplace(betabinexch, c(1, 1), d)
     eta <- with(fit, exp(mode[1]) / (1 + exp(mode[1])))
     K <- exp(fit$mode[2])
     return(list(eta=eta, K=K))
  }
  # batting rows for this season with at least one plate appearance
  Batting.season <- filter(Batting, yearID == season)
  Batting.season <- mutate(Batting.season,
                       PA = AB + BB + HBP + SH + SF)
  Batting.season <- filter(Batting.season, PA >= 1)
  # raw rates and smoothed (shrunken) estimates
  F <- fit.model(with(Batting.season, cbind(HBP, PA))) 
  Batting.season <- mutate(Batting.season,
                       HBP.rate = HBP / PA,
              HBP.estimate = (HBP + F$K * F$eta) / (PA + F$K))  
  Batting.season <- mutate(Batting.season,
            Logit.estimate = log(HBP.estimate / (1 - HBP.estimate)))
  HBP.season <- dplyr::select(Batting.season, nameFirst, nameLast, 
                   PA, HBP, HBP.rate, 
                   HBP.estimate, Logit.estimate)
  HBP.season <- arrange(HBP.season, desc(HBP.estimate))
  # upper fence: upper hinge + 1.5 * hinge spread, among players with 300+ PA
  S <- fivenum(filter(HBP.season, PA >= 300)$Logit.estimate)
  fence <- S[4] + 1.5 * diff(S[c(2, 4)])
  HBP.season <- mutate(HBP.season,
            Outlier=ifelse(Logit.estimate > fence, "yes", "no"))
  data.frame(Season=season, HBP.season)
}
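To make the outlier rule in the last step concrete, here is a minimal sketch in base R on a small made-up vector. The numbers are toy values chosen for illustration, not HBP data; only the fence calculation mirrors the one in hbp.work.

```r
# Toy values standing in for logit HBP estimates (made up for illustration).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)

# fivenum() returns min, lower hinge, median, upper hinge, max.
S <- fivenum(x)

# Upper fence: upper hinge plus 1.5 times the spread between the hinges.
fence <- S[4] + 1.5 * diff(S[c(2, 4)])

# Flag values above the fence, in the same way hbp.work uses ifelse().
outlier <- ifelse(x > fence, "yes", "no")
data.frame(x, outlier)
```

Only an upper fence is used because the interest here is in unusually high values; the same rule with the lower hinge would flag unusually low ones.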

I illustrate this function for the 2008 season, displaying the first six rows of the result.

HBP.2008 <- hbp.work(2008)
head(HBP.2008)
##   Season nameFirst   nameLast  PA HBP HBP.rate HBP.estimate Logit.estimate
## 1   2008     Chase      Utley 707  27  0.03819      0.03010         -3.473
## 2   2008     Jason     Giambi 565  22  0.03894      0.02923         -3.503
## 3   2008    Carlos    Quentin 569  20  0.03515      0.02670         -3.596
## 4   2008      Alex       Cora 179   9  0.05028      0.02539         -3.648
## 5   2008     Chris   Iannetta 407  14  0.03440      0.02422         -3.696
## 6   2008      Josh Willingham 416  14  0.03365      0.02390         -3.710
##   Outlier
## 1      no
## 2      no
## 3      no
## 4      no
## 5      no
## 6      no

In the 2008 season, Chase Utley had the highest HBP estimate. His observed HBP rate was 27 / 707 = 0.03819, but a better estimate of his HBP ability is the smoothed estimate 0.03010. For the purpose of looking at a group of rates, it helps to reexpress them on a logit scale, where logit = log(rate / (1 – rate)). In this season, note that Alex Cora had a higher HBP rate than Utley, but because Cora had only 179 PA, we believe Utley has a higher HBP ability.
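The reversal between Utley and Cora can be checked by hand. The sketch below applies the smoothing formula (HBP + K * eta) / (PA + K) to the two players' 2008 counts. The values of eta and K here are assumptions picked only for illustration; they are not the fitted values from the model, so the results will be close to, but not the same as, the table.

```r
# Observed counts taken from the 2008 table.
players <- data.frame(name = c("Chase Utley", "Alex Cora"),
                      HBP = c(27, 9),
                      PA = c(707, 179))

# ASSUMPTION: illustrative values only, not the fitted eta and K.
eta <- 0.009   # overall HBP rate that estimates are pulled toward
K <- 270       # weight of that overall rate, on the scale of plate appearances

# Raw rate, smoothed estimate, and logit of the smoothed estimate.
players$HBP.rate <- with(players, HBP / PA)
players$HBP.estimate <- with(players, (HBP + K * eta) / (PA + K))
players$Logit.estimate <- with(players,
                               log(HBP.estimate / (1 - HBP.estimate)))
players
```

Since K is added to PA in the denominator, a player with few plate appearances is pulled more strongly toward eta, which is why Cora's estimate drops below Utley's even though his raw rate is higher.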

I collect HBP estimates for all players for the seasons 1960 through 2013.

D <- NULL
for (season in 1960:2013)
  D <- rbind(D, hbp.work(season))
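As an aside, the same accumulation can be written without growing a data frame inside a loop. This is a sketch of that alternative; toy.work is a hypothetical stand-in for hbp.work so that the example runs without the Lahman files.

```r
# Hypothetical stand-in for hbp.work(): a tiny data frame per season.
toy.work <- function(season) {
  data.frame(Season = season, PA = c(300, 450))
}

# Growing a data frame season by season, starting from NULL.
D.loop <- NULL
for (season in 1960:1962)
  D.loop <- rbind(D.loop, toy.work(season))

# Alternative: build a list of per-season data frames, then bind once.
D.list <- do.call(rbind, lapply(1960:1962, toy.work))

# Both approaches give the same rows in the same order.
all.equal(D.loop, D.list)
```

For 54 seasons either form is fast enough; the second mainly avoids having to initialize the result by hand.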

Graph the logits of the HBP estimates for all players with at least 300 PA using the ggplot2 package.

library(ggplot2)
ggplot(filter(D, PA >= 300), 
     aes(factor(Season), Logit.estimate)) + 
  geom_boxplot()


There are some interesting things that we see from the graph.

  1. Generally, HBP rates changed substantially over the seasons 1960 through 2013. They hit a low in the mid-1980s, then rose steadily to a peak around 2003.

  2. There are a significant number of HBP stars, people who had unusually high HBP estimates in this period.

The Outlier variable in the data frame indicates all of the HBP outliers. By tabulating this variable, we can learn who the HBP stars were during the 1960 – 2013 period. It turns out that only three players were outliers for 5 or more seasons: Don Baylor (11 seasons), Ron Hunt (7 seasons), and Chet Lemon (5 seasons). By the way, it is interesting that Craig Biggio has more career HBP than Baylor, but Biggio appeared as an HBP outlier in only three seasons.
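The tabulation might look something like the sketch below. The data frame D.toy and its placeholder player names are made up to have the same columns as the output of hbp.work; restricting to players with at least 300 PA before counting is my assumption about how the tally should be done.

```r
# Made-up rows in the shape returned by hbp.work(); names are placeholders.
D.toy <- data.frame(
  Season    = c(2001, 2002, 2003, 2001, 2002, 2003),
  nameFirst = c("Player", "Player", "Player", "Player", "Player", "Player"),
  nameLast  = c("A", "A", "A", "B", "B", "B"),
  PA        = c(600, 550, 610, 500, 250, 520),
  Outlier   = c("yes", "yes", "yes", "no", "yes", "yes")
)

# ASSUMPTION: count only outlier seasons with at least 300 PA.
stars <- subset(D.toy, PA >= 300 & Outlier == "yes")

# Number of outlier seasons per player, most frequent first.
counts <- sort(table(paste(stars$nameFirst, stars$nameLast)),
               decreasing = TRUE)
counts
```

Players with the same first and last name would be merged by this count; tabulating on playerID instead would avoid that, at the cost of carrying that column through hbp.work.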

