Last night, I watched the Kansas City Royals finish their sweep of the Los Angeles Angels. One of the interesting aspects of the Royals was their propensity for stealing bases. This inspired me to explore stolen bases, or more accurately stolen base attempts of 2nd base.
We’ll use Retrosheet play-by-play from the 2013 season to get answers to the following questions:
- How did teams differ in SB attempts and their success rates?
- When do teams attempt stolen bases? What innings and how many outs?
- We know pitchers differ in their tendency to allow stolen bases. In 2013, which pitchers led in SB attempts, and what were the SB success rates against these pitchers?
I’ve described the process of downloading Retrosheet play-by-play data here. In the code below, I have the Retrosheet data in the file “all2013.csv”, a file containing the headers in “fields.csv”, and a rosters file “roster2013.csv”.
I read these files for the 2013 season in R.
# Play-by-play file has no header row; column names come from fields.csv
all2013 <- read.csv("~/Desktop/PGP Folder/download.folder/unzipped/all2013.csv",
                    header=FALSE)
fields <- read.csv("~/Desktop/PGP Folder/download.folder/unzipped/fields.csv")
names(all2013) <- fields$Header
roster2013 <- read.csv("~/Desktop/PGP Folder/download.folder/unzipped/roster2013.csv")
We will focus on steals of second base. The relevant variables are
RUN1_SB_FL and RUN1_CS_FL . We create a new data frame
stealing.first that only considers these events.
library(dplyr)
# Keep only plays where the runner on first stole or was caught stealing
stealing.first <- filter(all2013,
                         RUN1_SB_FL == TRUE | RUN1_CS_FL == TRUE)
First, we are interested in seeing how the number of stolen base attempts and success rates vary by team. We create a new variable
BAT_TEAM_ID that is the identity of the team who is batting and trying to steal the base.
# The first three characters of GAME_ID identify the home team
stealing.first <- mutate(stealing.first,
                         HOME_TEAM_ID=substr(GAME_ID, 1, 3))
stealing.first <- mutate(stealing.first,
                         BAT_TEAM_ID=ifelse(BAT_HOME_ID==1,
                                            as.character(HOME_TEAM_ID),
                                            as.character(AWAY_TEAM_ID)))
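To make the batting-team logic concrete, here is a minimal sketch on a toy data frame. The three rows and their game IDs are made up for illustration (they are not taken from the 2013 file), and I use base R only so the snippet runs without any packages.

```r
# Toy rows; values are invented for illustration, not real Retrosheet records
toy <- data.frame(
  GAME_ID      = c("KCA201304080", "KCA201304080", "ANA201305130"),
  AWAY_TEAM_ID = c("MIN", "MIN", "KCA"),
  BAT_HOME_ID  = c(1, 0, 0),   # 1 = home team batting, 0 = visitors batting
  stringsAsFactors = FALSE
)

# Home team is the first three characters of the game id
toy$HOME_TEAM_ID <- substr(toy$GAME_ID, 1, 3)

# Batting team is the home team when BAT_HOME_ID is 1, otherwise the visitors
toy$BAT_TEAM_ID <- ifelse(toy$BAT_HOME_ID == 1,
                          toy$HOME_TEAM_ID,
                          toy$AWAY_TEAM_ID)

toy$BAT_TEAM_ID
# "KCA" "MIN" "KCA"
```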
We compute the number of stolen bases and caught stealing for all 30 teams and place these in the data frame
team.stealing . From this data frame, we compute the number of attempts and the success rate.
team.stealing <- summarize(group_by(stealing.first, BAT_TEAM_ID),
                           SB=sum(RUN1_SB_FL==TRUE),
                           CS=sum(RUN1_CS_FL==TRUE))
team.stealing <- mutate(team.stealing,
                        Success.Rate = SB / (SB + CS),
                        Attempts = SB + CS)
We construct a scatterplot of the attempts and success rates for all teams. Note that teams that tend to steal more bases also tend to be more successful. The variability both in SB attempts and success rates is remarkable. Clearly, teams place different values on stolen bases, and I suppose that teams have different “speed” players and coaching expertise in how to steal bases.
library(ggplot2)
ggplot(team.stealing, aes(Attempts, Success.Rate, label=BAT_TEAM_ID)) +
  geom_point() +
  geom_smooth(method="lm") +
  geom_text() +
  labs(title="1B Steal Attempts and Success Rates for All 2013 Teams")
Next we look at stealing of second base during different out situations. I use the
summarize and mutate functions to break down attempts and success rate by the number of outs. Note that runners are most likely to attempt a steal of 2nd base with two outs, and they are also most successful when there are two outs.
outs.stealing <- summarize(group_by(stealing.first, OUTS_CT),
                           SB=sum(RUN1_SB_FL==TRUE),
                           CS=sum(RUN1_CS_FL==TRUE))
outs.stealing <- mutate(outs.stealing,
                        Success.Rate = SB / (SB + CS),
                        Attempts = SB + CS)
outs.stealing
## Source: local data frame [3 x 5]
##
##   OUTS_CT  SB  CS Success.Rate Attempts
## 1       0 559 244       0.6961      803
## 2       1 785 359       0.6862     1144
## 3       2 976 291       0.7703     1267
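As a quick check, the success rates in the outs table can be recomputed directly from the printed SB and CS counts. The sketch below retypes those counts by hand; the `Attempt.Share` column is my own addition (it does not appear in the original output) and simply shows what fraction of all attempts came in each out situation.

```r
# Counts copied from the printed outs.stealing table
outs <- data.frame(OUTS_CT = 0:2,
                   SB = c(559, 785, 976),
                   CS = c(244, 359, 291))

outs$Attempts     <- outs$SB + outs$CS
outs$Success.Rate <- round(outs$SB / outs$Attempts, 4)

# Share of all steal attempts of 2nd base by number of outs (added column)
outs$Attempt.Share <- round(outs$Attempts / sum(outs$Attempts), 3)

outs
# Success.Rate:  0.6961 0.6862 0.7703
# Attempt.Share: 0.250  0.356  0.394
```

Roughly 39 percent of the attempts came with two outs, against 25 percent with no outs.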
To explore the patterns by inning, we do a similar breakdown by the
INN_CT variable, pooling all extra innings into a single 10th category. Runners are least successful in stealing 2nd base in the 2nd and 4th innings (success rates of about 68 to 69 percent),
and they are more successful in late innings.
inning.stealing <- summarize(group_by(stealing.first, INN_CT),
                             SB=sum(RUN1_SB_FL==TRUE),
                             CS=sum(RUN1_CS_FL==TRUE))
# Collapse all extra innings into inning 10
inning.stealing <- mutate(inning.stealing,
                          Inning=pmin(INN_CT, 10))
inning.stealing <- summarize(group_by(inning.stealing, Inning),
                             SB=sum(SB),
                             CS=sum(CS),
                             Success.Rate=SB / (SB + CS))
inning.stealing
## Source: local data frame [10 x 4]
##
##    Inning  SB  CS Success.Rate
## 1       1 343 129       0.7267
## 2       2 194  88       0.6879
## 3       3 293 115       0.7181
## 4       4 228 109       0.6766
## 5       5 268 108       0.7128
## 6       6 237  95       0.7139
## 7       7 271  90       0.7507
## 8       8 259  86       0.7507
## 9       9 164  60       0.7321
## 10     10  63  14       0.8182
ggplot(inning.stealing, aes(Inning, Success.Rate)) + geom_point(size=4)
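Two small illustrations may help here. The first shows how `pmin` caps inning numbers at 10, which is what pools the extra innings. The second retypes the inning counts from the table and pools them into early and late groups; the cutoff at the 7th inning is my own assumption, chosen only to put a number on the "late innings" comparison.

```r
# pmin() takes the element-wise minimum, so every inning past the 9th becomes 10
innings <- c(1, 5, 9, 10, 11, 14)
capped <- pmin(innings, 10)
capped
# 1 5 9 10 10 10

# Counts copied from the printed inning.stealing table
by.inning <- data.frame(
  Inning = 1:10,
  SB = c(343, 194, 293, 228, 268, 237, 271, 259, 164, 63),
  CS = c(129,  88, 115, 109, 108,  95,  90,  86,  60, 14)
)

# Hypothetical split: innings 1-6 versus innings 7 and later
late <- by.inning$Inning >= 7
early.rate <- sum(by.inning$SB[!late]) /
  sum(by.inning$SB[!late] + by.inning$CS[!late])
late.rate <- sum(by.inning$SB[late]) /
  sum(by.inning$SB[late] + by.inning$CS[late])

round(c(early = early.rate, late = late.rate), 3)
# early 0.708, late 0.752
```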
Although we focus on the players who steal many bases, there are other players involved with SB’s, namely the pitcher and the catcher. Here we briefly look at the pitcher effect. We break down stealing by the pitcher id
PIT_ID . We merge this data frame with the roster information so we can display first and last names, sort it by the number of attempts, and display the top 10 pitchers with respect to SB attempts.
stealing.pitcher <- summarize(group_by(stealing.first, PIT_ID),
                              SB=sum(RUN1_SB_FL==TRUE),
                              CS=sum(RUN1_CS_FL==TRUE),
                              Attempts=SB + CS,
                              Success.Rate=SB/Attempts)
stealing.pitcher <- merge(stealing.pitcher, roster2013,
                          by.x="PIT_ID", by.y="Player.ID")
stealing.pitcher <- arrange(stealing.pitcher, Attempts)
# A pitcher can appear on more than one roster; keep one row per pitcher
stealing.pitcher <- stealing.pitcher[!duplicated(stealing.pitcher$PIT_ID), ]
# nrow() returns a single row count; dim() would return rows and columns
N <- nrow(stealing.pitcher)
stealing.pitcher[(N - 9) : N,
                 c("First.Name", "Last.Name", "SB", "CS",
                   "Attempts", "Success.Rate")]
##     First.Name Last.Name SB CS Attempts Success.Rate
## 591      Ervin   Santana 13  8       21       0.6190
## 592         Yu   Darvish 15  7       22       0.6818
## 593      Felix Hernandez 17  5       22       0.7727
## 594        Tim  Lincecum 21  2       23       0.9130
## 595    Edinson   Volquez 21  2       23       0.9130
## 597     Justin Verlander 20  4       24       0.8333
## 598     Anibal   Sanchez 24  1       25       0.9600
## 599       Cole    Hamels 17  9       26       0.6538
## 600      Scott   Feldman 24  3       27       0.8889
## 602       John    Lackey 32  5       37       0.8649
John Lackey was by far the leader in SB attempts of 2nd base at 37, and runners were pretty successful against him with a rate of 86 percent. Scanning over the list, Anibal Sanchez was pretty poor at preventing SB’s (success rate of 96 percent), and Cole Hamels was pretty good at SB prevention (65 percent).
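To put these pitcher rates in context, one option is to compare them with the overall success rate on steals of 2nd base. The sketch below retypes the ten pitchers from the table and derives the overall rate from the totals in the outs table; the `Below.Overall` flag is my own label, not something from the original analysis.

```r
# Counts copied from the printed top-10 pitcher table
pitchers <- data.frame(
  Last.Name = c("Santana", "Darvish", "Hernandez", "Lincecum", "Volquez",
                "Verlander", "Sanchez", "Hamels", "Feldman", "Lackey"),
  SB = c(13, 15, 17, 21, 21, 20, 24, 17, 24, 32),
  CS = c( 8,  7,  5,  2,  2,  4,  1,  9,  3,  5),
  stringsAsFactors = FALSE
)

# Overall rate from the outs table: total SB over total attempts
overall.rate <- (559 + 785 + 976) / (803 + 1144 + 1267)
round(overall.rate, 3)
# 0.722

pitchers$Success.Rate  <- pitchers$SB / (pitchers$SB + pitchers$CS)
pitchers$Below.Overall <- pitchers$Success.Rate < overall.rate

# Pitchers against whom runners did worse than the overall rate
pitchers$Last.Name[pitchers$Below.Overall]
# "Santana" "Darvish" "Hamels"
```

With only 21 to 37 attempts per pitcher, these individual rates are noisy, so the comparison is suggestive rather than conclusive.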