Monthly Archives: April, 2017

Mookie Betts Streak — Working with Streak Data

Mookie’s Streak

There was a noteworthy baseball streak recently: Mookie Betts went 129 plate appearances without striking out. This motivated me to add some new functionality to my BayesTestStreak package. I’ll illustrate some of the functions here, focusing on no-strikeout streaks for all regular players in the 2016 season.

Installing the Package

You can install the current version of BayesTestStreak from GitHub:

library(devtools)
install_github("bayesball/BayesTestStreak")

Obtaining Some Streak Data

Some variables from the 2016 Retrosheet play-by-play dataset are included in the package as the data frame pbp2016. The streak_data function returns the streak data (a vector of successes and failures) for a specific player. You specify how a success is defined (either “H”, “HR”, “OB”, or “SO”) and whether you want all plate appearances or just official at-bats. Here I first use the find_id function to find the Retrosheet player code for Mookie.

library(BayesTestStreak)
find_id("Mookie Betts")
[1] "bettm001"
y <- streak_data("bettm001", pbp2016, "SO", AB=FALSE)

The variable y is just a vector of 0’s and 1’s corresponding to no-strikeouts and strikeouts. To check that I have the right data, I confirm that Mookie had 80 SO (and 650 non-strikeouts) for the 2016 season.

table(y)
y
  0   1 
650  80 

Some Graphs of This Data

This package provides several ways of graphing this data. The plot_streak_data function provides a simple line graph where the PA locations of strikeouts are indicated by vertical lines. We see a long white space at the end of the 2016 season corresponding to a run of non-strikeouts.

plot_streak_data(y)

To see short-term patterns in strikeout rates, one can use the mavg_plot function, which draws a moving average plot. The inputs are the streak data (vector of 0’s and 1’s) and the window length (here I use 50 PA). The areas in the plot show where the moving strikeout rate departs from Mookie’s overall 2016 strikeout rate. Here we see that Mookie actually had a rash of strikeouts early in the season, but had a strong non-strikeout streak at the end of the 2016 season.

mavg_plot(y, 50)
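
To make the idea of a moving average concrete, here is a minimal base-R sketch of the kind of calculation such a plot rests on. The helper moving_rate and the toy vector are my own illustrations; I am not claiming this is how mavg_plot is implemented inside the package.

```r
# Hypothetical helper (not part of BayesTestStreak):
# proportion of 1's in every window of `width` consecutive PAs.
moving_rate <- function(y, width) {
  stopifnot(width >= 1, width <= length(y))
  csum <- c(0, cumsum(y))                        # running count of 1's
  starts <- seq_len(length(y) - width + 1)       # first PA of each window
  (csum[starts + width] - csum[starts]) / width  # count in window / width
}

# Toy 0/1 sequence, made up for illustration (not Mookie's data).
toy <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 1)
rates <- moving_rate(toy, width = 5)
overall <- mean(toy)   # overall rate that the moving rates are compared against
```

Windows where rates sits above overall correspond to stretches with many strikeouts; windows below it correspond to stretches with few.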

Looking at Streaks

The media is interested in the lengths of the runs of non-strikeouts; I call these spacings. One can compute all of the spacings with the find.spacings function. We see that the gaps between consecutive strikeouts are 0, 4, 3, 0, 5, etc., ending with the long gap of 78 at the conclusion of the 2016 season. The I variable flags each spacing: a 1 means the run was ended by a strikeout, and the 0 for the last value indicates that the run of 78 had not ended when the season did. In fact, we know Mookie continued with 129 - 78 = 51 non-strikeouts at the beginning of the 2017 season.

  
find.spacings(y)
$y
 [1]  0  4  3  0  5  0  2  0  5  1  0 12  3  2  4  1  7  2 21  9
[21]  0 11 10  3  8 16  1 14 13  2  1  3 12 18  2 22  5 10  8  0
[41] 10 15  0  2  5  2  6 14  1  3  2 33  7  7  1 14  4  2 27 16
[61]  1  1  6  2  5 16 13 29  5  9  4  1 14 18  3  0  3  1 27  3
[81] 78

$I
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[31] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
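
For readers who want to see how spacings like these can be derived from a 0/1 vector, here is a small base-R sketch. The function spacings_sketch is hypothetical and only mimics the output shown above; in particular, how it treats a sequence that ends on a strikeout is my assumption, not something taken from the find.spacings source.

```r
# Hypothetical re-implementation for illustration; not the package's code.
spacings_sketch <- function(y) {
  so_pos <- which(y == 1)                  # PA positions of the strikeouts
  gaps <- diff(c(0, so_pos)) - 1           # non-strikeouts before each strikeout
  last_so <- if (length(so_pos) > 0) max(so_pos) else 0
  tail_run <- length(y) - last_so          # unfinished run at the end
  if (tail_run > 0) {
    # The final run has not been ended by a strikeout, so flag it with I = 0.
    list(y = c(gaps, tail_run), I = c(rep(1, length(gaps)), 0))
  } else {
    list(y = gaps, I = rep(1, length(gaps)))
  }
}

# Toy sequence, made up for illustration (not Mookie's data).
toy <- c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0)
sp <- spacings_sketch(toy)
sp$y   # 0 4 3 0 2
sp$I   # 1 1 1 1 0
```

A useful sanity check is that the spacings add up to the number of 0’s in the sequence. For Mookie’s data, the 81 values above sum to 650, which matches the count of non-strikeouts in the table.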

Looking at Longest Streak of Non-Strikeouts for All Players

Suppose I’m interested in looking at the longest run of non-strikeouts for all players in the 2016 season with at least 300 PA’s. Here is what I do in the R script below.

  1. I find the player codes for the players in the 2016 season with at least 300 PAs.
  2. Using the map function from the purrr package, I find the streak data for each player. This list of vectors is stored in the variable out.
  3. I write a function longest_ofer that computes the longest streak for a given player. Then using the map_dbl function, I apply this function to all the vectors in out.
  4. Similarly, by a second application of map_dbl I find the strikeout rate for all players.

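library(dplyr)
library(purrr)
# Step 1: player codes with at least 300 PAs in 2016
S300 <- pbp2016 %>%
  group_by(BAT_ID) %>%
  summarize(N = sum(BAT_EVENT_FL)) %>%
  filter(N >= 300) %>%
  select(BAT_ID)
# Step 2: streak data (0/1 vector) for each player
out <- map(S300$BAT_ID,
           streak_data, pbp2016, "SO", AB=FALSE)
# Step 3: longest run of non-strikeouts for each player
longest_ofer <- function(y){
  max(find.spacings(y)$y)
}
L_ofer <- map_dbl(out, longest_ofer)
# Step 4: strikeout rate = proportion of 1's
Rate <- map_dbl(out, mean)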

Last, I construct a scatterplot of the strikeout rate and longest non-strikeout streak for all players. (I am not showing the code here.) I label the players with longest streaks exceeding 50. We see a strong association between a player’s strikeout rate and his longest “ofer” in his SO/not-SO sequence: hitters who strike out less often tend to have longer runs without a strikeout.
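
An association of this kind is what one would expect even if strikeouts occurred completely at random. Here is a small simulation sketch to illustrate. Everything in it is made up for illustration: the 50 strikeout rates, the 600 PAs per hitter, and the helper longest_zero_run. None of it uses the 2016 data.

```r
set.seed(2016)

# Hypothetical helper: longest run of 0's in a 0/1 vector.
longest_zero_run <- function(y) {
  r <- rle(y)
  zero_runs <- r$lengths[r$values == 0]
  if (length(zero_runs) == 0) 0 else max(zero_runs)
}

n_pa <- 600                                  # assumed PAs per simulated hitter
rates <- seq(0.05, 0.35, length.out = 50)    # assumed true strikeout rates

# Each simulated hitter strikes out independently with his own rate.
sims <- lapply(rates, function(p) rbinom(n_pa, size = 1, prob = p))
longest <- vapply(sims, longest_zero_run, numeric(1))

# Rank correlation between strikeout rate and longest no-strikeout run.
rho <- cor(rates, longest, method = "spearman")
```

The correlation rho comes out strongly negative, so a long streak by a low-strikeout hitter is not by itself evidence of unusual streakiness.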

What Have We Learned?

So we see there were six players with non-strikeout streaks exceeding 50 for the 2016 season. What does it mean to have a long non-strikeout streak? It can mean several things. Obviously, all of these players are tough to strike out and they have a talent for making contact with the ball. But are these players particularly streaky in their pattern of strikeouts? That is, are their patterns of 0’s and 1’s unusual given their general strikeout ability and number of PA’s? We know Betts had the longest non-strikeout streak, but we don’t know if he was the streakiest of the 2016 players with respect to strikeouts.

Actually there is statistical evidence to suggest that the most remarkable streak among these six players belonged to Adam Eaton, not Mookie Betts. Betts had the longest non-strikeout streak, but Eaton’s pattern of strikeouts was the most unusual given his overall strikeout rate and number of PA’s. The statistical evidence is based on a permutation test that can be implemented using the permutation.test function in the BayesTestStreak package. I applied a permutation test in an earlier post on assessing situational hitting.
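
To give a feel for how a permutation test of streakiness works, here is a minimal base-R sketch. The functions clumpiness and perm_test_sketch are hypothetical, and the statistic (sum of squared lengths of the runs of 0’s) is just one plausible choice. I am not showing the arguments or the internals of permutation.test here, so treat this as an illustration of the idea rather than the package’s method.

```r
# Hypothetical statistic: sum of squared lengths of the runs of 0's.
# Large values mean the 0's are clumped into a few long runs.
clumpiness <- function(y) {
  r <- rle(y)
  sum(r$lengths[r$values == 0]^2)
}

# Hypothetical helper: shuffle the sequence many times, keeping the number
# of 0's and 1's fixed, and see how often a shuffle is at least as clumpy.
perm_test_sketch <- function(y, n_perm = 1000) {
  observed <- clumpiness(y)
  shuffled <- replicate(n_perm, clumpiness(sample(y)))
  list(observed = observed,
       p_value = (1 + sum(shuffled >= observed)) / (1 + n_perm))
}

set.seed(42)
# Toy sequence, made up: strikeouts bunched early, then a long quiet run.
toy <- c(rep(c(1, 0), 10), rep(0, 40))
res <- perm_test_sketch(toy)
res$observed   # 1690
res$p_value    # small: few shuffles are as clumpy as the toy sequence
```

Because the shuffles keep the player’s strikeout count and number of PAs fixed, a small p-value says the pattern is unusual given his strikeout rate, which is exactly the comparison that matters when ranking players by streakiness.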