Recently, the baseball world got excited about Dustin Pedroia’s streak of 11 consecutive base hits. That raises the obvious question: **Can Pedroia be characterized as a streaky hitter?** Of course, he was streaky in this particular hitting accomplishment, but unless we look at a larger collection of Pedroia hitting data, we really don’t know if he tended to be streaky during his career.

#### Obtain the Data

It was straightforward to obtain Pedroia’s sequence of hit/out data from the Retrosheet play-by-play data for the 2007 through 2015 seasons. Also, using the `openWAR` package, I was able to get the play-by-play data for the 2016 season; my work here includes the games through the end of August 2016.

#### What Do I Mean by “Streaky”?

All sequences of 0/1 data will exhibit patterns of streakiness. If you flip a coin long enough, you’ll see some interesting patterns of streaks of heads or tails, but the coin is not inherently streaky. What I mean by “streaky” is that the pattern of streaks and slumps exceeds what you would predict if the data were just flips of a biased coin. (The fancy term for coin flipping is independent Bernoulli trials.) If the data resemble what you would anticipate from a Bernoulli process, then the player is not exhibiting unusual streaky hitting behavior.
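
To see that coin flipping alone produces streaks, here is a small simulation sketch. The success probability of 0.3 and the 600 trials are illustrative values I picked for the example; they are not Pedroia’s numbers.

```r
# Simulate independent Bernoulli trials (1 = hit, 0 = out).
# p and n are illustrative assumptions, not values from the data.
set.seed(2016)
p <- 0.3
n <- 600
x <- rbinom(n, size = 1, prob = p)

# rle() collapses the sequence into runs of identical values.
runs <- rle(x)
longest_hit_streak <- max(runs$lengths[runs$values == 1])
longest_slump      <- max(runs$lengths[runs$values == 0])

cat("Longest run of hits:", longest_hit_streak, "\n")
cat("Longest run of outs:", longest_slump, "\n")
```

Rerunning with different seeds gives a feel for how long the streaks and slumps of a perfectly "consistent" hitter can be.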

#### One Method of Assessing Streakiness – A Geometric Plot

In my work, I have used two simple methods to assess streakiness. One method is based on the lengths of the spacings, that is, the number of outs between consecutive hits. A spacing of 0 means two hits in a row. If the player has a large number of 0’s or some unusually large spacings, then that might indicate some streakiness. Here are the spacings for Pedroia’s sequence of 0’s and 1’s for the 2016 season (through August).

      [1]  4  1  5  1  1  0  4  0  3  0  4  4  2  4  4  4  0  1  2  2  1  0  2
     [24]  0  2  6  1  0  0  2  0  3  8  0  0  3  6  0  0  6  4  5  2  1  2  5
     [47]  4  3  2  3  1  5  0  1  2  1  3  0  1  4  2  1  2  2  1  2  0  2  0
     [70]  5  0  4  3  2  1  1  0  9  1  0  3  1 13  4  1  1  0  3  0  4  4  2
     [93]  3  3  3  5  1  2  3  0  0  0  2  4  4  1  5  4 10  0  0  0  0  3  1
    [116]  2  2  0  2  5  1  3  4  1  5  9  0  7  0  2  1  2  4  1  1  0  0  0
    [139]  0  1  1  0  3  5  4  0  1  5  2  3  0  3  0  0  0  0  0  0  0  0  0
    [162]  0  1  3  0  0  1
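
For readers who want to see how spacings come out of a 0/1 sequence, here is a minimal sketch. The helper `spacings()` is a hypothetical function written for this illustration; it is not the `find.spacings` function from the package, which may treat the ends of the sequence differently. I assume here that the outs before the first hit and after the last hit each count as a spacing.

```r
# Hypothetical helper (not TestStreak::find.spacings).
# Returns the number of 0's before the first 1, between each pair of
# consecutive 1's, and after the last 1 (an assumption about the ends).
spacings <- function(x) {
  hits <- which(x == 1)
  diff(c(0, hits, length(x) + 1)) - 1
}

# Toy sequence: 1 = hit, 0 = out
x <- c(0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0)
s <- spacings(x)
print(s)
```

With this convention the spacings always add up to the total number of outs, which is a handy sanity check.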

If we are coin-flipping with a constant success probability, then the spacings have a geometric distribution. One can construct a graph to see how well the spacings match up to a geometric — if the points fall close to a line, then the player is not showing streakiness. On the other hand, if there is a systematic nonlinear pattern in the graph, then there is some streakiness going on.
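
Why a line? If the success probability is a constant p, a spacing Y has P(Y = y) = p(1 − p)^y, so log P(Y = y) = log p + y log(1 − p), which is linear in y. The sketch below checks this on simulated data. It is one simple way to build such a plot and is an assumption on my part; the package’s `geometric.plot` function may construct its graph differently. The values of p, the sample size, and the cutoff of 20 are illustrative.

```r
# Simulated geometric spacings; p and the sample size are illustrative.
set.seed(1)
p <- 0.3
y <- rgeom(5000, prob = p)   # number of failures before each success

# Log of the observed proportion at each spacing value
tab    <- table(y)
counts <- as.numeric(tab)
d <- data.frame(spacing  = as.numeric(names(tab)),
                log_prop = log(counts / length(y)))

# Drop sparse cells, whose log proportions are very noisy
d <- d[counts >= 20, ]

# Under the geometric model the points follow a line
# with slope log(1 - p)
fit <- lm(log_prop ~ spacing, data = d)
slope <- unname(coef(fit)["spacing"])
cat("Fitted slope:", round(slope, 3),
    " Theoretical slope:", round(log(1 - p), 3), "\n")
```

Calling `plot(d$spacing, d$log_prop)` followed by `abline(fit)` draws the picture; systematic curvature away from the line is the signal of streakiness.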

I construct this geometric plot for each of the seasons from 2007 through 2016. Streakiness corresponds to red smoothing curves that are high for small and large spacing values and low for middle spacing values.

Looking at these plots, Pedroia looks a bit streaky for the seasons 2008, 2009, and 2010. For other seasons, the spacings look geometric (not unusually streaky).

#### A Permutation Test

A more formal way to test for streakiness is by means of a permutation test. If a player is truly consistent, then all of the possible permutations of the sequence of 0’s and 1’s are equally likely. Suppose we measure streakiness by the sum of squares of the spacings (other measures can be used, but there is support for this measure). The method randomly mixes up the 0’s and 1’s, computes the value of the streaky measure, and repeats this process 1000 times. We compute a p-value, the probability that the simulated measure is at least as large as the observed value of the measure (based on our data). A small p-value indicates some evidence for streakiness.
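
The steps above can be sketched in a few lines of R. The functions `spacings()`, `streak_stat()`, and `perm_pvalue()` are hypothetical helpers written for this illustration, not the package’s `permutation.test`; in particular, counting the outs before the first hit and after the last hit as spacings is my assumption.

```r
# Hypothetical helpers, not the TestStreak implementation.
# Spacings: 0's before the first 1, between 1's, and after the last 1.
spacings <- function(x) {
  diff(c(0, which(x == 1), length(x) + 1)) - 1
}

# Streaky measure: sum of squared spacings
streak_stat <- function(x) sum(spacings(x)^2)

# Permutation p-value: share of shuffled sequences whose measure
# is at least as large as the observed one
perm_pvalue <- function(x, n_sim = 1000) {
  observed  <- streak_stat(x)
  simulated <- replicate(n_sim, streak_stat(sample(x)))
  mean(simulated >= observed)
}

set.seed(42)
# Toy data: all 30 hits bunched together (very streaky) ...
clumped <- rep(c(0, 1, 0), times = c(35, 30, 35))
# ... versus a hit every third at-bat (as even as possible)
even <- rep(c(0, 0, 1), times = 33)

p_clumped <- perm_pvalue(clumped)
p_even    <- perm_pvalue(even)
cat("Clumped p-value:", p_clumped, " Even p-value:", p_even, "\n")
```

The bunched sequence should give a p-value near zero and the evenly spaced one a p-value near one, which matches the reading of the test: small p-values point toward streakiness.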

I perform this streaky test for each of the 10 seasons and graph the p-values.

Note that the p-values are close to zero for seasons 2008, 2009, and 2010, confirming that Pedroia was unusually streaky in these seasons, that is, unusual relative to the independent Bernoulli trials assumption.

#### Wrap-Up

Okay, Pedroia did get hits in 11 consecutive at-bats, but looking at the pattern of hitting for the 2016 season (through August), there is not much evidence to suggest that he was unusually streaky this season. But looking at all seasons, Pedroia’s pattern of hitting looks streaky for three early seasons in his career. I wonder if younger players tend to be streakier. When I looked at streaky patterns in Mike Schmidt’s home run hitting, Schmidt also tended to be streakier in his early seasons.

#### R Work

All of the R code is given below; it will replicate the graphs produced in this post. (I use `ggplot2` for the plots and `gridExtra` to arrange the plots in a convenient way.) The package `TestStreak` contains the functions `find.spacings`, `geometric.plot`, and `permutation.test`, together with the dataset `dustin_streak`, which contains the hit/out data for Pedroia for all seasons of his career.

    library(devtools)
    install_github("bayesball/TestStreak")
    library(TestStreak)
    library(ggplot2)     # for the ggplot() call at the end
    library(gridExtra)   # for grid.arrange()

    # Geometric plot of the spacings for a single season
    plot.season <- function(season){
      S <- find.spacings(subset(dustin_streak, Season == season)$Outcome)
      geometric.plot(S$y, paste(season, "Season"))
    }

    # One plot per season, arranged in a grid with three columns
    P <- lapply(2007:2016, plot.season)
    grid.arrange(P[[1]], P[[2]], P[[3]], P[[4]], P[[5]],
                 P[[6]], P[[7]], P[[8]], P[[9]], P[[10]], ncol = 3)

    # Permutation test p-value for a single season
    pvalue.season <- function(season){
      permutation.test(subset(dustin_streak, Season == season)$Outcome)
    }

    # Collect the p-values for all seasons and graph them
    pv <- data.frame(Season = 2007:2016,
                     P_Value = sapply(2007:2016, pvalue.season))
    ggplot(pv, aes(Season, P_Value)) +
      geom_point(size = 4, color = "red") +
      ggtitle("P-Values in Permutation Test")