### Miller and Sanjurjo’s Paper

I happened to read a recent ESPN article that gives a layman’s description of Miller and Sanjurjo’s recent paper, which identifies a bias in the way hot-hand data are analyzed. This paper has received a lot of attention for an obvious reason: it affects many of the hot-hand studies in sports. Here I will describe this bias in a simple setting, show how one can use an R simulation to compute the bias in more complicated settings, and then explain the implications of this bias when looking for streaky patterns in strikeout data.

### A Simple Setting

Suppose you flip a fair coin four times. You observe two heads in a row — what is the chance that the next flip is a head? One might be quick to say “1/2” — but this is incorrect. The fact that you have observed two heads in a row has changed the possible outcomes. If we have four flips then there are 16 possible outcomes shown below:

HHHH, HTHH, THHH, TTHH

HHHT, HTHT, THHT, TTHT

HHTH, HTTH, THTH, TTTH

HHTT, HTTT, THTT, TTTT

But saying that we have observed two H’s in a row and have the opportunity to observe the next flip indicates that only six outcomes are possible.

HHHH, THHH

HHHT, THHT

HHTH

HHTT

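For each of these six outcomes, we find the proportion of heads among the flips that immediately follow two heads in a row. For example, HHHT contains HH twice (flips 1-2 and flips 2-3); the flips that follow are H and T, so the proportion is 1/2. Since each of the six outcomes has probability 1/6, we find that the probability of H on the next flip is really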

P(H on the next flip) = (1/6) (1 + 1 + 1/2 + 0 + 0 + 0) = 2.5 / 6 = 0.417.

which is smaller than the naive answer of 0.5.
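The exact calculation above can be checked by brute force. The following is a minimal sketch, not part of the original analysis: it lists all 16 sequences and averages the proportion of heads following HH over the sequences where that proportion is defined. The helper name `prop_after_streak` is my own choice for illustration.

```r
# Enumerate all 2^4 = 16 sequences of four flips (1 = H, 0 = T).
outcomes <- as.matrix(expand.grid(rep(list(0:1), 4)))

# For one sequence, find the proportion of heads among the flips
# that immediately follow a run of `slength` heads.
# (Hypothetical helper, named here only for illustration.)
prop_after_streak <- function(flips, slength = 2) {
  n <- length(flips)
  starts <- 1:(n - slength)
  is_streak <- sapply(starts, function(k) all(flips[k:(k + slength - 1)] == 1))
  next_flip <- flips[starts[is_streak] + slength]
  # NA when the sequence has no streak with a following flip
  if (length(next_flip) == 0) NA else mean(next_flip)
}

props <- apply(outcomes, 1, prop_after_streak)

sum(!is.na(props))         # 6 sequences have a flip following HH
mean(props, na.rm = TRUE)  # 2.5 / 6 = 0.4166667
```

Averaging within each sequence first, and then across sequences, is what produces the bias: a sequence such as HHHT contributes 1/2 with the same weight as HHTT contributes 0, even though it contains two streaks.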

### Simulating this Process in R

Here is a simple R function that simulates one replication of this process. One flips `n` coins where the probability of a success is `p`. One looks for a streak of H’s of length `slength` and the function outputs the mean proportion of H in the following flip.

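    one_sim_new <- function(n = 20, p = 0.5, slength = 3){
      flips <- rbinom(n, size = 1, prob = p)
      # TRUE if a streak of slength H's starts at flip k
      detect <- function(k){
        sum(flips[k:(k + slength - 1)]) == slength
      }
      # position of the last flip of each streak that has a following flip
      locations <- which(sapply(1:(n - slength), detect)) + (slength - 1)
      y <- c(flips[locations + 1])
      # proportion of H's after a streak (NaN if there is no streak)
      mean(y, na.rm = TRUE)
    }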

Here I repeat this simulation 10,000 times for our example where `n` = 4, `p` = 0.5, and `slength` = 2, and I obtain an answer by simulation that agrees closely with the exact calculation of 0.417.

    mean(replicate(10000, one_sim_new(n = 4, p = 0.5, slength = 2)), na.rm = TRUE)
    [1] 0.4157109

### Size of the Bias

Assuming that the probability of success `p` is constant, the size of this bias will depend on the length of the sequence `n` and on how many H’s in a row we observe, `slength`. Below I illustrate the results of one simulation study: we fix `p` = 0.3 and see how the bias (the deviation from the probability of success 0.3) varies as the length of the sequence `n` goes from 50 to 500, and the number of consecutive heads `slength` is 1, 2, and 5. We see from the figure that

- the bias is greatest when one looks at the flip following 5 consecutive heads
- for a fixed value of `slength`, the bias decreases as `n` increases
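A study like this one could be sketched as follows. This is my own reconstruction under stated assumptions, not the code behind the figure: the grid of `n` values (50, 100, 200, 500), the 2000 replications per cell, the seed, and the helper name `prop_after_streak_sim` are all hypothetical. To keep it fast, the helper finds streaks with a rolling sum from `stats::filter` instead of the `sapply` loop used earlier.

```r
set.seed(2016)  # arbitrary seed, for reproducibility only

# One replication: proportion of H's on the flips that follow a streak
# of slength H's. (Hypothetical helper; vectorized streak detection.)
prop_after_streak_sim <- function(n, p, slength) {
  flips <- rbinom(n, size = 1, prob = p)
  # Rolling sum of the current flip and the (slength - 1) flips before it
  roll <- as.numeric(stats::filter(flips, rep(1, slength), sides = 1))
  # Positions where a streak ends and a following flip exists
  ends <- which(roll[-n] == slength)
  mean(flips[ends + 1])  # NaN when no streak occurs
}

p <- 0.3
# Assumed grid: the text only says n runs from 50 to 500
grid <- expand.grid(n = c(50, 100, 200, 500), slength = c(1, 2, 5))

grid$bias <- mapply(function(n, slength) {
  sims <- replicate(2000, prop_after_streak_sim(n, p, slength))
  mean(sims, na.rm = TRUE) - p  # deviation from the true probability
}, grid$n, grid$slength)

grid
```

Each row of `grid` gives the estimated bias for one combination of `n` and `slength`; plotting `bias` against `n` with one line per `slength` would give a display similar in spirit to the figure described above.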

### Streakiness in Strikeouts

Here is an illustration of how this bias impacts streakiness studies in baseball. Let’s focus on streakiness in striking out. I collected all of the players with at least 500 PA in the 2016 season. Among these, I looked at the players who had at least 10 streaks of 3 consecutive strikeouts during the season, and for each player I computed the fraction of strikeouts on the PA following the 3 consecutive K’s. Here is a scatterplot of each player’s overall K rate against the fraction of K’s on the PA following a streak of three K’s. I overlay the line y = x. Of the 27 players, 20 (74%) had a larger K rate on the following PA than their overall K rate.

But there is a bias in this comparison. I was contrasting the fraction of K’s following a streak with the overall K rate, but that is the wrong comparison. If the K’s are generated randomly, then the probability of a K following a streak of three K’s will be smaller than the player’s K rate due to this bias effect.

So I added the wrong line to the graph. Below I add a new red line that finds the probability of a K on the next PA (by simulation) for random data using the overall K rate as the probability of success.

So actually there is **more evidence** that the player’s K rate is “large” following a streak of three K’s. It should be clear from this example how this bias for random data will affect one’s inference about streakiness.
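The reference values behind a line like the red one can be approximated by simulation. The sketch below is illustrative only and rests on assumptions that are not in the analysis above: a season of 600 PA, a grid of K rates from 0.15 to 0.35, 5000 simulated seasons per rate, and no requirement that a simulated season contain at least 10 streaks. The helper name `k_after_streak` is hypothetical.

```r
set.seed(27)  # arbitrary seed, for reproducibility only

# Fraction of K's on the PA right after three straight K's,
# for one simulated season of independent PAs.
# (Hypothetical helper, named here only for illustration.)
k_after_streak <- function(n_pa, k_rate, slength = 3) {
  ks <- rbinom(n_pa, size = 1, prob = k_rate)
  roll <- as.numeric(stats::filter(ks, rep(1, slength), sides = 1))
  ends <- which(roll[-n_pa] == slength)  # streaks with a following PA
  mean(ks[ends + 1])                     # NaN when there is no streak
}

k_rates <- seq(0.15, 0.35, by = 0.05)  # hypothetical K rates
n_pa <- 600                            # hypothetical season length

# Average fraction of K's after a streak, for random data at each K rate
expected_next_k <- sapply(k_rates, function(r) {
  mean(replicate(5000, k_after_streak(n_pa, r)), na.rm = TRUE)
})

data.frame(k_rate = k_rates, expected_next_k = expected_next_k)
```

Under these assumptions the values in `expected_next_k` fall below the corresponding `k_rate`, which is why a player whose observed fraction sits on the line y = x is already striking out more often after a streak than random data would suggest.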

### General Comments

It would be interesting to revisit some of the earlier streakiness studies in baseball and see how this bias impacts the conclusions of those studies. Generally, this is another illustration of how our beliefs about true randomness are off, which makes it harder to make sense of the streakiness that we observe in sports data. It would be interesting to expand the above example over many seasons using different definitions of success such as hit, on-base, home run, etc.