NOTE: This post is different from the usual posts illustrating baseball analyses using R. FiveThirtyEight just posted an article, coauthored by Rob Arthur and Greg Matthews, titled “Baseball Hot Hand is Real”. As you will see below, I was disappointed in the article, and I explain why. Since I know Greg Matthews, I invited him and Rob to respond to my criticism, and they did respond. This is really a bonus, since it provides more details about the statistical work behind the article.
Jim’s Review of the “Hot Hand is Real” Article
Although I generally have a high opinion of articles on the FiveThirtyEight site and of these particular authors, this article misses the mark. Given the common belief in hot and cold streaks among the baseball community of fans, players, and coaches, I am disappointed with flawed hot-hand statistical studies that just feed into this belief. This is a topic that I’ve worked on for many years, so I’ll share some history of my work.
Here is a summary of the key points in Arthur and Matthews’s hot-hand analysis.
- They consider the velocities of the sequence of fastballs pitched by a particular starting pitcher over a season.
- They use a statistical model called a Hidden Markov Model on the time series of velocities. Basically, a Hidden Markov Model assumes that a pitcher, at each pitch, is either in a hot state (high average velocity) or a cold state (low average velocity), and the pitcher moves between states according to a Markov chain with particular transition probabilities. (If one believes in a hot hand, then given that the pitcher is currently in a hot state, he is likely to remain in the hot state for the next pitch. Likewise, given that the pitcher is currently “cold”, he is likely to remain “cold” for the next pitch.)
- Arthur and Matthews fit this model to a number of starting pitchers and focus on the difference in average pitch velocities between the hot and cold states. They find “significant” differences in these average pitch velocities for these pitchers, and so they conclude that pitchers really go through hot and cold streaks.
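To make the model concrete, here is a minimal simulation sketch of such a two-state chain (in Python rather than the R this blog usually uses). All of the numbers below, the state means, the noise, and the transition probabilities, are invented for illustration and are not the article's fitted values.

```python
import numpy as np

# Sketch of the two-state ("hot"/"cold") HMM described above.
# Hypothetical parameters, not the fitted values from the article.
rng = np.random.default_rng(0)

MU = {"hot": 94.0, "cold": 92.0}   # assumed mean fastball velocity (mph) per state
SIGMA = 1.0                         # within-state pitch-to-pitch noise
P_STAY = 0.9                        # chance of remaining in the current state

def simulate_pitches(n, rng):
    """Generate n fastball velocities from a two-state hidden Markov chain."""
    states, velocities = [], []
    state = "hot" if rng.random() < 0.5 else "cold"
    for _ in range(n):
        states.append(state)
        velocities.append(rng.normal(MU[state], SIGMA))
        if rng.random() > P_STAY:                 # occasionally switch state
            state = "cold" if state == "hot" else "hot"
    return states, np.array(velocities)

states, v = simulate_pitches(2000, rng)
hot = np.array([s == "hot" for s in states])
print(round(v[hot].mean(), 2), round(v[~hot].mean(), 2))
```

Because P_STAY is well above one half, the simulated pitcher stays "hot" or "cold" for stretches of pitches, which is exactly the kind of persistence the authors' fitted transition matrices are meant to capture.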
What’s wrong with this article? I could say quite a bit, but let me focus on some major issues.
What is the Question?
The title of the article (the Hot Hand is Real) suggests that pitchers go through hot and cold periods. So I would expect the authors to start with some general measure of a pitcher’s effectiveness and work with that measure in their hot and cold study. But no: the authors focus on the speed of the pitcher’s most common fastball. Huh? I understand that a pitcher’s performance is confounded with the performance of the team’s defense, but there are many summary measures of performance, such as ERA, WHIP, and FIP, that could be used. Is the speed of the fastball associated with overall measures of performance? I understand the authors’ rationale for using fastball speed, since it is a measure that is completely controlled by the pitcher. But it is a great leap to say that slow and fast fastballs are somehow connected with cold and hot summary performance of the pitcher. Had the authors actually stated the question they were addressing at the beginning, there would likely have been less interest in the study.
The Hidden Markov Model (a little history)
The authors give the impression that the Hidden Markov Model (HMM) is a new method; in the authors’ words, “it can tell if hot streaks are real … and tell whether a pitcher is hot or cold at any moment.” First, the HMM has been around a long time. In fact, I used it to measure streakiness in baseball hitting data in a 1993 comment (24 years ago) published in the Journal of the American Statistical Association (JASA). Christian Albright had written a JASA paper searching for streakiness in baseball hitting, and I was one of the discussants of the paper. At the time, I was interested in the HMM for several reasons. First, there were new simulation-based methods (specifically Gibbs sampling) for fitting these HMMs, and I thought it was a fun application of these new fitting methods. Second, I wanted to make the point that one by-product of this HMM is that it allows one to actually measure (by looking at the difference in the hot and cold probabilities) the size of a player’s streakiness. Also, by fitting this model, one can compute the probability a player is hot over the season; here is a graph from my paper of the probability of being in the hot hitting state for Carney Lansford in 1988. (In those days I was using MATLAB for my graphs.)
Predictive Performance of an HMM?
I think the HMM is attractive in that it is relatively simple to describe and feeds into the baseball fan’s perception of hot and cold abilities of ballplayers. In fact, I illustrated what I called “true” streakiness in baseball hitting in Chapter 5 of Curve Ball by using the HMM. But Arthur and Matthews give no indication that this HMM model is reasonable for representing pitcher fastball velocities. That is, the article says little about predictive performance of their fitted model.
If you look at, say, the pitch velocities of Cole Hamels’ two-seam fastballs over a season, you’ll see a bit of variation. Looking more carefully, not only is there variation in pitch speeds within a game, but there are notable differences in speeds between games. If one uses a random effects model, one can estimate both sources of variation. The point is that one can use a variety of models to try to interpret this pattern of pitch speeds, and for some reason the authors chose the two-state HMM. Personally, I believe that a pitcher goes through a number of states (more than two) that range from very cold to very hot. I could fit a model that assumes, for example, that Hamels has seven states, and get different measures of streaky ability. Without implementing some reasonable model selection or out-of-sample prediction procedure, it would be difficult to say whether my fitted model is better or worse than Arthur and Matthews’ simple two-state model. Actually, even if the pitcher were truly consistent and the mean fastball velocity did not change across games, fitting the HMM would still find differences between the mean velocities of the cold and hot states, and of course these differences would not be meaningful. In my recent hot-hand writing (for example, see this Chance article), I have focused much more on predictive performance than on model fitting, since I thought that was the main issue.
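The last point, that a two-state fit manufactures a hot/cold gap even for a perfectly consistent pitcher, is easy to demonstrate. In the sketch below a simple two-means clustering stands in for the HMM's two state means, and all numbers are invented: the pitcher has one fixed true mean and pure noise, yet the fit still recovers two "states" with clearly different mean velocities.

```python
import numpy as np

# A perfectly consistent (hypothetical) pitcher: one mean, pure noise,
# no hot/cold structure whatsoever.
rng = np.random.default_rng(1)
v = rng.normal(93.0, 1.0, size=5000)

def two_means(x, iters=50):
    """Lloyd's algorithm with k=2 on one-dimensional data."""
    centers = np.array([x.min(), x.max()])
    for _ in range(iters):
        # assign each point to its nearest center, then recompute the centers
        assign = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([x[assign == k].mean() for k in (0, 1)])
    return np.sort(centers)

cold_mean, hot_mean = two_means(v)
print(round(hot_mean - cold_mean, 2))   # a sizable gap, despite no real states
```

For standard-normal noise the two recovered means sit roughly 1.6 standard deviations apart, so a naive reading of "hot minus cold velocity" would report a gap of about 1.6 mph for a pitcher who has no streaks at all. This is why a comparison against permuted or simulated null data is essential.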
Can the HMM be a reasonable model?
From a baseball perspective, I don’t believe the two-state HMM is reasonable for representing the variation in fastball velocities. But there might be some situations where the HMM makes more sense. For example, suppose a pitcher is experimenting with two distinct mechanics and one wishes to assess, based on fastball velocities, which of the two forms he is using. Or maybe the pitcher experiences an injury which affects his fastball velocity. This might be better called a changepoint analysis. A few days ago (August 10), the Phillies coaches detected that Vince Velasquez had some injury issue in the first inning when they observed a sudden drop in his fastball velocity (they probably did not use a statistical study to make that assessment).
The main problem is that readers will focus on the title of this article, gloss over the statistical work, and conclude that these authors really have found evidence for the hot hand in pitching. We all know that pitchers have good games and bad games, but the challenge is to find evidence of streakiness using good measures of performance (not just fastball velocity) that reflect solely on the pitcher. When a pitcher has a bad game, the manager often laments the poor location of some of the pitches. Okay, this would motivate the consideration of some measure involving pitch location, and then one could look for streaky patterns in a pitcher’s location performance. At the very least, this would be stating a reasonable question and trying to address it with a statistical study. In contrast, it seems that these authors decided on a title for an article and then proceeded with an analysis that is disconnected from the question the title implies. We (that is, the statistical community) should do better.
Response from Greg Matthews
I first want to say thank you for giving us an opportunity to respond to your article before it is published. Secondly, we appreciate your feedback, even if we disagree in some cases. In the following, we try to summarize and respond to your critique point by point.
We took from your article a few main lines of criticism:
1) You note that our model works with fastball velocity, which is not a direct measure of pitcher performance. That’s true. We chose to analyze fastball velocity in large part because it is more consistent than standard measurements of performance like ERA and FIP.
However, nearly every study we’ve seen in sabermetrics (and there have been many) has found that fastball velocity correlates extremely well with a number of measures of pitcher success. We know that, physiologically, the faster a pitch comes in, the less time a batter has to swing, degrading their performance. We know that when pitchers gain velocity, they tend to perform better and vice versa.
Therefore, to argue that increased fastball velocity doesn’t improve a pitcher’s ability when they are on a hot streak, you would need to contradict a well-established body of research in sabermetrics and human physiology. We see no reason to suspect that the increased velocity pitchers manifest when “hot” doesn’t provide a performance boost.
Furthermore, we didn’t just take it for granted that pitchers would improve. We showed that performance actually does get better when pitchers are hot. We looked at the rate of swinging strikes, the batting average against, and the rate of extra-base hits against a pitcher when they are hot. We found that “hot” pitchers show better outcomes in each of these measures, even accounting for their additional velocity.
2) You note that we implied the Hidden Markov Model (HMM) is a new method. If we gave that impression, we apologize. The HMM is a well-established method. However, to the best of our knowledge, the combination of the data we used (PITCHf/x-derived fastball velocities) and the way we modeled it (a “mixed” effects HMM that pooled information across pitchers) is new.
Due to limitations on the length and complexity of the article, we weren’t able to fully specify the model. And it’s important to remember that we were writing for a general audience, most of whom would not be familiar with a HMM. These factors constrained our ability to describe what was new about the model, and might have given you the mistaken impression that we were claiming that HMMs, in general, are new. You can view the full code for the model that we ran here: https://github.com/gjm112/HMMbaseball/blob/master/TwoStateModel_Pitchers_2016.R
3) You wrote that we give little evidence of the predictive performance of our model. We respectfully disagree. In fact, we measured predictive performance in a couple of different ways.
In one section, we ran our model on the first two months of data for the 2016 season. We then set up an HMM based on the parameters estimated from those two months. We then ran the Viterbi algorithm on the sequence of n pitches; our prediction for the state at pitch n+1 was simply the state at pitch n. This was repeated for each pitch (i.e., Viterbi on n pitches, predict n+1) for the rest of the season. Going forward from June 1, we found that our model was consistently more accurate at producing velocity estimates than the season-long average, and that the median of the distribution of relative velocities was higher when the prediction was “hot” than when it was “cold”.
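[Editor's sketch.] The one-step-ahead scheme described above can be illustrated as follows: decode the first n pitches with Viterbi, then predict the mean velocity of pitch n+1 from the decoded state of pitch n. The HMM parameters and the simulated pitcher below are invented for the sketch; they are not the article's fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed (made-up) two-state Gaussian HMM parameters
MU = np.array([92.0, 94.0])          # cold, hot state means (mph)
SIGMA = 0.8                          # within-state noise
TRANS = np.array([[0.95, 0.05],      # hypothetical sticky transition matrix
                  [0.05, 0.95]])
PI = np.array([0.5, 0.5])            # initial state distribution

def log_norm_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def viterbi_last_state(v):
    """Most likely hidden state of the final pitch, given velocities v."""
    logp = np.log(PI) + log_norm_pdf(v[0], MU, SIGMA)
    for x in v[1:]:
        # best log-probability of any path ending in each state
        logp = (logp[:, None] + np.log(TRANS)).max(axis=0) + log_norm_pdf(x, MU, SIGMA)
    return int(logp.argmax())

# simulate a streaky pitcher from the same HMM
states = [0]
for _ in range(399):
    states.append(int(rng.random() < TRANS[states[-1], 1]))
v = rng.normal(MU[np.array(states)], SIGMA)

# predict pitch n from pitches 0..n-1: Viterbi on the prefix, carry the state forward
preds = np.array([MU[viterbi_last_state(v[:n])] for n in range(50, 400)])
actual = v[50:400]
naive = np.array([v[:n].mean() for n in range(50, 400)])  # running-average baseline

model_mse = float(np.mean((actual - preds) ** 2))
naive_mse = float(np.mean((actual - naive) ** 2))
print(round(model_mse, 2), round(naive_mse, 2))
```

When the data really do come from a sticky two-state chain, carrying the decoded state forward beats the overall-average baseline, which is the comparison described above (here against a running average rather than a season-long one).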
Secondly, as noted above, we found that “hotness” predicted performance (in terms of swinging strike rate, BA against, and XBH rate, among others). This analysis lends further support to the idea that our model is producing useful predictions.
4) You note that we provided no justification for using a two-state HMM, as opposed to another number of states. That’s true, and largely because of the limitations on length and complexity mentioned above.
That said, in previous analyses, we also fit a three-state HMM. The three-state model produced mean state estimates (in terms of fastball velocity added/subtracted) that were not substantially different from what you would expect from the difference of state means in a two-state permutation-based fit of the model (more on this below). As a result, we believed three-state models (and above) to be overfitting, and stuck with a two-state model. It may be that three-state HMMs fit some pitchers and not others. In the future, we may consider an extension of the model to deal with this possibility (i.e., some pitchers are two-state and others are allowed three states). But we were satisfied that the two-state model was a significant advance, and focused on those results for our initial article.
5) Another of your criticisms relates to the lack of justification for selecting the two-state model. You write (correctly) that a two-state HMM would find state differences, and therefore streaks, regardless of whether they were truly there. To guard against this, we randomly permuted the sequence of each pitcher’s fastballs and re-fit the HMM on the permuted data. We found that the permuted data generated differences between hot and cold states that were much smaller than those we found in the real data. In addition, the transition matrices for the permuted sequences were close to 0.5 in each cell, meaning that states flipped randomly. In the real data, our transition matrices tended to feature fewer transitions between states and much higher probabilities of remaining within a state, resulting in much longer streaks.
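[Editor's sketch.] The permutation logic above can be illustrated without refitting an HMM on every shuffle. As a lightweight stand-in for the HMM's streakiness measure, the sketch below uses the lag-1 autocorrelation of velocities: genuinely streaky sequences show persistence that shuffling destroys. All data are simulated with made-up numbers; none of this uses the article's actual fits.

```python
import numpy as np

rng = np.random.default_rng(3)

# simulate a streaky (hypothetical) pitcher: sticky hot/cold states,
# different mean speeds per state
state, states = 0, []
for _ in range(500):
    states.append(state)
    if rng.random() > 0.95:          # occasionally flip state
        state = 1 - state
v = rng.normal(np.where(np.array(states) == 1, 94.0, 92.5), 0.6)

def lag1_autocorr(x):
    """Streakiness statistic: correlation of each pitch with the previous one."""
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

observed = lag1_autocorr(v)

# permutation null: shuffle away any ordering, recompute the statistic
null = [lag1_autocorr(rng.permutation(v)) for _ in range(1000)]
p_value = float(np.mean([s >= observed for s in null]))
print(round(observed, 2), p_value)
```

The observed statistic sits far above the permutation distribution, mirroring the comparison described above: the real (here, simulated-streaky) ordering carries structure that vanishes once the pitch sequence is shuffled.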
6) Finally, we are disappointed in your summary of our article. We deeply respect your work both within and outside of sabermetrics, but we strongly disagree with the assertion that we structured our analysis around a headline. In fact, we killed one section of the article (on hitter performance) after we concluded that there wasn’t enough evidence.
With that said, we stand by our conclusions and note that many of the criticisms you made were addressed, in some form, within the article. Others arose from misunderstandings driven by the brevity of the article and the fact that it had to be addressed to a general audience. We intend to more fully describe the methodology and implement additional predictive checks and validation steps in an academic article we are currently preparing. In writing for a broad audience, we found it challenging to specify the model in a way that wouldn’t distract or discourage our readers. Necessarily, we struck a balance between statistical rigor and accessibility, and perhaps you believe we went too far towards accessibility.
Respectfully, we disagree. For statistical analysis to reach the general public, we have to learn to communicate in an approachable fashion to people without extensive quantitative or mathematical backgrounds. That shouldn’t (and doesn’t need to) come at the cost of accuracy: Statisticians can communicate with each other at the same level of detail as before (as we have endeavored to do here, and will continue in our upcoming academic paper). But if every attempt to communicate findings to the general public is criticized because it lacks that rigor, we risk discouraging researchers from describing their work and its importance to those outside the statistical community.
We very much appreciate being given the chance to respond to your criticisms. Thank you, Jim.