Retrosheet Data and Home Runs
Retrosheet recently made available the game accounts and play-by-play files for the 2017 season. I have a web page on how to download Retrosheet play-by-play files and compute run expectancies and this information still works for the current season.
Of course, one of the main points of discussion about the 2017 season was the record number (6105) of home runs hit. To celebrate the new Retrosheet data, I thought I’d present some interesting explorations of the 2017 home run data, and contrast 2017 with the 2000 season, which had the second-highest season total of home runs (5693). Recently, I wrote a short post comparing the home run hitting for the 2016 and 2017 seasons.
Home Run Weather Effects
Since we have the number of home runs hit for each game, it is easy to find the home run rate (that is, home runs divided by plate appearances) for each day of the 2017 season. Here’s a scatterplot with a smoothing curve added. It is interesting that the April and May home run rates were low, rates seemed to stabilize in the middle of the season, and there was a small decrease in the home run rates towards the end of the season. Is this a cold weather effect? (It is well known that the weather can have a big impact on the flight of fly balls and home runs.)
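To make the daily rate calculation concrete, here is a minimal Python sketch. It assumes the play-by-play data has already been reduced to one (date, home run flag) pair per plate appearance; the function names and the toy data are my own inventions, and the centered moving average is only a simple stand-in for whatever smoother is used in the plot.

```python
from collections import defaultdict

def daily_hr_rates(plays):
    """plays: iterable of (date_string, is_home_run) pairs, one per plate appearance.
    Dates are assumed to be ISO strings (YYYY-MM-DD) so that they sort correctly.
    Returns a list of (date, home_runs, plate_appearances, rate) sorted by date."""
    hr = defaultdict(int)
    pa = defaultdict(int)
    for date, is_hr in plays:
        pa[date] += 1
        hr[date] += int(is_hr)
    return [(d, hr[d], pa[d], hr[d] / pa[d]) for d in sorted(pa)]

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at both ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Made-up toy data: 10 PA with no home runs on one day, 20 PA with one HR the next.
toy_plays = ([("2017-04-03", True)] + [("2017-04-03", False)] * 19
             + [("2017-04-02", False)] * 10)
toy_rates = daily_hr_rates(toy_plays)
toy_smooth = moving_average([r[3] for r in toy_rates], window=3)
```

With real data one would plot the rate against the date and overlay the smoothed values.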
How Long Does One Wait to Observe a Home Run?
As you might know, I am interested in exploring streaky patterns in hitting. Suppose we record the specific plate appearances where one observes home runs and compute the spacings, that is, the number of plate appearances between the PA’s of successive home runs. Here’s a histogram of all of the spacings (waiting times where the unit is a PA) between home runs for the 2017 season. If the occurrence of home runs across PA’s is a coin-tossing process with a constant probability of success (that is, hitting a home run), then the spacings will have a geometric distribution. I’ve attached a matching geometric probability curve. It does a reasonable job of matching this spacing distribution although the assumptions are not true (for example, HR’s are not equally likely to occur at each position in the lineup). By the way, the median spacing is 14 PA, so half of the time a baseball fan will wait 14 PA or fewer before she observes the next home run.
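Here is a small Python sketch of the spacing calculation and the geometric comparison. It assumes a chronological sequence of home run flags, one per PA, and it counts a spacing as the difference in PA positions (so back-to-back home runs give a spacing of 1); that convention, the function names, and the toy sequence are my assumptions for illustration.

```python
from statistics import median

def hr_spacings(is_hr):
    """is_hr: chronological sequence of booleans, one per plate appearance.
    Returns the number of PA from each home run to the next one."""
    positions = [i for i, flag in enumerate(is_hr) if flag]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

def geometric_pmf(k, p):
    """P(spacing = k) for k = 1, 2, ... under a constant home run probability p."""
    return (1 - p) ** (k - 1) * p

# Made-up toy sequence of 10 PA with home runs at positions 1, 4, 5 and 9.
toy_flags = [False, True, False, False, True, True, False, False, False, True]
toy_spacings = hr_spacings(toy_flags)      # [3, 1, 4]
toy_median = median(toy_spacings)          # 3
p_hat = sum(toy_flags) / len(toy_flags)    # overall HR rate, used as the geometric p

# Geometric probabilities to compare against the observed spacing frequencies.
fitted = {k: geometric_pmf(k, p_hat) for k in range(1, 6)}
```

For the real data, one would overlay the fitted probabilities (scaled by the number of spacings) on the histogram.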
Runs Values of Home Runs
When I download the Retrosheet data, I also compute the runs values of each play. So it is easy to plot the runs values of the 6105 home runs. As this graph shows, most of the home runs are of the bases-empty variety with a runs value of 1. Of the home runs with greater runs values, most fall between 1.5 and 2 runs.
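To see why a bases-empty home run is worth exactly 1, here is a sketch that assumes the usual definition of a runs value: the run expectancy after the play, minus the run expectancy before it, plus the runs scored. The run expectancy numbers below are made-up round values for illustration only, not estimates from Retrosheet data, and the state labels are hypothetical.

```python
# HYPOTHETICAL run expectancies (expected runs in the rest of the inning),
# keyed by (runners on base, outs). Illustrative round numbers only.
RUN_EXPECTANCY = {
    ("empty", 0): 0.50,
    ("first", 0): 0.85,
    ("loaded", 0): 2.30,
}

def home_run_value(runners, outs, runners_on):
    """Runs value of a home run hit in the state (runners, outs).
    After a home run the bases are empty and the number of outs is unchanged."""
    before = RUN_EXPECTANCY[(runners, outs)]
    after = RUN_EXPECTANCY[("empty", outs)]
    runs_scored = 1 + runners_on          # the batter plus everyone on base
    return after - before + runs_scored

solo_value = home_run_value("empty", 0, 0)      # before == after, so exactly 1
two_run_value = home_run_value("first", 0, 1)   # 2 runs score, minus the lost runner value
slam_value = home_run_value("loaded", 0, 3)
```

With the bases empty the state is the same before and after the home run, so the two run expectancy terms cancel and only the one run scored remains. With runners on, the value is less than the number of runs scored because those runners already had some chance of scoring.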
Comparing 2017 and 2000 Seasons — Which Rate?
I wanted to compare the home run hitting for the 2017 and 2000 seasons. Suppose one looks at only the batters with at least 200 plate appearances. If one defines a home run rate as HR / PA, then one finds that
Median HR Rate for 2017 = 0.032
Median HR Rate for 2000 = 0.029
so that the Median HR rate in 2017 is 0.032 – 0.029 = 0.003 higher. But given that there were many more strikeouts in 2017, maybe it would make more sense to divide the number of home runs (HR) by the number of balls in play (PA – BB – SO). Then we find
(IP) Median HR Rate for 2017 = 0.044
(IP) Median HR Rate for 2000 = 0.037
so focusing on balls in play, the Median HR rate in 2017 is 0.044 – 0.037 = 0.007 higher.
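The two rate definitions can be sketched in Python as follows. The batter records are made up, and the dictionary keys and function names are my own choices; only the formulas HR / PA and HR / (PA – BB – SO) and the 200 PA cutoff come from the discussion above.

```python
from statistics import median

def hr_rate_pa(b):
    """Home runs per plate appearance."""
    return b["HR"] / b["PA"]

def hr_rate_in_play(b):
    """Home runs per ball put in play, that is, HR / (PA - BB - SO)."""
    return b["HR"] / (b["PA"] - b["BB"] - b["SO"])

def median_rates(batters, min_pa=200):
    """Median of both rates over batters with at least min_pa plate appearances."""
    qualified = [b for b in batters if b["PA"] >= min_pa]
    return (median(hr_rate_pa(b) for b in qualified),
            median(hr_rate_in_play(b) for b in qualified))

# Made-up season lines; the last batter is dropped by the 200 PA cutoff.
toy_batters = [
    {"PA": 600, "HR": 30, "BB": 60, "SO": 140},
    {"PA": 500, "HR": 10, "BB": 40, "SO": 60},
    {"PA": 400, "HR": 12, "BB": 50, "SO": 50},
    {"PA": 100, "HR": 9, "BB": 5, "SO": 45},
]
toy_median_pa, toy_median_ip = median_rates(toy_batters)
```

Running this separately on the batters of each season and subtracting the medians gives the kind of comparison shown above.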
Above we are focusing only on the median HR rates for the two seasons. To better compare the two distributions of home run rates, we compute the quantiles for a set of probability values between 0 and 1 for both distributions. Below I have plotted the 2017 quantile minus the 2000 quantile for different probabilities using both the HR / PA and the HR / (PA – SO – BB) rate definitions. We see several things:
- Clearly the home run rates differ more between the 2000 and 2017 seasons if we look at the rate of home runs for balls put in play.
- For the sluggers (corresponding to the high probability values), the differences between the two seasons’ home run rates are smaller.
- For the in-play rate, the two seasons’ rates are most different for the “above average” sluggers (corresponding to probability values from .6 to .7).
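The quantile comparison can be sketched in Python as below. The linear-interpolation quantile rule is one common convention and may not be the one used for the plot, and the function names and toy numbers are assumptions for illustration.

```python
def quantile(sorted_values, prob):
    """Linear-interpolation quantile of an already sorted list, 0 <= prob <= 1."""
    pos = prob * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def quantile_differences(rates_a, rates_b, probs):
    """For each probability, the quantile of rates_a minus the quantile of rates_b."""
    a, b = sorted(rates_a), sorted(rates_b)
    return [(p, quantile(a, p) - quantile(b, p)) for p in probs]

# Made-up toy values standing in for the two seasons' batter rates.
toy_diffs = quantile_differences([2, 4, 6], [1, 2, 3], [0, 0.5, 1])
```

Plotting the differences against the probabilities, once for each rate definition, gives a display of the kind described above: a flat line would mean the two seasons differ by a constant shift, while a hump shows where in the distribution the seasons differ most.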
This was a quick exploration of home runs using the new 2017 Retrosheet data, but this work does suggest some follow-up questions for future exploration.
- What about the home run weather effects? Are home runs harder to hit during colder weather? To begin, I’d look into the rate of home run hitting for other seasons to see if there is an April/May effect.
- Although home run spacings are approximately distributed according to a geometric distribution, there appear to be some deviations from geometric in the above graph that deserve a closer look.
- The rise in strikeouts decreases the opportunities to hit home runs as demonstrated by this comparison of the 2017 and 2000 seasons. It would be interesting to look more carefully at the relationship between strikeouts and home runs.