Last Week’s Post
In the last post, I considered the problem of predicting a team’s second half W/L record based on its W/L record in the first half. I was using the win/loss difference W – L as my measure of success.
Based on reader comments, there were two issues with my post.
- Some readers were confused when I looked at the change in the win/loss difference from the 1st to the 2nd half (a difference of differences?).
- Actually, as Andrew Wheeler has commented, it is misleading to make an inference based on the negative pattern in the (W/L record, Change in W/L record) scatterplot.
It is Misleading to Look at the Change Values
Let me first address the misleading issue using an R simulation. Suppose we measure a team’s performance in the 1st half and in the 2nd half, and the true correlation between the two measurements is rho. I simulate (1st half, 2nd half) values for 30 teams from a bivariate normal distribution with zero means, unit standard deviations, and correlation rho. Then I compute the observed correlation between the 1st half value and the Change = 2nd half value minus 1st half value. I repeat this simulation 1000 times and collect the 1000 observed correlations.
I did this simulation experiment four times for true correlation values of rho = 0, rho = 0.3, rho = 0.6, and rho = 0.9. Below I construct density graphs of the observed correlations for each of the four cases.
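For readers who want to try the experiment themselves, here is a minimal sketch in Python with NumPy. The simulation in this post was run in R and its code is not shown, so this is only an approximation of the procedure described above; the function name `simulate_change_correlations`, the seed, and the summary printed at the end are assumptions made for illustration.

```python
import numpy as np


def simulate_change_correlations(rho, n_teams=30, n_sims=1000, seed=0):
    """Return n_sims observed correlations between the 1st half value
    and the change (2nd half minus 1st half).

    Hypothetical helper: the name, defaults, and seed are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Zero means, unit standard deviations, correlation rho.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    cors = np.empty(n_sims)
    for i in range(n_sims):
        halves = rng.multivariate_normal([0.0, 0.0], cov, size=n_teams)
        first, second = halves[:, 0], halves[:, 1]
        change = second - first
        cors[i] = np.corrcoef(first, change)[0, 1]
    return cors


# One run of the experiment for each true correlation considered in the post.
results = {rho: simulate_change_correlations(rho) for rho in (0.0, 0.3, 0.6, 0.9)}
for rho, cors in results.items():
    print(f"rho = {rho:.1f}: mean observed correlation = {cors.mean():.2f}")
```

Each array in `results` holds the 1000 observed correlations for one value of rho, so it could be passed to any density-plotting routine to produce graphs like the ones described here.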
Several interesting things to see here:
- The relationship between the first half measure and the change score (difference between 1st and 2nd half scores) is negative even when the first and second half measures are uncorrelated (rho = 0). This is not intuitive — one would think that the first half and change measurements would have little relationship in this case.
- As the relationship between the 1st and 2nd half measures gets stronger (moving from rho = 0 to rho = 0.9), the relationship between the 1st half measure and the change score gets weaker (the correlation moves towards zero). Again, this is not intuitive to me, but perhaps there is a good explanation.
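One possible explanation comes from a short calculation under the simulation’s own assumptions. Write X for the 1st half value, Y for the 2nd half value, and D = Y - X for the change, with unit variances and correlation rho between X and Y. These symbols are introduced here only for illustration.

```latex
\begin{aligned}
\operatorname{Cov}(X, D) &= \operatorname{Cov}(X, Y) - \operatorname{Var}(X) = \rho - 1, \\
\operatorname{Var}(D) &= \operatorname{Var}(Y) + \operatorname{Var}(X) - 2\operatorname{Cov}(X, Y) = 2(1 - \rho), \\
\operatorname{Corr}(X, D) &= \frac{\rho - 1}{\sqrt{2(1 - \rho)}} = -\sqrt{\frac{1 - \rho}{2}}.
\end{aligned}
```

Under these assumptions the theoretical correlation is about -0.71 for rho = 0, -0.59 for rho = 0.3, -0.45 for rho = 0.6, and -0.22 for rho = 0.9. It is always negative because the 1st half value enters the change with a minus sign, and it shrinks towards zero as rho grows because the change score then varies less in step with the 1st half value.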
Back to the Prediction Problem
Okay, here’s a better way to look at the prediction problem. We are interested in predicting a team’s 2nd half performance (defined by W – L) using the 1st half W – L performance. Here I graph the first and 2nd half performances for the 2016 season.
If we thought the 1st half record was the best prediction of a team’s 2nd half record, we would be wrong. This graph demonstrates regression to the mean, since the fitted slope is 0.45, which is smaller than 1. If we repeat this analysis for the past 50 seasons, we’d find that a typical slope is 0.53. So if a team like the Phillies has a 1st half W – L difference of -30, we’d predict its W – L difference in the 2nd half to be 0.53 times (-30), which is approximately -16.
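The prediction rule in the last sentence can be written as a tiny Python sketch. The function name `predict_second_half_diff` is hypothetical, and the rule assumes, as the arithmetic above does, that the prediction is simply the slope times the 1st half difference with no intercept term.

```python
def predict_second_half_diff(first_half_diff, slope=0.53):
    """Predict the 2nd half W - L difference by shrinking the 1st half
    difference towards zero.

    Assumes a no-intercept rule; 0.53 is the typical slope quoted in the post.
    """
    return slope * first_half_diff


# The Phillies example: a 1st half W - L difference of -30.
phillies_prediction = predict_second_half_diff(-30)

# For comparison, a slope of 1 would carry the 1st half record over unchanged.
naive_prediction = predict_second_half_diff(-30, slope=1.0)

print(round(phillies_prediction, 1), naive_prediction)
```

Any slope below 1 pulls the prediction towards zero, which is the regression to the mean effect described above.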
Thanks to Andy Wheeler, who commented on the misleading issue in my graph. Andy has also posted on this issue here.
In next week’s post, I plan to look at some hitters who have unusually high or low batting averages on balls in play (BABIP), and I’ll use some Statcast data to help understand which variables influence the BABIP values.