The Prediction Problem
We are currently halfway through the 2017 season, and some team records stand out:
- The 2016 World Series Winner, the Chicago Cubs, is currently 41-42.
- The Atlanta Braves are currently 40-41 in a rebuilding year.
- The LA Dodgers have a 55-29 record with a 22 1/2 game lead over the SF Giants who have a 33-52 record.
- The Phillies are 28-54 (an extreme record, though perhaps not a surprising one).
That raises the obvious question: Given a team’s first-half record, what is a reasonable prediction of the team’s W/L record in the second half?
One way of describing a team’s record is the difference between the number of wins (W) and losses (L):
Difference = W – L
So currently, the Dodgers’ Win/Loss difference is 55 – 29 = 26 and the Phillies currently have a Win/Loss difference of 28 – 54 = -26. (By the way, it may have made more sense to talk about “games over .500”, but one would get similar 2nd-half predictions.)
So I’ll rephrase my question this way: If you know a team’s W/L difference in the 1st half, what is a good prediction of the team’s W/L difference in the 2nd half of the season?
Retrosheet provides a single game log file for each season. I find it convenient to download all of the game log files and store them in a single folder on my computer, so it is easy to collect them for specific studies of interest.
The R Work for One Season
I have a single R function, all_work.R, that does all of the work for a particular season of interest.
- For each team, I find the W/L record for all games through July 1 (the first half record) and also the W/L record for all games played from July 2 to the end of the season (the second half record).
- I graph the first half W/L difference against the CHANGE = 2nd half W/L difference MINUS 1st half W/L difference.
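The two steps above can be sketched in base R. This is only a minimal illustration on a made-up six-game schedule between two fictional teams; the data frame, its column names, and the helper names wl_difference and split_season are assumptions for the example and are not taken from all_work.R or from the Retrosheet file layout.

```r
# Toy game results. The column names are illustrative assumptions,
# not the field names used in the Retrosheet game logs.
games <- data.frame(
  date = as.Date(c("2016-04-05", "2016-06-30", "2016-07-01",
                   "2016-07-02", "2016-08-15", "2016-09-20")),
  home = c("AAA", "BBB", "AAA", "BBB", "AAA", "BBB"),
  visitor = c("BBB", "AAA", "BBB", "AAA", "BBB", "AAA"),
  home_runs = c(5, 2, 3, 1, 4, 6),
  visitor_runs = c(3, 4, 1, 2, 7, 2),
  stringsAsFactors = FALSE
)

# W - L for each team in `teams`, counted over the games in `g`.
# Tied games are not handled in this sketch.
wl_difference <- function(g, teams) {
  home_won <- g$home_runs > g$visitor_runs
  winners <- ifelse(home_won, g$home, g$visitor)
  losers <- ifelse(home_won, g$visitor, g$home)
  wins <- table(factor(winners, levels = teams))
  losses <- table(factor(losers, levels = teams))
  setNames(as.vector(wins - losses), teams)
}

# Split the season at `cutoff` (the cutoff date belongs to the first half)
# and compute the change in W/L difference for each team.
split_season <- function(g, cutoff) {
  teams <- sort(unique(c(g$home, g$visitor)))
  first <- wl_difference(g[g$date <= cutoff, ], teams)
  second <- wl_difference(g[g$date > cutoff, ], teams)
  data.frame(team = teams,
             first = unname(first),
             second = unname(second),
             change = unname(second - first),
             stringsAsFactors = FALSE)
}

halves <- split_season(games, as.Date("2016-07-01"))
print(halves)
```

Passing the full list of teams to wl_difference keeps a team in the result even if it happens to have no games in one of the halves.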
Here is an illustration of this graph for the 2016 season, where I label each plotting point with the team abbreviation.
Note that we see a negative relationship — teams that do well in the 1st half (like Texas and Cleveland) don’t do as well in the 2nd half. Similarly, teams that have bad 1st halves (like Minnesota and Cincinnati) tend to do better in the 2nd half. (This gives some hope for Phillies fans in the 2nd half of the 2017 season.)
To understand this negative relationship, we fit a line. The slope here is -0.51. This means that if a team has a W/L difference of 12 for the first half, we would predict a change of -0.51 x 12, a drop of about 6 (to a W/L difference of 12 – 6 = 6) in the second half.
When I try this analysis for other seasons, I see a similar pattern, but the slope estimate can change. Here is the graph for the 2015 season.
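Fitting such a line takes one call to lm() in R. The sketch below uses five made-up (first half, change) pairs, chosen so that the fitted slope comes out at -0.5; they are not the 2016 team records, and the variable names are only for illustration.

```r
# Toy data: first-half W/L difference and the change in the second half.
# These numbers are invented for the example, not real team records.
toy <- data.frame(
  first = c(-20, -10, 0, 10, 20),
  change = c(11, 3, 2, -7, -9)
)

# Least-squares line: change = intercept + slope * first
fit <- lm(change ~ first, data = toy)
slope <- unname(coef(fit)["first"])

# Predicted change, and predicted second-half difference,
# for a team that is 12 games over in the first half.
predicted_change <- unname(predict(fit, newdata = data.frame(first = 12)))
predicted_second <- 12 + predicted_change

print(c(slope = slope, change = predicted_change, second = predicted_second))
```

With a slope near -0.5, a team gives back about half of its first-half W/L difference, which is the same arithmetic as the 12 to 6 example above.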
A 50 Season Study
To see if this pattern holds across seasons, I repeat this analysis for all seasons from 1967 through 2016 and collect all of the slopes. Here is a dotplot of the 50 slopes — we see that the average slope is -0.44 and the single season slope can vary between -0.8 and -0.1.
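Collecting one slope per season is a short loop once each season has been reduced to a table of first-half differences and changes. Here is a rough sketch with two invented "seasons" standing in for the 50 real ones; the list seasons and the helper season_slope are hypothetical names, and the numbers are toy values rather than results from the game logs.

```r
# Two toy seasons, each with columns `first` and `change`.
# In the real study there would be one data frame per season, 1967-2016.
seasons <- list(
  season_a = data.frame(first = c(-20, -10, 0, 10, 20),
                        change = c(11, 3, 2, -7, -9)),
  season_b = data.frame(first = c(-20, -10, 0, 10, 20),
                        change = c(9, 2, 2, -6, -7))
)

# Slope of the least-squares line of change on first for one season.
season_slope <- function(d) {
  unname(coef(lm(change ~ first, data = d))["first"])
}

# One slope per season, then the summary used for the prediction rule.
slopes <- sapply(seasons, season_slope)
average_slope <- mean(slopes)

print(slopes)
print(average_slope)
print(range(slopes))
```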
Let’s use the average slope -0.44 as our estimate of the relationship between a team’s first half W/L difference and the change in W/L difference. So my prediction is
2nd Half W/L Difference = 1st Half W/L Difference – 0.44 x 1st Half W/L Difference
Let’s apply this:
- The Phillies currently have a W/L difference of -26. I compute -26 x (-0.44) = 11.44. So I predict that the Phillies will improve in the 2nd half: their W/L difference will be about -26 + 11 = -15.
- The Dodgers currently have a W/L difference of 26. I predict their W/L difference will change by 26 x (-0.44) = -11.44, a drop of about 11, which corresponds to a W/L difference of 26 – 11 = 15.
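The rule is easy to wrap in a small function. This is just a sketch: the function name predict_second_half is made up for the example, the default slope is the -0.44 average from the 50 season study, and the inputs are the W/L differences implied by the records listed at the top of the post (for example, the Cubs at 41-42 are at -1).

```r
# Predicted second-half W/L difference from the first-half difference,
# using: second = first + slope * first.
predict_second_half <- function(first_half_diff, slope = -0.44) {
  change <- slope * first_half_diff
  data.frame(first = unname(first_half_diff),
             change = unname(change),
             second = unname(first_half_diff + change),
             row.names = names(first_half_diff))
}

# First-half W/L differences (W - L) from the records quoted in the post.
records <- c(Dodgers = 55 - 29, Phillies = 28 - 54,
             Cubs = 41 - 42, Braves = 40 - 41)

predictions <- predict_second_half(records)
print(round(predictions, 1))
```

Teams near .500, like the Cubs and Braves, barely move under this rule, since the predicted change is proportional to the first-half difference.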
Now with additional information, one can get better estimates of a team’s W/L record in the second half. For example, I know the Cubs have a strong team, so it is reasonable to predict that their W/L record will improve in the 2nd half. But I’d need to look more carefully at the individual players, predict the 2nd half performance of each player, and combine these predicted performances to get a prediction of the team’s performance in the second half. Still, this “-0.44” method provides a quick way of predicting team performances using only the team records.
To see the R function all_work.R that produces the above graph, see my gist site. I am assuming that you have collected the Retrosheet game log files and put them in a single folder on your computer. You might be interested in seeing how I find a team’s W/L record based on the single game results. Or you might be interested in seeing the ggplot2 code to produce the graph. The ggrepel package is useful in plotting point labels so they don’t overlap.
The negative relationship between the first-half difference and the change score is purely a function of regression to the mean.
P1 = W – L [First Half]
P2 = W – L [Second Half]
Y = P2 – P1
You are then plotting Y versus P1 and noting that the slope of the fitted line is about -0.5. Without loss of generality, assume that P1 and P2 have zero means (which also makes Y have a zero mean). Then you can write the covariance between Y and P1 as:
Cov(Y,P1) = Cov(P2-P1,P1) = E[(P2 – P1)*P1]
Here “E” is the expected value and “Cov” is the covariance. By the linearity of expectation we then have:
E[(P2 – P1)*P1] = E[P2*P1 – P1*P1] = E[P2*P1] – E[P1*P1]
Notice that, because the means are zero, the last two terms are the covariance of P2 and P1 and the variance of P1:
E[P2*P1] – E[P1*P1] = Cov(P2,P1) – Var(P1)
As long as P2 is no more variable than P1, the covariance between the two series cannot exceed Var(P1), so this relationship will be negative (or at most zero). Dividing by Var(P1) gives the slope of the fitted line: Cov(Y,P1)/Var(P1) = Cov(P2,P1)/Var(P1) – 1. In the case that P2 and P1 are unrelated, that is Cov(P2,P1) = 0, the slope is -1. With equal variances, Var(Y) is then 2*Var(P1), so the correlation Cor(Y,P1) = Cov(Y,P1)/SQRT[Var(Y)*Var(P1)] is -1/SQRT(2), or about -0.71.
So the average slope of -0.44 over the years corresponds to Cov(P2,P1)/Var(P1) of about 0.56. That is, teams keep a bit more than half of their first-half difference on average, which points to a moderate positive relationship between P1 and P2 rather than none.
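The covariance identity can be checked numerically. The sketch below simulates first-half and second-half differences for 30 hypothetical teams that share a "talent" component; the standard deviations are arbitrary assumptions and none of the numbers are real team records. The identity Cov(Y,P1) = Cov(P2,P1) – Var(P1) holds exactly for sample covariances, so the slope of Y on P1 is always the slope of P2 on P1 minus 1.

```r
set.seed(1)
n <- 30

# Simulated halves: a shared talent term plus independent noise.
# All standard deviations here are arbitrary choices for illustration.
talent <- rnorm(n, sd = 5)
p1 <- talent + rnorm(n, sd = 8)   # first-half W - L
p2 <- talent + rnorm(n, sd = 8)   # second-half W - L
y <- p2 - p1                      # change score

# Both sides of Cov(Y, P1) = Cov(P2, P1) - Var(P1)
lhs <- cov(y, p1)
rhs <- cov(p2, p1) - var(p1)

# Slope of the change on P1 versus slope of P2 on P1
slope_change <- unname(coef(lm(y ~ p1))["p1"])
slope_level <- unname(coef(lm(p2 ~ p1))["p1"])

print(c(lhs = lhs, rhs = rhs))
print(c(slope_change = slope_change, slope_level_minus_1 = slope_level - 1))
```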