Thanks for your posts and for sharing your code. It’s especially useful for someone just starting out, like me. I’m curious to see how extra innings play a role in home team wins. To my surprise, I didn’t find the number of innings played in the game log files. I just want to make sure I’m not missing anything in the file.

So I ended up extracting the information from the play-by-play data, but I’m encountering error messages when pulling the 2014 and 2015 data. Have you seen that before?

Btw, overall at the league level, the home team wins only ~50% of the time when there are extra innings, which was another surprise. But it could be pretty different by team.

– I went to your session at JSM. Very interesting topic!

Thanks!

Meng

Additionally, I am a huge fan of your book “Curve Ball”. I read it a number of years ago and have always wanted to thank you for your contributions to the growth of analytics in the best game ever.

]]>Take:

P1 = W – L [First Half]

P2 = W – L [Second Half]

Y = P2 – P1

You are then plotting Y versus P1 and noting the correlation is about -0.5. Without loss of generality, assume that P1 and P2 have zero means (which also makes Y have a zero mean). Then you can write the covariance between Y and P1 as:

Cov(Y,P1) = Cov(P2-P1,P1) = E[(P2 – P1)*P1]

Where “E” is the expected value and “Cov” is the covariance. Because of the linearity of the expected value we then have:

E[(P2 – P1)*P1] = E[P2*P1 – P1*P1] = E[P2*P1] – E[P1*P1]

Notice that the last two terms can be written as a covariance and a variance of the original P1 and P2 terms:

E[P2*P1] – E[P1*P1] = Cov(P2,P1) – Var(P1)

If the two halves have roughly the same variance, the covariance between them cannot be larger than that variance, so this relationship *will always* be negative unless P1 and P2 are perfectly correlated. To get Cor(Y,P1), divide by the denominator SQRT[Var(Y)*Var(P1)], where Var(Y) = Var(P1) + Var(P2) – 2*Cov(P2,P1). With equal variances and a correlation r between P1 and P2, this simplifies to Cor(Y,P1) = –SQRT[(1 – r)/2]. In the case that P2 and P1 are unrelated, that is Cov(P2,P1) = 0, this gives a correlation of about -0.71 rather than -0.5.

So correlations between Y and P1 that are really close to -0.5 over the years are what you would expect when P1 and P2 have a correlation of about 0.5 with each other. The average of -0.44 corresponds to a correlation of about 0.61 between the two halves, under the equal-variance assumption. Either way, the negative sign is built into how Y is defined.
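
As a rough numerical check on the algebra above, here is a minimal Python sketch. It makes assumptions that go beyond the comment: P1 and P2 are taken to be jointly normal with zero means and the same standard deviation, and the names `simulate_cor` and `theory`, along with the values of `sigma`, `n`, and `seed`, are made up for illustration. Under those assumptions, Cov(P2,P1) – Var(P1) divided by SQRT[Var(Y)*Var(P1)] reduces to –SQRT[(1 – r)/2], where r is the correlation between the two halves.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_cor(r, n=100_000, sigma=10.0, seed=1):
    """Simulate Cor(Y, P1) with Y = P2 - P1 when Cor(P1, P2) = r.

    Assumption (not from the original comment): P1 and P2 are jointly
    normal with mean 0 and the same standard deviation sigma.
    """
    rng = random.Random(seed)
    p1, y = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        a = sigma * z1
        # Mix z1 and z2 so that Cor(P1, P2) = r and Var(P2) = sigma^2.
        b = sigma * (r * z1 + math.sqrt(1.0 - r * r) * z2)
        p1.append(a)
        y.append(b - a)
    return pearson(p1, y)


def theory(r):
    """Closed form under equal variances: Cor(Y, P1) = -sqrt((1 - r) / 2)."""
    return -math.sqrt((1.0 - r) / 2.0)


if __name__ == "__main__":
    for r in (0.0, 0.5, 0.6):
        print(f"r = {r:.1f}  simulated = {simulate_cor(r):.3f}  "
              f"theory = {theory(r):.3f}")
```

The closed form gives about -0.707 for r = 0 and exactly -0.5 for r = 0.5, and the simulated values should land close to those. If the two halves have noticeably different variances, the simplified formula no longer applies and the full expression with Var(Y) in the denominator is the one to use.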

]]>