An interested reader asks:
“If you have a point estimate on posey, turner, zimmerman’s eoy batting average, you have an algorithm that predicts each player’s eoy batting average on any day. is it proprietary, or are you willing to provide it? … I’m certainly NOT the only person interested in the answers to those questions!”
Okay, here is some code that will predict the final season batting averages using my component method for any day of interest. I recently described my component method. Basically, one breaks down the BA into three rates: the K rate (SO / AB), the HR rate (HR / (AB – SO)) and the BA in-play rate ((H – HR) / (AB – SO – HR)). Then one simultaneously estimates each group of component rates using a basic multilevel model, and finally one combines the component predictions for each player to get a prediction of the final season AVG. In my paper I show that this method is generally superior to the basic method of shrinking batting averages towards the mean.
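As a minimal sketch of the decomposition described above, the following base R code recombines the three component rates into a batting average and checks that the result matches H / AB. The function name combine_rates and the season line are hypothetical, chosen only for illustration; they are not taken from the gist code.

```r
# Combine the three component rates into a batting average.
#   k_rate   = SO / AB
#   hr_rate  = HR / (AB - SO)
#   bip_rate = (H - HR) / (AB - SO - HR)
# combine_rates is a hypothetical helper name, not part of the gist code.
combine_rates <- function(k_rate, hr_rate, bip_rate) {
  (1 - k_rate) * (hr_rate + (1 - hr_rate) * bip_rate)
}

# Hypothetical season line (not a real player)
AB <- 200; SO <- 50; HR <- 10; H <- 60

k_rate   <- SO / AB
hr_rate  <- HR / (AB - SO)
bip_rate <- (H - HR) / (AB - SO - HR)

avg_from_components <- combine_rates(k_rate, hr_rate, bip_rate)
avg_direct <- H / AB

# The two values agree, so the decomposition loses no information
print(all.equal(avg_from_components, avg_direct))
```

With observed rates the identity is exact. The component method replaces each observed rate with its multilevel estimate before recombining, which is why the predicted AVG differs from the current one.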
Here are the details for getting daily predictions (all of the R code is provided on my gist site):
- I needed to find a website that provided the standard batting data for the current day in the MLB season and could be easily imported into R. Looking around, I was successful in reading the batting data (for qualifying hitters) from the Sports Illustrated page using the htmltab package. So I wrote a short function collect_data that does the scraping of the current day’s data.
- Once I have collected the data, I just apply the fit_comp_half function (also on my gist site) to implement this method.
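To give a feel for what a multilevel estimate does to a single group of rates, here is a simplified sketch that assumes a beta-binomial model with fixed hyperparameters. The helper shrink_rate, the counts, and the values of eta and K are all hypothetical. The actual fit_comp_half function estimates its parameters from the data and may use a different formulation.

```r
# Shrinkage sketch, assuming y_i ~ Binomial(n_i, p_i) and
# p_i ~ Beta(K * eta, K * (1 - eta)). The posterior mean of p_i is then
# (y_i + K * eta) / (n_i + K), a compromise between y_i / n_i and eta.
# shrink_rate is a hypothetical helper, not part of the gist code.
shrink_rate <- function(y, n, eta, K) {
  (y + K * eta) / (n + K)
}

# Hypothetical strikeout counts for three players with different AB totals
so <- c(20, 30, 60)
ab <- c(50, 150, 200)

eta <- 0.22  # assumed average K rate (illustrative value)
K   <- 100   # assumed prior "sample size" (illustrative value)

observed <- so / ab
shrunk   <- shrink_rate(so, ab, eta, K)

print(round(data.frame(ab, observed, shrunk), 3))
```

Each shrunken rate lies between the observed rate and eta, and players with fewer at-bats are pulled further towards eta. In a real fit, eta and K would be estimated separately for each group of component rates.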
I am writing this Sunday morning June 4. The following graph shows the current AVG and the final season predictions. I have identified some interesting points from the graph. Miguel Sano is an example of a player who is predicted to drop in AVG. Murphy, Zimmerman, and Turner are all predicted to finish above .300, but in a different order than their current AVGs.
Here are the computations for Miguel Sano, who currently has a BA of .299 but who I predict will finish with a season AVG under .250. The fit_comp_half function outputs a data frame of computations for all players.
filter(d2$S, H / AB1 > .29, Comp.Est < .25)
     playerID SO  AB   SO.Rate HR AB.SO    HR.Rate H.HR AB.SO.HR
1 12 Sano, M. 77 174 0.3973213 13    97 0.09818932   39       84
    H.Rate  Comp.Est  H AB1 Shrinkage.Est
1 0.338869 0.2433526 52 174     0.2753458
Sano’s observed component rates are SO rate = 77 / 174 = 0.442, HR rate = 13 / 97 = 0.134, and BABIP rate = 39 / 84 = 0.464. All of these rates are relatively large. The final season predictions of these rates are, respectively, 0.397, 0.098, and 0.339. All of these predictions move the observed rates towards the corresponding averages, but the movement is most severe for the BABIP rate. The prediction of Sano’s final AVG is
Predicted AVG = (1 – 0.397) * (0.098 + (1 – 0.098) * 0.339) = 0.243
which is considerably lower than his current AVG of 52 / 174 = 0.299.
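The arithmetic for Sano can be reproduced in a few lines of base R. The counts and estimated rates below are copied from the output shown earlier; the variable names are illustrative and not part of the gist code.

```r
# Observed counts for Sano, copied from the output above
AB <- 174; SO <- 77; HR <- 13; H <- 52

# Estimated (final season) component rates, copied from the output above
est <- c(so = 0.3973213, hr = 0.09818932, bip = 0.338869)

# Observed component rates computed from the counts
obs <- c(so  = SO / AB,
         hr  = HR / (AB - SO),
         bip = (H - HR) / (AB - SO - HR))

# How far each observed rate is moved by the estimate
movement <- obs - est

# Recombine the estimated rates into a predicted final season AVG
predicted_avg <- (1 - est[["so"]]) *
  (est[["hr"]] + (1 - est[["hr"]]) * est[["bip"]])
current_avg <- H / AB

print(round(rbind(obs, est, movement), 3))
print(round(c(current = current_avg, predicted = predicted_avg), 3))
```

The BABIP rate shows the largest absolute movement (about 0.125, compared with about 0.045 for the SO rate and 0.036 for the HR rate), and the recombined value agrees with the Comp.Est column up to rounding.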
Anyway, I encourage the reader to try out this code for any day this season. The results are most interesting when the current season averages have a lot of variability.
Added Later in the Day
To make these functions easier to use, I put together an R package, BApredict, containing three functions:
- collect_hitting_data() collects the data from the SI site,
- component_predict() computes the predictions using my method, and
- graph_predictions() constructs a scatterplot of the current and predicted BAs.
The code below installs and loads the package and runs these functions using today’s data.
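library(devtools)                      # provides install_github()
install_github("bayesball/BApredict")  # install the package from GitHub
library(BApredict)
d <- collect_hitting_data()            # scrape today's batting data
out <- component_predict(d)            # compute the component predictions
graph_predictions(out)                 # plot current vs. predicted AVG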
R packages are easy to construct nowadays — maybe that will be the subject of next week’s post.