I just returned home from last Saturday’s Carnegie Mellon Baseball Workshop. It was a big success. We had about 60 students participate in the sessions, and a large group of us went to a Q&A session with the Pirates analytics group. I know some of the workshop participants traveled a long way to the meeting: I met a coach from California and several students from the University of Hartford (CT). (I would love to organize a similar event for Cleveland next spring.) I thought I’d share some of the code and graphs that were discussed on Saturday.
The Morning Session
Mark Patterson from CMU was the presenter in the morning. The purpose of the morning session was to introduce R, RStudio, and the Lahman package, which contains a large collection of season-by-season statistics for players and teams. The R code for the morning session can be found on Ron Yurko’s site and you’re welcome to try it out. I’ll discuss some highlights, including some of the graphs that Mark presented.
It is fascinating to explore the change in home run hitting over the history of baseball. Mark showed how to plot the leading home run count for all seasons, and this plot led to a discussion of Roger Maris’ 61, Mark McGwire’s 70, and Barry Bonds’ 73. Sammy Sosa had three seasons with 60+ home runs, but each time he was overshadowed by the leader.
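Mark’s actual code is on Ron’s site; what follows is only a minimal sketch of how one might build this kind of plot, assuming the Lahman, dplyr, and ggplot2 packages are installed. The object names `season_hr` and `hr_leaders` are my own choices for illustration.

```r
library(Lahman)
library(dplyr)
library(ggplot2)

# Sum home runs over stints so a player traded mid-season counts once per season
season_hr <- Batting %>%
  group_by(playerID, yearID) %>%
  summarize(HR = sum(HR, na.rm = TRUE), .groups = "drop")

# Keep the largest individual home run total in each season
hr_leaders <- season_hr %>%
  group_by(yearID) %>%
  summarize(HR = max(HR), .groups = "drop")

# Plot the leading total against season
ggplot(hr_leaders, aes(x = yearID, y = HR)) +
  geom_line() +
  geom_point() +
  labs(x = "Season", y = "Leading home run count")
```

The 1961, 1998, and 2001 seasons should stand out as the Maris, McGwire, and Bonds peaks.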
It is also interesting to explore the total number of home runs hit for all seasons.
This graph is a bit misleading since the number of opportunities to hit home runs has increased: we have more players and longer seasons. It makes more sense to focus on the proportion of home runs hit per plate appearance. There are three true outcomes (home runs, walks, and strikeouts), and it is interesting to see how the rate of each of these TTOs has changed over time. (Although the labelling could be improved, what is the TTO corresponding to the red line in the graph?)
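Here is a rough sketch of the per-plate-appearance calculation, again using the Lahman `Batting` table. Two assumptions are mine rather than Mark’s: plate appearances are approximated as AB + BB + HBP + SH + SF, and I start at 1920, an arbitrary cutoff chosen to avoid early seasons where strikeouts were not always recorded.

```r
library(Lahman)
library(dplyr)
library(tidyr)
library(ggplot2)

tto_rates <- Batting %>%
  filter(yearID >= 1920) %>%             # arbitrary cutoff (assumption)
  group_by(yearID) %>%
  summarize(
    pa = sum(AB, BB, HBP, SH, SF, na.rm = TRUE),  # approximate plate appearances
    hr = sum(HR, na.rm = TRUE),
    bb = sum(BB, na.rm = TRUE),
    so = sum(SO, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(HR = hr / pa, BB = bb / pa, SO = so / pa) %>%
  select(yearID, HR, BB, SO) %>%
  # One row per season and outcome, so each outcome gets its own labelled line
  pivot_longer(c(HR, BB, SO), names_to = "outcome", values_to = "rate")

ggplot(tto_rates, aes(x = yearID, y = rate, color = outcome)) +
  geom_line() +
  labs(x = "Season", y = "Rate per plate appearance", color = "Outcome")
```

Mapping the outcome to color produces a legend automatically, which takes care of the labelling issue.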
We had a nice Subway lunch and I gave a general talk about the opportunities using the new Statcast data. You can find the slides from my talk here. I think my talk was a nice segue to the discussion of the PITCHf/x and Statcast data in the afternoon.
The Afternoon Session
Ron Yurko from CMU led the afternoon session, introducing some of the publicly available Statcast data. (The R code for the afternoon session is also available.) Since we were going to watch the Pirates/Reds game that evening, it was relevant to explore the hitting characteristics of Joey Votto, the great hitter for the Reds. (By the way, Votto went 2 for 4 with a sacrifice fly in the game Saturday evening.)
Here Ron explored how Votto’s whiff rate and his OPS varied as a function of the pitch type. One interesting finding is that Votto does especially well against sinkers (coded SI in the graph). He appears to rarely miss sinkers (a low whiff rate), and he hits well against both sinkers and curveballs (high OPS values).
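To make the whiff rate idea concrete, here is a small sketch in base R. I am taking whiff rate to mean the fraction of swings that miss, and the data frame below is entirely made up: the column names `swing` and `miss` and all of the values are hypothetical, not Votto’s actual pitches or the layout of the Statcast files.

```r
# Toy pitch-level data (hypothetical column names and values)
pitches <- data.frame(
  pitch_type = c("SI", "SI", "SI", "SI", "CU", "CU", "CU", "FF", "FF", "FF"),
  swing      = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  miss       = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Keep only the pitches the batter swung at
swings <- subset(pitches, swing)

# Whiff rate by pitch type: proportion of swings that missed
whiff_rate <- tapply(swings$miss, swings$pitch_type, mean)
print(round(whiff_rate, 3))
```

With real data, the same split-and-average step would be applied after deriving the swing and miss indicators from whatever pitch description field the data source provides.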
Votto is known to be a very disciplined hitter. Ron illustrated this fact by graphing his fitted swing probability as a function of the location of the pitch around the strike zone. He appears to rarely swing at pitches outside of the zone.
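One way to get a fitted swing probability surface is a logistic GAM with a two-dimensional smooth of pitch location. I don’t claim this is exactly Ron’s model; the sketch below uses the mgcv package on simulated data, and the names `plate_x`, `plate_z`, and `swing`, along with the made-up swing tendency, are assumptions for illustration.

```r
library(mgcv)

set.seed(1)
n <- 2000

# Simulated pitch locations in feet (assumed column names)
sim <- data.frame(
  plate_x = runif(n, -2, 2),
  plate_z = runif(n, 0.5, 4.5)
)

# Made-up swing tendency that drops off away from the middle of the zone
dist2 <- sim$plate_x^2 + (sim$plate_z - 2.5)^2
sim$swing <- rbinom(n, 1, plogis(2 - 2.5 * dist2))

# Logistic GAM with a smooth surface over location
fit <- gam(swing ~ s(plate_x, plate_z), family = binomial, data = sim)

# Fitted swing probability over a grid of locations
xs <- seq(-2, 2, length.out = 50)
zs <- seq(0.5, 4.5, length.out = 50)
grid <- expand.grid(plate_x = xs, plate_z = zs)
grid$p_swing <- as.numeric(predict(fit, newdata = grid, type = "response"))

# Contour plot of the fitted surface (rows index plate_x, columns plate_z)
contour(xs, zs, matrix(grid$p_swing, nrow = length(xs)),
        xlab = "Horizontal location (ft)", ylab = "Vertical location (ft)")
```

For a disciplined hitter, the high-probability contours would sit tightly inside the strike zone.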
Where does Votto hit balls in play? Ron showed how to construct this spray graph of his batted balls using a generalized additive model (GAM) fit. He appears to pull his ground balls to the right side of the infield, but his balls in play to the outfield tend to go to all fields. (By the way, at the game, both of Votto’s singles on Saturday night were hit to the opposite field.)
- Although I can’t speak for the participants, I think this workshop was beneficial at least in showing off the data and demonstrating some interesting analyses. I encourage you to try out the sample R code. I think it is a bit remarkable that anyone with a laptop can get started by downloading R and installing the Lahman package.
- Many of us think that a baseball gig would be a dream job, but we learned from the Pirates analytics folks that it can have its challenges. For example, it can be hard to handle the roller coaster nature of the sport. Losses impact the whole organization. Also, communicating their work to other people in the organization can be challenging.
- These types of R workshops should be more than just “show and tell” by people who are well-versed in R and baseball, but it can be hard to achieve the right balance of teaching and activities. One learns R best by working on a particular exploration, guided by some of the work presented here. In my teaching, I try to lecture less and devote more of the class time to working together on some learning activity. I think I am more effective when interacting with the students, since in this way I am directly addressing their needs.