This summer I enjoyed reading Big Data Baseball: Math, Miracles, and the End of a 20-Year Losing Streak by Travis Sawchik. It is an entertaining story about the 2013 Pirates, specifically their use of analytics in revamping their lineup for the 2013 season.
The key statistic in Michael Lewis’s Moneyball was the on-base percentage (OBP). At that time, OBP was underappreciated, and the Athletics were able to cheaply sign players who were good at getting on base. Sawchik tells a similar story with newer types of baseball measures. For example, the Pirates signed Russell Martin primarily for his talent in framing pitches, an ability that the baseball establishment underappreciated at the time.
Sawchik’s book talks a lot about the importance of groundballs from a defensive perspective. If you don’t have strikeout pitchers on your team, then a good strategy is to throw pitches that induce groundballs, which are more likely to result in outs.
This groundball strategy has two components. First, a team wants to induce a large number of groundballs — you can measure this by the proportion of in-play events that are groundballs. Second, you want to get outs from groundballs — this tendency can be measured by the proportion of groundballs that are turned into outs. (One reason for all of the recent defensive shifting in baseball is to be more successful in handling groundballs.)
After reading Sawchik’s book, two questions came to mind.
- Do teams really differ significantly with respect to these groundball statistics?
- Have the Pirates really done well in inducing groundballs and turning groundballs into outs in recent seasons?
Here I use Retrosheet play-by-play data to address these questions.
I have a function groundball.plot.R that will compute and plot (1) the proportion of groundballs (among all balls put in play), and (2) the proportion of groundballs that are outs for all 30 defensive teams. The inputs are the Retrosheet play-by-play data for a particular season and the season value.
Here are some comments on this function:
- I focus on the balls that are in-play (the BATTEDBALL_CD variable is either “F”, “G”, “L”, or “P”).
- I create a variable FIELD_TEAM that identifies the team in the field (the defensive team) on each play.
- I am interested in the event “out” which is when the variable EVENT_CD is either 2, 18 or 19.
- For each team, I compute the proportion of groundballs and the proportion of groundballs that are outs.
- Using ggplot2, I construct a scatterplot of the two proportions using the team label as a plotting character.
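The steps in this list can be sketched roughly as follows. This is not the actual groundball.plot.R code, only a minimal illustration using dplyr and ggplot2. The function names groundball_summary and groundball_plot are hypothetical, and the sketch assumes the play-by-play data frame has the Retrosheet columns GAME_ID, AWAY_TEAM_ID, BAT_HOME_ID, BATTEDBALL_CD, and EVENT_CD, with the home team taken from the first three characters of GAME_ID.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: per-team groundball proportions for one season.
# `d` is assumed to be Retrosheet play-by-play data; `season` is a label.
groundball_summary <- function(d, season) {
  d %>%
    # Keep only balls in play.
    filter(BATTEDBALL_CD %in% c("F", "G", "L", "P")) %>%
    mutate(
      # Assumption: the home team (first three characters of GAME_ID)
      # is in the field when the visitors bat (BAT_HOME_ID == 0).
      FIELD_TEAM = ifelse(BAT_HOME_ID == 0,
                          substr(as.character(GAME_ID), 1, 3),
                          as.character(AWAY_TEAM_ID)),
      GB = BATTEDBALL_CD == "G",
      # "Out" as defined in the post: EVENT_CD of 2, 18, or 19.
      OUT = EVENT_CD %in% c(2, 18, 19)
    ) %>%
    group_by(FIELD_TEAM) %>%
    summarize(
      Season = season,
      P_GB = mean(GB),                    # groundballs among balls in play
      P_GB_OUT = sum(GB & OUT) / sum(GB), # outs among groundballs
      .groups = "drop"
    )
}

# Hypothetical helper: scatterplot with team labels as plotting characters.
groundball_plot <- function(S) {
  ggplot(S, aes(P_GB, P_GB_OUT, label = FIELD_TEAM)) +
    geom_text() +
    labs(x = "Proportion of balls in play that are groundballs",
         y = "Proportion of groundballs that are outs")
}
```

With a season of play-by-play data loaded into a data frame (say, a hypothetical data2014), the plot would come from groundball_plot(groundball_summary(data2014, 2014)).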
I first run this for the 2014 season.
To answer the first question, clearly there is much variability, both in producing groundballs, and in converting groundballs to outs. Specifically, the Pirates were excellent in 2014 in both proportions. In contrast, Texas was poor with respect to both measures. These are important differences since all teams want to produce groundball outs.
Next, I run this for the four seasons 2011, 2012, 2013, and 2014, and use ggplot2 to compare across seasons.
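One way to make the across-season comparison is to stack the per-season summaries and draw one panel per season. The sketch below is an assumption about how this could be done, not the code behind the figure. The function name compare_seasons_plot is hypothetical, and each per-season data frame is assumed to have the columns FIELD_TEAM, Season, P_GB, and P_GB_OUT.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: `summaries` is a list of per-season data frames,
# each assumed to have columns FIELD_TEAM, Season, P_GB and P_GB_OUT.
compare_seasons_plot <- function(summaries) {
  # Stack the seasons into one data frame.
  all_seasons <- bind_rows(summaries)
  ggplot(all_seasons, aes(P_GB, P_GB_OUT, label = FIELD_TEAM)) +
    geom_text(size = 3) +
    # One panel per season, sharing axes so panels are directly comparable.
    facet_wrap(~ Season) +
    labs(x = "Proportion of balls in play that are groundballs",
         y = "Proportion of groundballs that are outs")
}
```

Because facet_wrap uses common axis scales by default, a team that sits in the upper right of one panel and the middle of another has visibly moved relative to the league.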
It is interesting that the Pirates were excellent in both groundball measures in 2013 (the season of the book) and 2014. In earlier seasons (2011 and 2012), the Pirates did not stand out. Looking at other teams, Detroit seems poor in creating outs from groundballs in both 2013 and 2014.
To explore further …
- What variables might explain the variability in these groundball statistics?
- Is there a relationship between a team’s tendency to throw pitches down in the zone (such as a two-seam fastball) and the proportion of groundballs?
- If we have data on the number of infield shifts, is there a relationship between the number of defensive shifts and the proportion of groundballs that are outs?
To answer these questions, one needs to collect additional data. Data on pitch types are available through PITCHf/x. I’m unsure about defensive shifting data, but I suspect that aggregate data are available somewhere.
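Once team-level shift counts are in hand, a first look at the shifting question could be as simple as a correlation and a least-squares line. The sketch below assumes a data frame with one row per team and two hypothetical columns, N_SHIFTS (number of infield shifts) and P_GB_OUT (proportion of groundballs that are outs). Neither the function name nor the column names come from an actual data source.

```r
# Hypothetical helper: `teams` is assumed to have one row per team with
# columns N_SHIFTS and P_GB_OUT (both names are illustrative).
shift_relationship <- function(teams) {
  # Simple linear regression of the groundball out rate on shift count.
  fit <- lm(P_GB_OUT ~ N_SHIFTS, data = teams)
  list(
    correlation = cor(teams$N_SHIFTS, teams$P_GB_OUT),
    slope = unname(coef(fit)["N_SHIFTS"])
  )
}
```

With only 30 teams per season, any such estimate would be noisy, so this is best read as an exploratory summary rather than evidence of a causal effect.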