# Cleaning Statcast Data

#### Introduction

As you probably know, I have devoted several blog posts (such as this post) to the relationship between home run hitting and off-the-bat measurements, including launch velocity and launch angle. A common assumption is that the data is clean, in the sense that it doesn’t contain any mistakes or erroneous values. Jonathan Judge’s recent Baseball Prospectus post raises caution about this assumption for Statcast data. There are two issues here. First, there are questions about the accuracy of Statcast measurements; I am not aware of any published statement of the accuracy in measuring, say, a launch velocity. Second, it appears that the launch velocity and launch angle measurements are missing for some batted balls, and MLB completes the data by imputing (substituting) numbers for these missing values. Unfortunately, we don’t know which values are imputed. That motivates me to graphically explore the 2019 Statcast data (through June 23) in this post and look for unusual features in the distributions of launch velocity and launch angle that might suggest issues with the data. Honestly, I should have paid more attention to this cleaning issue before proceeding with any modeling.

#### Distribution of Launch Velocity and Launch Angle

To start, I notice that launch velocities for 2019 data are recorded to the nearest tenth of a mph, which suggests that the accuracy of these velocity measurements is limited. It also lets me tabulate the exact recorded launch velocities of the 60,198 batted balls and construct a graph of their frequencies. It is pretty obvious from the graph below that there is an issue: particular values of launch speed have very high frequencies, which suggests that there is some pattern to the way MLB imputes missing values.

Likewise, launch angles are recorded to the nearest integer, so we can construct a graph of the frequencies of the exact recorded values. The launch angles look pretty bell-shaped, but several angles have LARGE frequencies that deviate from the general bell-shaped pattern.
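The frequency graphs described above come down to tabulating every exact recorded value. Here is a minimal sketch in base R. It uses simulated data rather than the real Statcast file, and the column names `launch_speed` and `launch_angle`, as well as the injected block of identical readings, are assumptions made for illustration.

```r
# Simulated stand-in for the in-play data (NOT real Statcast values).
set.seed(1)
n <- 5000
sim <- data.frame(
  launch_speed = round(rnorm(n, mean = 88, sd = 13), 1),  # nearest 0.1 mph
  launch_angle = round(rnorm(n, mean = 12, sd = 25))      # nearest degree
)

# Inject a block of identical readings to mimic suspected imputation.
sim$launch_speed[1:300] <- 80
sim$launch_angle[1:300] <- 69

# Frequency of every distinct recorded value, most common first.
speed_freq <- sort(table(sim$launch_speed), decreasing = TRUE)
angle_freq <- sort(table(sim$launch_angle), decreasing = TRUE)

head(speed_freq, 3)
head(angle_freq, 3)

# A spike plot of the frequencies makes the suspicious values stand out.
if (interactive()) {
  plot(table(sim$launch_speed),
       xlab = "Launch speed (mph)", ylab = "Frequency")
}
```

In the simulated data the injected value towers over its neighbors in the sorted table, which is the same visual cue used to spot suspicious values in the real graphs.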

#### Cleaning Operations

Through some exploratory work, we can locate some of the extreme values that appear to be imputed rather than observed.

• Since the outliers are most obvious among the launch velocities, I construct a frequency table of the individual values and label “extreme” the values of launch velocity with a frequency exceeding 249. (I chose the value 249 by looking at the original histogram of launch velocities.)
• Since I suspect that MLB would simultaneously impute values of launch velocity and launch angle, I look for pairs of values that have frequencies exceeding 100. The following table shows the pairs of values I found. Note, for example, that the pair (80 mph, 69 degrees) occurs 2039 times, and this is obviously fake. Have you seen that many popups hit with a launch angle of 69 degrees? Also, (82.9 mph, -21 degrees) occurs 1790 times. Together these nine pairs of values represent 5541, or about 9%, of all of the in-play batted ball measurements.
• After I wrote this statement, I thought I should check my comment about a launch angle of 69 degrees. I looked at in-play data from the previous season (2018) and found 4242 launch angle measurements of 69 degrees. For both the 2018 and 2019 seasons, we observe 69 degrees for 3.4% of all balls in play. Whatever 69 degrees means, it happened with the same frequency in the 2018 and 2019 seasons.
• Continuing with my fascination with 69 degrees, there were 4242 occurrences of launch speed = 80 mph and launch angle = 69 degrees in the 2018 season. The pair (80, 69) was equally popular in both the 2018 and 2019 seasons, indicating that it is likely an imputed pair for missing data.
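The pair-based cleaning step in the list above can be sketched in a few lines of base R. This is an illustration on simulated data, not the code behind the post: the column names, the two injected pairs, and the helper names `pair_key`, `flagged`, and `cleaned` are all assumptions.

```r
# Simulated stand-in for the in-play data (NOT real Statcast values).
set.seed(2)
n <- 5000
sim <- data.frame(
  launch_speed = round(rnorm(n, mean = 88, sd = 13), 1),
  launch_angle = round(rnorm(n, mean = 12, sd = 25))
)

# Inject two repeated (speed, angle) pairs to mimic suspected imputation.
sim$launch_speed[1:300]   <- 80
sim$launch_angle[1:300]   <- 69
sim$launch_speed[301:500] <- 82.9
sim$launch_angle[301:500] <- -21

# Count how often each (speed, angle) pair occurs.
pair_key  <- paste(sim$launch_speed, sim$launch_angle, sep = " / ")
pair_freq <- table(pair_key)

# Flag pairs above a cutoff. The value 100 is the cutoff quoted in the
# post; it was chosen by eye and depends on the sample size.
cutoff  <- 100
flagged <- names(pair_freq)[pair_freq > cutoff]

# Drop every row whose pair is flagged.
is_flagged <- pair_key %in% flagged
cleaned    <- sim[!is_flagged, ]

flagged           # the suspicious pairs
sum(is_flagged)   # number of rows dropped
mean(is_flagged)  # share of all rows dropped
```

The same arithmetic applies to the counts quoted above: `round(100 * 2039 / 60198, 1)` gives 3.4, matching the 3.4% share reported for the 69-degree readings in 2019.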

#### Distribution of Cleaned Data

Let’s redraw the distributions of launch speed and launch angle after removing the 5541 observations whose pairs of values, shown above, appear to be imputed. The new graphs are much smoother and look reasonable. But I think some additional cleaning is needed. I see six spikes in the launch speed distribution that need attention; some of these values likely correspond to imputed data. Interestingly, I don’t see any obvious problem with the launch angle measurements. By the way, note that by focusing on correcting the launch velocity outliers, it appears that I also corrected the weird launch angle measurements.

#### Takeaways

• Always start by graphing the data. A couple of graphs revealed the issue with the Statcast launch speeds and launch angles right away.
• Why bother to clean? The results from any kind of traditional regression method can depend greatly on the presence of outliers, so this type of data cleaning is essential before trying to do any type of modeling.
• What in-play data is being thrown away? I found frequencies of the events for the data that appeared to have imputed values of launch speed and launch angle. It seemed that most of the outcomes are outs — there were 441 singles, 2 doubles, and no triples or home runs in this throw-away group.
• Can MLB provide more information? Really, MLB needs to add a variable that indicates whether the values of launch angle and launch speed are actually observed or imputed. What do values of 80 mph and 69 degrees for launch speed and launch angle really mean?
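The check on what is being thrown away, from the third takeaway, is a simple tabulation of outcomes among the flagged rows. Here is a small sketch in base R on a hypothetical eight-row data set. The column names `events` and `is_flagged` and the outcome labels are assumptions for illustration, and the counts have nothing to do with the real data.

```r
# Hypothetical mini data set: `is_flagged` marks rows whose
# (launch speed, launch angle) pair looked imputed.
sim <- data.frame(
  events = c("field_out", "single", "field_out", "home_run",
             "field_out", "double", "single", "field_out"),
  is_flagged = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Outcomes among the rows that would be discarded.
dropped_events <- table(sim$events[sim$is_flagged])
dropped_events

# Side-by-side comparison of outcomes for kept and discarded rows.
outcome_by_flag <- table(sim$events, sim$is_flagged,
                         dnn = c("event", "flagged"))
outcome_by_flag
```

Comparing the two columns of the cross-tabulation shows whether the discarded group is dominated by one type of outcome, as the outs were in the real data.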

#### Update – my R code (added June 26)

Daniel asked if I could share my R code for doing this work, including creating the graphs. It is now posted on my GitHub Gist site.

### 5 responses

1. 420, 69. Someone has Trevor Bauer’s sense of humor.

2. Daniel S Steinberg

Will you share the R code snippet you used to clean the data? As an amateur programmer I can’t think of a way to do this efficiently.

3. I am getting a message that CalledStrike is not available for my version 3.6.0?

1. The CalledStrike package is available — just download it using install_github() from the devtools package.

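4. Can’t you just use anti_join to remove the bad observations from the IP data?

    # The middle of this pipeline was lost from the comment; the grouping
    # and the cutoff of 100 are assumed from the post.
    o19 <- sc_2019ip %>%
      group_by(launch_speed, launch_angle) %>%
      summarize(N = n()) %>%
      filter(N > 100) %>%
      arrange(desc(N))

    sc_2019ip %>% anti_join(o19)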