### Introduction

I’ve been reading Keith Law’s new book *Smart Baseball* (a Christmas gift from my daughter and son-in-law), and it has been an enjoyable read. It has a nice three-part structure: in Part 1, Keith trashes many of the traditional statistics such as batting average, pitcher wins, runs batted in, and saves; Part 2 gives an overview of modern baseball measures such as WAR, wOBA, and WPA; and the final part looks to the future of sabermetrics with the availability of Statcast data. I’d recommend this book to any reader interested in an overview of the current landscape of sabermetric thinking. In particular, in the “future” section of the book, Keith mentions the importance of two Statcast variables, launch angle and exit velocity off the bat, in determining favorable outcomes for hitters.

Law’s book motivated me to explore how one can use Statcast data to gain additional insight into hitting performance. In a previous post, I demonstrated how one can use a generalized additive model to predict the probability of a base hit given values of launch angle and exit velocity. Generally, balls hit hard with the “right” launch angle (think “line drives”) tend to be base hits. Okay, but there are other factors involved in “success,” such as the direction of the batted ball and the speed of the batter in reaching first base. How can one measure the importance of these other factors in producing base hits?

### Fitting the Model

I have Statcast data for all batted balls in the 2017 season. I fit a generalized additive model of the form

$$\log\left(\frac{p}{1-p}\right) = s(\mathrm{LA}, \mathrm{EV})$$

where p is the probability of a base hit, LA and EV are respectively the launch angle and exit velocity, and s() is a smooth function of the two inputs.

### Expected Number of Hits

For each player, I collect all of the launch angle and exit velocity values. Using the model, I can predict the probability of a hit for each value of (LA, EV). If I sum all of these probabilities, I get the Expected Number of Hits for each player. Below I’ve constructed a scatterplot of Expected Hits and Actual Hits for all 2017 batters. As one might expect, the model does a pretty good job of predicting the hits for all players.
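To make the summation concrete, here is a minimal Python sketch (the analysis in this post was done in R, so this is only an illustration). The player names and hit probabilities are invented; in practice each probability would be the fitted model's prediction for that batted ball's (LA, EV).

```python
from collections import defaultdict

# Hypothetical (player, predicted hit probability) pairs, one per batted ball.
# These values are invented for illustration; they are not model output.
batted_balls = [
    ("Player A", 0.75),
    ("Player A", 0.25),
    ("Player A", 0.5),
    ("Player B", 0.125),
    ("Player B", 0.625),
]

def expected_hits(balls):
    """Sum the predicted hit probabilities separately for each player."""
    totals = defaultdict(float)
    for player, prob in balls:
        totals[player] += prob
    return dict(totals)

expected = expected_hits(batted_balls)
print(expected)  # {'Player A': 1.5, 'Player B': 0.75}
```

Each batted ball contributes its hit probability rather than a 0/1 outcome, so the total can be read as the number of hits a typical batter would collect from that same set of batted balls.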

### Residuals

But we notice variation about the line, and we’re actually interested in how many more (or fewer) hits a player has than what is predicted based on the launch angle and exit velocity. So we consider the Residual

Residual = Hits – Expected_Hits

and we graph the Residuals against the number of Batted Balls for players below. We see some interesting (that is, large and small) residuals — but it is hard to interpret these since residuals tend to be larger for players with more batted balls.

### Standardized Scores

These residuals are easier to interpret after standardization: we divide each residual by the square root of the expected count to obtain the standardized score, or Z score:

Z = Residual / sqrt(Expected_Hits)

Generally if the model is a reasonable fit, we anticipate most of these Z scores to fall between -2 and +2. Here I graph the standardized scores and label the “interesting” Z scores that are larger than 2 or smaller than -2. In particular, Dee Gordon, Jose Altuve, and Ender Inciarte stand out at the high end — each has a much larger count of hits than one would predict on the basis of launch angle and exit velocity. Miguel Cabrera, on the other side, stands out with a large negative residual.

### Looking Further

Let’s focus on Dee Gordon, who had the largest Z score. Gordon had 201 hits, which was 39 more than we’d predict (162) on the basis of his launch angle and exit velocity values. Below we’ve constructed scatterplots of Dee’s launch angles and exit velocities: the top graph colors each point by the actual outcome (Hit or Out), and the bottom graph classifies each point as Hit or Out on the basis of its predicted hit probability. (If the predicted probability exceeds 0.5, we classify it as a hit.) It is clear that Gordon had many ground balls that went for hits, which contributes to his high hit total. (Ground balls with negative launch angles are predicted to be outs under the GAM.)

At the other extreme, Cabrera had only 117 hits, which was 28 below what one would predict (145) on the basis of his LA and EV values. Looking at the graph, we see that Cabrera had a large number of hard-hit balls that were outs. I don’t regard Cabrera as a speedy runner, so I would not expect many ground-ball hits, but I am surprised at the lack of success on hard-hit balls with good launch angles. Was he simply unlucky this season?
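As a quick check on the arithmetic, the short Python sketch below applies the standardization (residual divided by the square root of the expected count) to the rounded hit totals quoted above for Gordon and Cabrera. Because the expected counts are rounded to whole hits here, the resulting Z scores are approximate.

```python
import math

def z_score(hits, expected_hits):
    """Standardized residual: (hits - expected) / sqrt(expected)."""
    return (hits - expected_hits) / math.sqrt(expected_hits)

# Rounded (actual hits, expected hits) figures quoted in the text above.
players = {"Dee Gordon": (201, 162), "Miguel Cabrera": (117, 145)}

for name, (hits, exp_hits) in players.items():
    residual = hits - exp_hits
    print(f"{name}: residual = {residual:+d}, Z = {z_score(hits, exp_hits):+.2f}")
# Dee Gordon: residual = +39, Z = +3.06
# Miguel Cabrera: residual = -28, Z = -2.33
```

Both values fall outside the -2 to +2 band, which is why these two hitters stand out.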

### Is the Z Score Meaningful?

Some of you might be thinking that this standardized score is not meaningful: exit velocity and launch angle are the main explanations for success in hitting, and the remaining variation in hits is just due to good or bad luck. To show you that this **is not** true, I repeated this exercise for 2016 data. The scatterplot below plots the 2016 Z score (horizontal) against the 2017 Z score (vertical). There is a clear positive trend, which indicates that batters generally have an ability to do better (or worse) than what is predicted based on (EV, LA). Gordon and Cabrera were extreme hitters in both seasons.
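One way to put a number on a trend like this is the Pearson correlation between the two seasons’ Z scores. The Python sketch below shows the calculation only; the five pairs of Z scores are invented for illustration and say nothing about the strength of the actual 2016–2017 relationship.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical Z scores for the same five batters in two seasons
# (invented values, not taken from the Statcast data).
z_2016 = [2.0, 1.0, 0.0, -1.0, -2.0]
z_2017 = [1.0, 2.0, 0.0, -2.0, -1.0]

r = pearson_r(z_2016, z_2017)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")  # r = 0.80, r squared = 0.64
```

The squared correlation is the share of variance in one season’s Z scores that is linearly associated with the other season’s, which gives a rough sense of how much of the Z score reflects a persistent skill rather than chance.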

### Summing Up

This is a first attempt at using Statcast data to develop a measure of a hitter’s performance beyond what would be expected based on his (LA, EV) measurements. Looking further …

- What about the impact of the spray angle? Are some hitters especially good at hitting the ball away from the fielders? Are particular hitters good at “beating the shift”?
- I would expect that speed is a hitter talent, so speedy players like Dee Gordon with a positive Z score in one season would tend to have a positive Z score in the following season.
- What is the role of “luck” or chance variation in hits that can’t be easily explained by launch angle, exit velocity, and other factors?

If the interested reader would like to see and hopefully improve my R work, go to my GitHub Gist site. I use a dataset “statcast2017.csv” that I scraped using Bill Petti’s baseballr package.

Not sure I see a very ‘clear positive trend’ between year-on-year z-scores. Can you quantify the strength of association? I’d hazard a guess that >90% of variance is ‘just due to good or bad luck’.

Well, there is a positive trend, although I agree that it is not strong. I haven’t done the calculation, but I would guess we are talking about a 0.3 correlation. The point is that there is a speed factor in getting a hit.

Hi Jim,

This is great stuff. I am trying to use some of your code to visualize other players. When I try to use the scrape_statcast_savant_batter_all() function, I can only download a maximum of 30,000 rows at a time. I’m trying to download Statcast data for the entire 2017 season. Is this a common issue when retrieving the data?

Brendan:

What I did was to download Statcast data one week at a time. A little tedious, but I was successful in downloading all of the data for the 2016 and 2017 seasons.

Jim
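For anyone taking the week-at-a-time route, the date windows can be generated rather than typed by hand. This is a minimal Python sketch, not the code used for this post; the season start and end dates (April 2 and October 1, 2017) are assumptions for the regular season, and each window would still need to be passed to whatever download function you are using.

```python
from datetime import date, timedelta

def weekly_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end in 7-day chunks."""
    current = start
    while current <= end:
        # The final window is cut short so it never runs past the end date.
        last = min(current + timedelta(days=6), end)
        yield current, last
        current = last + timedelta(days=1)

# Assumed 2017 regular-season boundaries; adjust as needed.
windows = list(weekly_windows(date(2017, 4, 2), date(2017, 10, 1)))

print(len(windows))  # 27
for first_day, last_day in windows[:2]:
    print(first_day.isoformat(), last_day.isoformat())
# 2017-04-02 2017-04-08
# 2017-04-09 2017-04-15
```

Saving each week’s result to disk as it arrives means a failed download only needs that one window re-run.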

Thank you, Jim. I downloaded data one week at a time, too. I ended up with an object with about 735,000 rows. I forgot to save this object, so I just re-ran my code to download the data again. Interestingly, I only retrieved about 705,000 rows the second time despite running the exact same code. The only difference was that I downloaded the data a day apart. Have you experienced this?

I have about 735,000 rows for the 2017 season. I haven’t experienced your problem but sometimes the downloads don’t work for some reason.