#### Sale on 2nd Edition of ABWR

Today is Cyber Monday, and there is a sale (30% discount) on the 2nd edition of *Analyzing Baseball with R*, which will be available soon. If you are interested, check out the Chapman and Hall book site.

#### Introduction

Most baseball fans are aware of the recent surge in home run hitting. There were 5610 home runs in the 2016 season, followed by a big increase to 6105 in the 2017 season. Then things quickly cooled off: only 5585 home runs were hit in the 2018 season. That raises some questions about what is going on. Was the baseball juiced in 2017? What impact does the current emphasis on increased launch angles have on home runs? Did the weather play a role in the decrease in home runs in 2018?

With the availability of Statcast data, there are opportunities to explore the pitch-by-pitch data to help answer questions about home run hitting. Here I am going to focus on a related issue. We know that the distance of a ball put in play depends on the launch velocity and the launch angle. This raises several questions:

- What was the relationship between distance and (launch angle, exit velocity) in the 2017 season?
- Has there been a change in this relationship that might account for a drop in home run hitting in the 2018 season?

#### An Initial Exploration

To get started, I constructed a scatterplot of the launch angle (degrees) and launch speed (mph) for a random sample of 1000 balls in play for the 2017 season. I have colored the points by the distance traveled (feet). Several things are clear from this graph. First, it is important to get the ball in the air (positive launch angle) if a batter wishes to get any distance. Also, there is a “sweet spot” of angles between 20 and 40 degrees and launch speeds above 90 mph that seems to result in the longest distances.

#### Modeling

To get a better understanding of the relationship of distance with launch angle and launch velocity, I fit a generalized additive model (gam). Basically, it says that distance is a smooth function of the two variables. I fit this model to all balls put in play for the 2017 season. To describe the predictions, I use a contour graph. Below I display contours where the predicted distance is equal to 200, 250, 300, 350, and 400 feet. This graph demonstrates the importance of launch angle — note how the predicted distance increases rapidly as the launch angle changes from 10 to 20 degrees. Also the yellow line brackets the sweet spot where the predicted distance exceeds 400 feet.

#### Comparing 2017 and 2018 seasons

We now wish to compare the distances of balls put into play in the 2017 and 2018 seasons. We know there were significantly fewer home runs hit in 2018, which suggests that balls didn’t travel as far that season. Since launch angle and launch velocity play an important role in distance traveled, we’d like to control for these variables in this exploration.

Here’s what I did to compare the 2017 and 2018 distances.

- Let’s focus on a particular batter, say Mike Trout. For each of Trout’s batted balls in the 2018 season, I can use the launch angle and exit velocity values to predict the distance traveled using the gam model fit to the 2017 data. I decided to consider only batted balls where the launch angle was positive, since those are the balls that can be hit for a substantial distance.
- For the 2018 season, Trout had 272 balls hit in the air. For each of those balls in play, I predict the distance traveled using my 2017 gam model. The median distance of these predicted distances was 278.95 feet. The actual median distance of these balls in play was 283 feet. So actually, Trout hit these balls, on average, about 4 feet further in the 2018 season than one would predict based on the 2017 gam model.
- Okay, this is interesting since I initially thought the median distance would be smaller in 2018 compared to 2017 (since we had fewer home runs hit in 2018). But of course Trout is only one batter. I repeated this procedure for all batters in the 2018 season. For the 2018 launch angles and exit velocities, I predict the distances traveled (for balls hit in the air) and compute the **median 2018 distance minus the median predicted distance from the 2017 model.** Below I graph these differences against the number of batted balls (I only show the points for players who had at least 100 batted balls in 2018). I use a loess curve to show the pattern, and the red horizontal line is at zero. Now this is weird. Obviously there is a lot of scatter (some players hit for higher average distances in 2018 and other players hit for lower average distances), but generally players tend to hit for higher distances, on average, in 2018, although the difference is only about 3 feet.
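
The comparison described in the steps above can be sketched in a few lines. This is only a minimal Python illustration, not the code used for this post: `predict_distance` is a hypothetical stand-in for the gam fit to the 2017 data, the toy batted balls are made up, and applying the 100-ball cutoff to balls hit in the air is an assumption.

```python
from collections import defaultdict
from statistics import median


def median_gap_by_batter(batted_balls, predict_distance, min_balls=100):
    """Return {batter: (n_air_balls, median actual - median predicted)}.

    batted_balls: iterable of (batter, launch_angle, launch_speed, distance)
    predict_distance: hypothetical stand-in for the 2017 model's prediction
    min_balls: cutoff applied to balls hit in the air (an assumption)
    """
    actual = defaultdict(list)
    predicted = defaultdict(list)
    for batter, angle, speed, distance in batted_balls:
        if angle <= 0:
            continue  # keep only balls hit in the air (positive launch angle)
        actual[batter].append(distance)
        predicted[batter].append(predict_distance(angle, speed))
    return {
        batter: (len(dists), median(dists) - median(predicted[batter]))
        for batter, dists in actual.items()
        if len(dists) >= min_balls
    }


def toy_predict(angle, speed):
    # Made-up linear rule for illustration only; NOT the fitted gam.
    return 3.0 * speed


# Made-up batted balls: (batter, launch angle, launch speed, distance in feet).
toy_balls = [
    ("A", 25, 100, 310),
    ("A", 30, 95, 280),
    ("A", -5, 90, 20),   # ground ball, dropped by the positive-angle filter
    ("A", 15, 90, 275),
]

gaps = median_gap_by_batter(toy_balls, toy_predict, min_balls=1)
print(gaps)
```

A positive gap means the batter's 2018 balls traveled further than the 2017-based prediction, which is the sign convention used for Trout above (283 minus 278.95, or about 4 feet).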

#### Takeaways

- Of course, there were significantly fewer home runs hit in 2018. In another study, I explored the characteristics of batted balls in the so-called red region (this is the region of launch angle and launch velocity values where a home run is likely). What I found is that there were more balls hit in this red region in 2018; that is, there were more opportunities to hit home runs in 2018. But the actual proportion of balls in this red region that became home runs dropped substantially. That home run study suggests that balls in play did not carry as well in the 2018 season. This brief analysis of distances traveled says something different: it suggests that after adjusting for launch angle and launch velocity, balls actually traveled further, on average, in 2018.
- Well, at the least, this work suggests that there is more to be done to understand the reasons for the drop in home run hitting in 2018. I will certainly do more exploration, but I encourage the interested reader to explain the possible inconsistency between increased distance and decreased home run production.

I read a news article about Kerry Whisnant’s run formula, and found the paper online.

https://www.researchgate.net/profile/Kerry_Whisnant/publication/266473425_Beyond_Pythagorean_expectation_How_run_distributions_affect_win_percentage/links/574b1cb908ae5bf2e63f33f1.pdf

Wanting to learn more about run prediction, I found Carl Morris’ “A simple runs per game formula”:

http://www.stat.harvard.edu/People/Faculty/Carl_N._Morris/Carl_N._Morris_Sports_Articles/Runs_per_Game_Paper_(Short%20Version).txt

I’m just learning “R”, and turned Carl Morris’ equations into a simple “R” program. His results gave a single number, but I was aware from posts by Tangotiger and Keith Woolner that runs per inning follow a roughly geometric distribution, and that variance in actual runs could affect relative win/loss outcomes.

Your June 20, 2016 post was great, and filled in the blanks. I just used 9999 simulated innings rather than your 10000, broke them into a 1111 by 9 matrix, and summed the strings of 9 to get 1111 game totals.
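
Roughly, the regrouping step looks like the sketch below, written in Python rather than “R”. The inning simulator here is only a placeholder that draws from a geometric distribution with a made-up stopping probability `p_stop`; it is not the simulation from the June 20, 2016 post, and the function names are hypothetical.

```python
import random


def simulate_inning_runs(n_innings, p_stop=0.67, seed=2018):
    """Placeholder inning simulator (an assumption, not the post's model).

    Each inning keeps adding a run with probability 1 - p_stop, so the
    runs per inning follow a geometric distribution.
    """
    rng = random.Random(seed)
    runs = []
    for _ in range(n_innings):
        r = 0
        while rng.random() > p_stop:
            r += 1
        runs.append(r)
    return runs


def innings_to_games(inning_runs, innings_per_game=9):
    """Sum consecutive strings of innings_per_game innings into game totals."""
    if len(inning_runs) % innings_per_game != 0:
        raise ValueError("number of innings must be a multiple of innings_per_game")
    return [
        sum(inning_runs[i:i + innings_per_game])
        for i in range(0, len(inning_runs), innings_per_game)
    ]


inning_runs = simulate_inning_runs(9999)      # 9999 = 1111 games x 9 innings
game_totals = innings_to_games(inning_runs)
print(len(game_totals), sum(game_totals) / len(game_totals))
```

Using 9999 instead of 10000 is what makes the split into whole 9-inning games come out evenly.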

When I plugged in the 2018 totals for all major league teams, your simulation showed the geometric distribution in runs but underpredicted total runs.

I thought about that for a bit to figure out what the problem was. Base running and sacrifice hits and flies could have a minor effect, but the MAJOR discrepancy was with errors. A player could reach first on an error: the stats would show an at bat with an out where there WASN’T an out, and in effect, a base hit.

https://www.baseball-reference.com/leagues/MLB/2018-standard-fielding.shtml

The fielding data give:

- 130467 putouts (po)
- 44657 assists (a)
- 2792 errors (e)

e/(po + a) = 2792/175124 = .015943 for the fielding error rate.

Plays without assists = (130467 - 44657)/130467 = 85810/130467 = .657714

Plays with assists = 44657/130467 = .342286

A play requiring an assist is completed (1 - .015943)^2 = .968368 of the time, so its error rate is 1 - .968368 = .031632.

Assuming errors are uniformly distributed, the full error rate is

.657714 * .015943 + .342286 * .031632 = .010486 + .010827 = .021313

which gives a fielding average of 1 - .021313 = .978687.
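
The arithmetic above can be checked with a short script. This is a minimal Python sketch of the same calculation; the variable names are my own, and the two-fielder treatment of assisted plays is the assumption stated above.

```python
# 2018 MLB fielding totals quoted above.
po, a, e = 130467, 44657, 2792

# Error rate per fielding chance (putouts plus assists).
rate_per_chance = e / (po + a)

# Share of putouts made without and with an assist.
share_unassisted = (po - a) / po
share_assisted = a / po

# Assumption from the text: an assisted play needs two clean chances,
# so it fails unless both are handled without an error.
rate_assisted = 1 - (1 - rate_per_chance) ** 2

# Weighted error rate per play, and the matching fielding average.
full_error_rate = (share_unassisted * rate_per_chance
                   + share_assisted * rate_assisted)
fielding_average = 1 - full_error_rate

print(f"{rate_per_chance:.6f} {rate_assisted:.6f} "
      f"{full_error_rate:.6f} {fielding_average:.6f}")
```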

This error rate won’t affect walks or home runs since the ball is not in play on those occasions. I figured that:

- effective singles will increase by (ab - h) * .021313
- effective doubles will increase by (s) * .021313
- effective triples will increase by (d) * .021313
- effective home runs will increase by (t) * .021313
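
As a rough Python sketch of that adjustment (not the “R” code I actually ran): the function name is hypothetical, and the input totals in the example are placeholders chosen for illustration, not the real 2018 numbers. It only computes the increases listed above.

```python
def error_adjusted_increments(ab, h, singles, doubles, triples,
                              error_rate=0.021313):
    """Increase in each 'effective' hit type caused by fielding errors.

    Follows the rules listed above: outs on balls in play become singles,
    singles become doubles, doubles become triples, triples become home runs.
    """
    return {
        "singles": (ab - h) * error_rate,
        "doubles": singles * error_rate,
        "triples": doubles * error_rate,
        "home_runs": triples * error_rate,
    }


# Placeholder team totals for illustration only (NOT real 2018 data).
increments = error_adjusted_increments(
    ab=5500, h=1400, singles=900, doubles=280, triples=30
)
for hit_type, extra in increments.items():
    print(f"{hit_type}: +{extra:.1f}")
```

The adjusted counts would then be the original totals plus these increments before being fed to the simulation.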

I used the “corrected” 2018 major league totals and multiplied the mean runs per game by 162 to compare to actual results. Your “adjusted for errors” program (999999 simulations) was within 1% of the major league totals for 2018. I think the remaining difference arises because batting ability is not distributed evenly through the batting order: the better hitters are grouped at the front of the order and the worse hitters at the back.

Since high-slugging teams have less variance than teams relying more on singles and bases on balls, they should win more than their share based on relative runs per inning produced. I’m going to play with your program and Carl Morris’ program, trying to find teams with equal run production but different distributions, and see how that affects outcomes.