You might have seen a few weeks back that Fernando Tatis Jr. hit a home run. This isn’t novel; Fernando Tatis Jr. has hit a lot of home runs this year! But this particular home run received backlash, even from his own manager. With his team up 10-3 in the eighth inning, Tatis launched a home run with the bases loaded on a 3-0 count, which apparently breaks an unwritten rule in baseball (against hitting a home run on a 3-0 count past the 6th inning with your team up by at least 6 runs). In response, Rangers pitcher Ian Gibaut threw at Manny Machado on the very next pitch.

While unwritten rules are exactly that, unwritten, they may be detectable if opponents respond consistently when one is broken. Below, I analyze (i) how taboo it is for Tatis to hit a home run in the given situation, based on past trends, and (ii) whether we can detect other unwritten rules.

**2019 Trends**

The clearest way a team can protest the violation of an unwritten rule is by hitting a batter (as seen with the Rangers in the Tatis example), and more narrowly, by hitting a batter on the first pitch of the next plate appearance. Here is how the hit-by-pitch (HBP) rate in 2018-2019 differed between situations like Machado’s (past the 6th inning, with at least a 6-run lead, when the previous batter hit a home run) and all other times:

| Situation | HBP% | First Pitch HBP% |
|---|---|---|
| Past the 6th inning, previous batter hit a home run with a 6+ run lead | 1.348% | 0.054% |
| All other situations | 1.019% | 0.018% |

In situations where a player broke even just a portion of the rule Tatis broke (I am not even considering 3-0 counts here), the following batter gets hit more often, and the gap is largest for first-pitch hit-by-pitches (risk ratio > 3). This seems like strong evidence that teams see Tatis’s actions as wrong.
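To make the comparison concrete, the risk ratios can be recomputed from the rounded percentages in the table. This is a minimal sketch in base R; the variable names are mine, and because the inputs are rounded, the first-pitch ratio comes out at about 3 rather than strictly above it.

```r
# Rates copied from the table above, in percent (already rounded).
rates <- data.frame(
  situation           = c("late blowout HR", "all other"),
  hbp_pct             = c(1.348, 1.019),
  first_pitch_hbp_pct = c(0.054, 0.018)
)

# Risk ratio = rate in the "unwritten rule" situation / rate everywhere else.
risk_ratio_hbp         <- rates$hbp_pct[1] / rates$hbp_pct[2]
risk_ratio_first_pitch <- rates$first_pitch_hbp_pct[1] / rates$first_pitch_hbp_pct[2]

print(round(c(hbp = risk_ratio_hbp, first_pitch = risk_ratio_first_pitch), 2))
```

The overall HBP rate rises by roughly a third, while the first-pitch rate roughly triples, which is why the first-pitch response gets its own model below.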

**The Model**

For this post, we will create two models: one that estimates the probability that the next plate appearance results in a hit-by-pitch, and one that estimates the probability that the first pitch of the next plate appearance is a hit-by-pitch. The idea is that we have evidence that the latent variable of breaking an unwritten rule is associated with the opposing team choosing to hit a batter right away as a sign of protest. While breaking an unwritten rule is not the sole cause of hit-by-pitches, it may cause HBP frequently enough for patterns to show up.

To choose an appropriate functional form here, it is important to remember our objective: detecting a set of rules. We will therefore use decision tree-based modelling, which naturally lets us make predictions based on whether a plate appearance meets a set of threshold-based criteria. Decision tree modelling has become especially popular after The Top 10 Algorithms in Data Mining ranked two popular decision tree models, the C4.5 algorithm and the CART algorithm, #1 and #10, respectively. Recently, the New York Times posted a lovely example of decision tree modelling to predict one’s political party affiliation.
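To illustrate what "a set of threshold-based criteria" means, the rule Tatis supposedly broke can be written as a conjunction of thresholds, which is the same shape as a root-to-leaf path in a decision tree. A small sketch in base R, where the function and argument names are hypothetical:

```r
# Hypothetical encoding of the unwritten rule from the introduction:
# a home run on a 3-0 count, past the 6th inning, with a lead of 6+ runs.
violates_tatis_rule <- function(inning, lead, balls, strikes, event) {
  inning > 6 & lead >= 6 & balls == 3 & strikes == 0 & event == "HR"
}

# The Tatis plate appearance: eighth inning, up 10-3, 3-0 count, home run.
tatis <- violates_tatis_rule(inning = 8, lead = 10 - 3, balls = 3, strikes = 0, event = "HR")

# The same swing in the fourth inning would not trip the rule.
early <- violates_tatis_rule(inning = 4, lead = 7, balls = 3, strikes = 0, event = "HR")
```

A fitted tree produces rules of this form on its own; the hope is that one of them lines up with something players would recognize.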

Specifically, I use the CART algorithm here, via the `rpart` package. There are many extensions of CART (such as bootstrap aggregation, random forests, gradient-boosted models, and Bayesian additive regression trees) that grow many trees, each of which can focus on different variables or different portions of the data. A single CART tree, however, is very interpretable, and for the purpose of precisely defining a set of rules, we care much more about extracting an accessible rule than about increased predictive performance.

I use 2018 Retrosheet data to fit the model and 2019 data for testing. For the generic HBP model, I include the inning, the score differential, the last 3 event types (e.g., HR), the current base state, the current number of outs, the balls, and the strikes as covariates. For the first pitch HBP model, I use everything but balls and strikes (because every first pitch HBP tautologically has 0 balls and 0 strikes before the event). Below is my code for each regression:
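As a rough sketch of what fits like these might look like with `rpart` (this is not the original code): the data frame below is synthetic, every column name is hypothetical, the HBP rates are inflated so the toy data contains both classes, and `cp = 0` is my own assumption, since the post does not state a complexity parameter.

```r
library(rpart)

set.seed(2018)
n <- 5000
events <- c("HR", "K", "BB", "1B", "OUT")
bases  <- c("000", "100", "010", "001", "110", "101", "011", "111")

# Synthetic stand-in for one season of plate appearances (NOT Retrosheet data).
pa_2018 <- data.frame(
  inning       = sample(1:9, n, replace = TRUE),
  score_diff   = sample(-8:8, n, replace = TRUE),
  prev_event_1 = factor(sample(events, n, replace = TRUE)),
  prev_event_2 = factor(sample(events, n, replace = TRUE)),
  prev_event_3 = factor(sample(events, n, replace = TRUE)),
  base_state   = factor(sample(bases, n, replace = TRUE)),
  outs         = sample(0:2, n, replace = TRUE),
  balls        = sample(0:3, n, replace = TRUE),
  strikes      = sample(0:2, n, replace = TRUE)
)
# Random labels with inflated rates, purely so the example runs.
pa_2018$hbp             <- factor(rbinom(n, 1, 0.05), levels = c(0, 1))
pa_2018$first_pitch_hbp <- factor(rbinom(n, 1, 0.02), levels = c(0, 1))

# Depth cap of 12 as in the post; cp = 0 is an assumption.
ctrl <- rpart.control(maxdepth = 12, cp = 0)

# Generic HBP model: all nine covariates.
fit_hbp <- rpart(
  hbp ~ inning + score_diff + prev_event_1 + prev_event_2 + prev_event_3 +
    base_state + outs + balls + strikes,
  data = pa_2018, method = "class", control = ctrl
)

# First pitch HBP model: same covariates minus balls and strikes.
fit_first_pitch <- rpart(
  first_pitch_hbp ~ inning + score_diff + prev_event_1 + prev_event_2 +
    prev_event_3 + base_state + outs,
  data = pa_2018, method = "class", control = ctrl
)

# Predicted HBP probabilities (here on the training rows; a held-out season
# would be passed as newdata instead).
p_hbp <- predict(fit_hbp, newdata = pa_2018, type = "prob")[, "1"]
```

Because the labels here are pure noise, any splits the trees find are meaningless; the point is only the shape of the two calls.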

I limit the tree depth to 12 (so every rule will have at most 12 criteria). Since we are considering 9 variables, this leaves room for multiple criteria on the same variable (e.g., this unwritten rule applies after the 2nd inning but before the 7th), though nothing guarantees the tree will use them that way.

**Results**

We plot the Receiver Operating Characteristic (ROC) curves to evaluate quality of fit on the training and testing datasets for both response variables (HBP and First Pitch HBP), using `roc()` from the `pROC` package to create an ROC object for each model and dataset, and the `ggroc()` function from the same package to plot:

In an ROC curve, the further the curve pulls away from the black line toward the left-hand side, the better, since that means we deviate more from random guessing. We can evaluate the quality of each model further by calculating the Area Under the Curve (AUC), where 0.5 is equivalent to random guessing (i.e., the area underneath the black line in our plot), and 1 represents a model that perfectly separates HBP from non-HBP. We can see visually that both HBP models are only slightly better than random guessing, with the generic HBP model performing better (AUCs of 0.71 in training, 0.61 in testing) than the first pitch HBP model (AUCs of 0.70 in training, 0.53 in testing). Unfortunately, it looks difficult to predict HBP, and thus it will likely be hard to extract latent unwritten rules using this approach.
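For intuition about the two reference points, AUC can be computed directly as the probability that a randomly chosen HBP gets a higher predicted score than a randomly chosen non-HBP (the Mann-Whitney formulation). The helper below is a base R sketch with a hypothetical name, not the routine used for the numbers above:

```r
# AUC via ranks: ties receive average ranks, so a tie counts as half a win.
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Every HBP scored above every non-HBP: perfect separation.
auc_perfect <- auc_rank(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))

# Identical scores for everyone: no better than guessing.
auc_guess <- auc_rank(c(0, 1, 0, 1), c(0.5, 0.5, 0.5, 0.5))

# One HBP outranked by one non-HBP: somewhere in between.
auc_mixed <- auc_rank(c(0, 0, 1, 1), c(0.1, 0.8, 0.2, 0.9))
```

Read this way, a test AUC of 0.53 means the first pitch model ranks an actual HBP above a non-HBP only slightly more often than a coin flip would.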

Even though our model only weakly predicts HBP in the test set, we can still examine whether there are any rules that seem especially influential within that weak signal. To examine the decision tree model, we can plot the tree, via the `rattle` package in R:

The blue nodes represent leaves that have more non-HBP than HBP, and green represents more HBP than non-HBP (with darker shading indicating more uniformity). To detect rules simple enough to remember and pass on over time, we would want green nodes towards the top of the tree. There is one green node towards the top of the graph, but otherwise, most of the paths that lead to heavily concentrated HBP bins are likely too complex to be unwritten rules. The criteria for this simple enough rule are: (1) the previous event is an interference play, (2) after the 7th inning, and (3) with at least 1 ball. My guess is that this is just a sparsely observed event rather than one that is actually insightful about unwritten rules.
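The same reading of the plot can be done programmatically: find the leaves whose majority class is HBP (the "green" ones) and print the criteria on the path to each. The sketch below uses `path.rpart()` from `rpart` on synthetic data with a planted rule; the data, the column names, and the planted rule itself are all made up for illustration.

```r
library(rpart)

set.seed(42)
n <- 4000
toy <- data.frame(
  inning     = sample(1:9, n, replace = TRUE),
  score_diff = sample(-8:8, n, replace = TRUE),
  outs       = sample(0:2, n, replace = TRUE)
)

# Planted "unwritten rule": late innings with a big lead make HBP very likely.
planted <- toy$inning >= 7 & toy$score_diff >= 6
toy$hbp <- factor(rbinom(n, 1, ifelse(planted, 0.8, 0.01)), levels = c(0, 1))

fit <- rpart(hbp ~ inning + score_diff + outs, data = toy,
             method = "class", control = rpart.control(maxdepth = 12))

# Leaves whose predicted class is the second factor level ("1", i.e. HBP).
is_leaf      <- fit$frame$var == "<leaf>"
hbp_majority <- is_leaf & fit$frame$yval == 2
node_ids     <- as.numeric(rownames(fit$frame))[hbp_majority]

# Root-to-leaf criteria for each HBP-majority leaf.
rules <- path.rpart(fit, nodes = node_ids, print.it = FALSE)

# Number of criteria per rule (the first path element is always "root").
rule_lengths <- lengths(rules) - 1
print(rules)
```

Sorting candidate rules by `rule_lengths` is one way to surface the short, memorable ones first, and checking the `n` column of `fit$frame` for those leaves would flag rules that rest on only a handful of plate appearances.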

**Additional Thought:**

- A new development (the paper came out in the past month!) is decision tree models designed for sparse data, which could be especially useful for this kind of problem, where our measure of interest occurs 0-2% of the time. Unfortunately, there is no R implementation yet, but you can follow the GitHub repository for it here.