I am working on a method for determining the efficiency of a team in using its on-base events to score runs. Some baseball teams tend to scatter their on-base events through the nine innings and score few runs. (I am thinking of the 2016 Phillies, but maybe their problem is simply not producing many on-base events.) In contrast, other teams seem to be effective in bunching their on-base events to produce runs. In working on this method, I wanted to write a function to simulate the number of runs scored in a half inning given the overall probabilities of the six basic events OUT, SINGLE, DOUBLE, TRIPLE, HOME RUN, and WALK/HBP, and some basic runner advancement rules. Anyway, I thought it was a good topic for a post.

#### Runner Movements

In my simulation, I wanted to use realistic runner movements on base hits. For example, if there is a runner on first, then sometimes a single leaves runners on first and second, and sometimes it leaves runners on first and third. Using the 2015 Retrosheet data, I found the frequencies of the different outcomes, which can be converted to probabilities. For example, when there is a runner on 1st, a single advanced the runner to 2nd 3940 times and to 3rd 1411 times, so the probabilities of the final states “runners on 1st and 2nd” and “runners on 1st and 3rd” are approximated by 3940 / (3940 + 1411) = 0.736 and 1411 / (3940 + 1411) = 0.264. In a similar fashion, I found the runner advancement probabilities for singles and doubles in all runner situations. Here is the runner advancement matrix for a single. (My notation is that “110” means runners on first and second.)

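| Start / End | 000 | 100 | 010 | 001 | 110 | 101 | 011 | 111 |
|---|---|---|---|---|---|---|---|---|
| 000 | 0 | 1.000 | 0 | 0 | 0.000 | 0.000 | 0 | 0.000 |
| 100 | 0 | 0.000 | 0 | 0 | 0.736 | 0.264 | 0 | 0.000 |
| 010 | 0 | 0.576 | 0 | 0 | 0.000 | 0.424 | 0 | 0.000 |
| 001 | 0 | 1.000 | 0 | 0 | 0.000 | 0.000 | 0 | 0.000 |
| 110 | 0 | 0.000 | 0 | 0 | 0.359 | 0.239 | 0 | 0.401 |
| 101 | 0 | 0.000 | 0 | 0 | 0.766 | 0.234 | 0 | 0.000 |
| 011 | 0 | 0.518 | 0 | 0 | 0.000 | 0.482 | 0 | 0.000 |
| 111 | 0 | 0.000 | 0 | 0 | 0.344 | 0.223 | 0 | 0.433 |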
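As a minimal sketch of how one row of this matrix can be derived, the two counts quoted above for a single with a runner on first can be normalized with `prop.table`. The names `counts_100` and `probs_100` are hypothetical and used only for illustration; only the counts themselves come from the text.

```r
# End-state counts after a single with a runner on first (2015 Retrosheet,
# as quoted in the text). `counts_100` is a hypothetical name.
counts_100 <- c("110" = 3940, "101" = 1411)

# Convert the counts to estimated transition probabilities.
probs_100 <- prop.table(counts_100)
round(probs_100, 3)

# Draw one end state with these probabilities, in the same way the
# simulation draws a new base state.
set.seed(1)
sample(names(probs_100), size = 1, prob = probs_100)
```

The rounded values match the 0.736 and 0.264 entries in the “100” row of the matrix.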

#### The Probabilities of the Six Basic Events

Also to make my simulation more realistic, I used the frequencies of the six basic events from 2015 data.

| OUT | BB | 1B | 2B | 3B | HR |
|---|---|---|---|---|---|
| 0.6817 | 0.0863 | 0.1543 | 0.0454 | 0.0052 | 0.0270 |

Note that I’m ignoring other events such as steals, sacrifices, wild pitches and passed balls that help to produce runs, but I’m interested in how well I can estimate run scoring just using this information.
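To illustrate how these six probabilities drive the simulation, here is a small sketch that draws many plate appearances with `sample` and compares the simulated frequencies with the inputs. The vector name `prob` and the seed are assumptions for illustration; the values are the 2015 frequencies given above.

```r
# Event probabilities from the 2015 frequencies above.
prob <- c(OUT = 0.6817, BB = 0.0863, "1B" = 0.1543,
          "2B" = 0.0454, "3B" = 0.0052, HR = 0.0270)

# The rounded values sum to 0.9999 rather than 1; sample() rescales
# the weights, so no manual normalization is needed.
sum(prob)

# Simulate many plate appearances and tabulate the event frequencies.
set.seed(2015)
events <- sample(names(prob), size = 100000, replace = TRUE, prob = prob)
sim_freq <- table(factor(events, levels = names(prob))) / length(events)

# Compare the inputs with the simulated frequencies side by side.
round(rbind(input = prob, simulated = as.numeric(sim_freq)), 3)
```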

#### The Simulation Function

On my gist site, I have two functions: `runs_setup` defines the runner advancement matrices and a vector that gives the probabilities of the six basic events, and `simulate_half_inning` does the actual simulation. Here’s the crux of the actual simulation. Using the `sample` function, I simulate an event. I then simulate the runner advancement. Sometimes the advancement is deterministic (for example, the bases will always be “001” after a triple), but otherwise I simulate the runner advancement using the `Prob_Single` and `Prob_Double` matrices. I update the number of outs and runs scored, and then repeat until we have three outs.

```r
while(outs < 3){
  # draw one of the six basic events
  event <- sample(names(setup$prob), size=1, prob=setup$prob)
  # stochastic advancement: sample the new base state from the
  # matrix row for the current base state
  if (event=="1B") new_bases <- sample(all_bases, 1, prob=setup$Prob_Single[bases, ])
  if (event=="2B") new_bases <- sample(all_bases, 1, prob=setup$Prob_Double[bases, ])
  if (event=="BB") new_bases <- sample(all_bases, 1, prob=setup$Prob_Walk[bases, ])
  # deterministic advancement
  if (event=="3B") new_bases <- "001"
  if (event=="HR") new_bases <- "000"
  if (event=="OUT") new_bases <- bases
  # update outs, runs, and the base state
  outs <- outs + (event == "OUT")
  runs <- runs - (event == "OUT") + runs.transition(bases, new_bases)
  bases <- new_bases
}
```
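The loop calls `runs.transition`, which lives in the gist and is not shown here. The sketch below is only a guess at how such a helper might work, assuming it counts runners before and after the event: every runner on base beforehand, plus the batter, is either still on base afterwards or has scored. The names `count_runners` and `runs_transition_sketch` are hypothetical and are not the author's code. Under this assumption the formula alone would credit a run on an out, which would explain the `- (event == "OUT")` correction in the loop.

```r
# Hypothetical helper: count the runners in a base state such as "101".
count_runners <- function(state) {
  sum(as.integer(strsplit(state, "")[[1]]))
}

# Assumed accounting: runners before + batter - runners after = runs scored.
# This is a guess at what the gist's runs.transition() computes.
runs_transition_sketch <- function(bases, new_bases) {
  count_runners(bases) + 1 - count_runners(new_bases)
}

runs_transition_sketch("110", "000")  # home run with two on: 3 runs
runs_transition_sketch("100", "101")  # single, runner goes first to third: 0 runs
runs_transition_sketch("111", "111")  # bases-loaded walk: 1 run

# On an out the bases are unchanged, so the formula alone gives 1;
# subtracting the OUT indicator brings the net change back to 0.
runs_transition_sketch("110", "110") - 1
```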

#### How Do the Simulation Results Compare with Real Baseball?

I use the `runs_setup` function to do the setup, and the `replicate` function together with `simulate_half_inning` to simulate the runs scored for 10,000 half-innings:

```r
st <- runs_setup()
R <- replicate(10000, simulate_half_inning(st))
round(prop.table(table(R)), 3)

sim.R
    0     1     2     3     4     5     6     7     8    10    12 
0.762 0.123 0.060 0.033 0.014 0.005 0.002 0.001 0.000 0.000 0.000 
```

In these simulations, there were 0 runs scored in 76.2% of the half-innings, 1 run scored in 12.3% of the half-innings, and so on.
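One quick summary of this distribution is the mean number of runs per half-inning. The sketch below computes it from the rounded proportions printed above; the vector names `runs`, `prop`, and `mean_runs` are assumptions for illustration, and the nine-inning scaling is a rough approximation.

```r
# Simulated proportions copied from the output above.
runs <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12)
prop <- c(0.762, 0.123, 0.060, 0.033, 0.014, 0.005,
          0.002, 0.001, 0.000, 0.000, 0.000)

# Mean runs per half-inning implied by the rounded table.
mean_runs <- sum(runs * prop)
mean_runs

# Rough scaling to nine innings (ignores extra innings and
# unplayed bottom halves of the ninth).
9 * mean_runs
```

With the simulated vector `R` in hand, `mean(R)` gives the same summary without the rounding.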

I’m interested in how the simulated results compare to the real run production during the 2015 season. I’ve graphed the proportions of 0, 1, 2, … runs scored (in a half-inning) using the simulated and actual data. Actually, I have plotted the logit of each probability, which is log(prob / (1 - prob)), since this makes it easier to visualize small probabilities.
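As a small sketch of the logit transformation, base R's `qlogis` can be applied to the simulated proportions reported earlier. The vector names `prop` and `logit_prop` are assumptions for illustration; this is not the plotting code used for the graph.

```r
# Simulated proportions for 0 through 7 runs, from the earlier output.
prop <- c("0" = 0.762, "1" = 0.123, "2" = 0.060, "3" = 0.033,
          "4" = 0.014, "5" = 0.005, "6" = 0.002, "7" = 0.001)

# qlogis() is base R's logit function: log(p / (1 - p)).
logit_prop <- qlogis(prop)
round(logit_prop, 2)

# A proportion that rounds to 0 maps to -Inf, so it cannot be
# placed on the logit scale.
qlogis(0)
```

On this scale the small proportions for 4 or more runs are spread out rather than squeezed near zero.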

Looking at the graph, it seems that the simple model predicts a higher proportion of 0-run innings and a lower proportion of 1-run innings than actually occurred. That makes sense, since the model ignores events like stolen bases and sacrifices that tend to produce a single run. It is interesting that the model seems to do well in predicting the proportions for 2 runs and higher.

In my work, one common theme is that simple models tend to be remarkably accurate in predicting the variability in baseball data. This simulation model is certainly wrong (for example, I was ignoring the variability in run scoring between teams), but the model seems useful in understanding the variation in run scoring.

Awesome work, Jim. Are the simulated points for 3 < runs < 12 hiding perfectly behind the actual points?

Josh, yes they are — I should probably redraw so that one set of points is not hidden.