# Predicting March Madness

#### Introduction

Bowling Green State University is transitioning to online teaching starting next week. This is a little sad since I enjoy teaching classes face-to-face, and it will be challenging to replicate the usual class atmosphere using conferencing software. One of my student groups (specifically my ACTION research group) was working on methods for predicting first-round games for March Madness, and they were planning to enter their rationale and first-round picks in a contest sponsored by the American Statistical Association. Since the tournament has been cancelled, we won’t be able to check the accuracy of our predictions. But I thought this might give me an opportunity to explain one “random” method for predictions, and we can see how this method performs using results from the 2019 first-round games.

#### The Rules

This particular March Madness “Pick ‘Em Upset Challenge” was based solely on predictions of the 32 first-round games of the tournament. Here are the contest rules. You get 2 points for each correct pick. In addition, you get bonus points for correctly picking upsets. The bonus points are the difference in the seeds of the two teams. For example, if you pick a 10 seed to defeat a 7 seed and the 10 seed does win, you get 2 points for the correct pick and |10 – 7| = 3 bonus points, for a total of 5. The objective is to get the most total points.
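To make the scoring rule concrete, here is a minimal R sketch. The function name `pick_score()` and its arguments are illustrative names chosen for this example, not part of the contest materials.

```r
# Score one pick under the contest rules described above.
#   picked_lower: TRUE if we picked the lower seed (the upset)
#   lower_won:    TRUE if the lower seed actually won
#   seed_diff:    absolute difference between the two seeds
pick_score <- function(picked_lower, lower_won, seed_diff) {
  correct <- picked_lower == lower_won
  # Bonus points are earned only for a correct upset pick
  bonus <- ifelse(correct & picked_lower, seed_diff, 0)
  2 * correct + bonus
}

pick_score(TRUE,  TRUE,  abs(10 - 7))  # correct upset pick: 2 + 3 = 5
pick_score(FALSE, FALSE, abs(10 - 7))  # correct favorite pick: 2
pick_score(TRUE,  FALSE, abs(10 - 7))  # wrong pick: 0
```

The function is vectorized, so it can also score a whole set of picks at once and the results can be added up with `sum()`.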

#### Pick the Favorites

One prediction strategy would be to simply pick the higher seed (the favorite, which is the team with the smaller seed number). In the 2019 tournament, 20 out of the 32 matches were won by the higher seed. So you would score 2 × 20 = 40 points using this “pick the favorites” method. Note that you wouldn’t get any bonus points since you are always choosing the favorite to win.

#### Using Other Information

Obviously one can use many types of information in devising an alternative prediction strategy. What if we simply use historical data on contests between various seeds to inform our choices? Using data from 1985 to the current day, I collected all of the results of the first-round matches of March Madness. This table shows the matchup of seeds, the number of times the higher seed won (H), the number of games won by the lower seed (L), the difference in seeds (D), and the proportion of games won by the higher seed (P).

#### A Random Prediction Strategy

Using the historical data, here is a simple prediction strategy. Suppose a 6 seed plays an 11 seed. We see that the chance that the higher seed wins is 0.629. I generate a random number between 0 and 1. If the number is smaller than 0.629, I predict that the higher seed (6) will win; otherwise I predict that the lower seed (11) will win. I use this method for all 32 games using the probabilities P: by generating 32 uniform numbers, I get selections for all of the games.
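For a single game, the rule above might be sketched in R as follows. The variable names and the seed value are arbitrary choices for illustration; only the probability 0.629 comes from the historical data.

```r
set.seed(2019)  # arbitrary seed so the example is reproducible

p_higher <- 0.629  # historical chance that the 6 seed beats the 11 seed
u <- runif(1)      # one uniform random number between 0 and 1

# Pick the higher seed when the random number falls below p_higher
pick <- if (u < p_higher) "6 seed" else "11 seed"
pick

# Over many repetitions, the higher seed is picked about 62.9% of the time
long_run <- mean(runif(10000) < p_higher)
long_run
```

The last two lines are a sanity check: the fraction of simulated picks that go to the higher seed should be close to 0.629.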

One attractive feature of this method is that it will generally predict some upsets, and each correct upset pick earns bonus points, which could lead to a higher score. I am not sure if I will predict more winners, but the bonus points will likely give me an edge over the Pick the Favorites strategy.
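One rough way to see where the edge might come from is to compare expected points for a single game. The sketch below assumes that the higher seed truly wins with probability `p` and that the random pick is independent of the outcome; both are simplifying assumptions, and `expected_points()` is a hypothetical helper name.

```r
# Expected points for one game under the two strategies.
#   p: probability that the higher seed wins (assumed to be the true value)
#   d: difference in seeds
expected_points <- function(p, d) {
  # A random pick is correct with probability p^2 + (1 - p)^2;
  # the bonus d is earned only when the lower seed is picked AND wins.
  random_pick <- 2 * (p^2 + (1 - p)^2) + d * (1 - p)^2
  # Always picking the higher seed earns 2 points with probability p.
  favorite <- 2 * p
  c(random_pick = random_pick, favorite = favorite)
}

round(expected_points(p = 0.629, d = 11 - 6), 2)
# random_pick is about 1.75, favorite is about 1.26
```

Under these assumptions the random pick comes out ahead for the 6 versus 11 matchup. The comparison can go the other way when `p` is very close to 1, so the overall edge depends on the mix of matchups.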

#### 1000 Predictions

Using R, it is pretty easy to write a short function one_pick() to make my predictions. The function generates 32 uniforms using runif() and compares these with the historical win probabilities. It then computes the number of correct picks using the results from the 2019 contest. My score is twice the number of correct picks plus the bonus points for predicting upsets.
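The original function is not reproduced here, so the following is only one way such a function might look. The argument names are hypothetical, and the toy inputs are made up for illustration; they are not the real historical table or the 2019 results (only 0.629 appears in the text above).

```r
# Sketch of a one_pick()-style function. Hypothetical arguments:
#   p          - historical probability that the higher seed wins each game
#   d          - seed difference for each game
#   higher_won - TRUE where the higher seed actually won
one_pick <- function(p, d, higher_won) {
  pick_higher <- runif(length(p)) < p       # random pick for every game
  correct <- pick_higher == higher_won      # did the pick match the result?
  bonus <- sum(d[correct & !pick_higher])   # seed difference for correct upsets
  2 * sum(correct) + bonus
}

# Toy inputs for illustration only (three games instead of 32)
p_toy   <- c(0.99, 0.629, 0.50)
d_toy   <- c(15, 5, 1)
won_toy <- c(TRUE, FALSE, TRUE)

set.seed(1)  # arbitrary seed for reproducibility
one_pick(p_toy, d_toy, won_toy)
```

With the real table, `p` and `d` would each have 32 entries, one per first-round game, and `higher_won` would hold the 2019 outcomes.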

I am interested in exploring how this random strategy will perform over many predictions. I use the replicate() function to run this simulation for 1000 trials and I construct a histogram of the number of points scored in these 1000 trials. I show the number of points from the Pick the Favorites strategy using a red vertical line.
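A self-contained sketch of this simulation step is shown below. Every input is a toy placeholder (the same probability and seed difference for all 32 games, and simulated "actual" outcomes), so the resulting histogram illustrates the mechanics only and will not match the numbers reported in this post.

```r
set.seed(2019)  # arbitrary seed for reproducibility

# Placeholder inputs, NOT the real historical table or 2019 results
n_games <- 32
p <- rep(0.629, n_games)            # placeholder win probabilities
d <- rep(5, n_games)                # placeholder seed differences
higher_won <- runif(n_games) < p    # placeholder "actual" outcomes

# One random set of 32 picks, scored against the fixed outcomes
one_pick <- function() {
  pick_higher <- runif(n_games) < p
  correct <- pick_higher == higher_won
  2 * sum(correct) + sum(d[correct & !pick_higher])
}

scores <- replicate(1000, one_pick())    # 1000 simulated contest entries
favorites_score <- 2 * sum(higher_won)   # always picking the higher seed

hist(scores, main = "Scores from 1000 random sets of picks", xlab = "Points")
abline(v = favorites_score, col = "red", lwd = 2)  # Pick the Favorites score
summary(scores)
```

Because the game outcomes are held fixed while the picks vary, the spread of the histogram reflects only the randomness in the picks, which matches the setup described above.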

What have we learned?

- This method tends to be substantially better than the Pick the Favorites strategy. On average, I get about 60 points, which is 20 points higher than the 40 points that I got using the naive strategy.
- But there is a lot of variability in the performance of my prediction method — I can score between 30 and 100 points, although most scores fall between 50 and 70.
- Other tournaments? It would be interesting to apply my method to other seasons, but I suspect that we would see similar results.

#### Going Forward

- Sorry that I made a basketball post, but I think this post will be helpful for my ACTION students (or any other people) who are working on a March Madness project.
- Moving forward, I am thinking of working on a set of “Getting Started with R” posts where I will illustrate some base R work using baseball data. I know that many baseball fans struggle with the first steps of using R and hopefully this material will provide some help and encouragement to these folks.
