What is the secret to winning your NCAA March Madness bracket pool? There isn’t a surefire way, but you can get pretty close using machine learning to give you the edge.

For years I have filled out a bracket for March Madness, with little success. I have tried countless strategies: picking upsets, picking favorites, following so-called “expert” picks. None of these methods led me to success in my bracket pools.

A couple of years ago I decided to try something different: using data from prior NCAA games to build a machine learning model that predicts which team will win each game. My fortunes turned almost immediately. I started contending in, and even winning, many of my tournament pools. Creating a model to predict every game may seem complicated, but it can be much easier than you think. In this blog, I’ll cover all the steps needed to finish first in your March Madness bracket pool next year.

What is March Madness and How is it Typically Ranked?

March Madness refers to the men’s and women’s NCAA basketball tournaments that occur over a few weeks every March. Sixty-eight teams make the single-elimination tournament; one bad game and your hopes of a championship vanish. Teams are seeded based on their play during the regular season, but upsets happen quite frequently, giving the tournament the name March Madness and making it infuriatingly difficult to predict. That certainly does not stop people from trying. Every year, millions of people fill out their own bracket, attempting to predict all 63 games of the main bracket, and enter their selections in bracket pools for a chance to earn ultimate bragging rights.

Methods for filling out your bracket typically range from relying on experience watching many of the teams play throughout the year to going with your “hunch” or choosing which school’s colors you like best. There isn’t any one standard way of doing it. What makes March Madness so great is that anyone can enter and have a chance at winning their pool. But what if there were a more reliable way? What if you could make your picks by knowing the most important factors in predicting who will win, and decide those factors based on data science? Historical data and machine learning can help you do just that and perhaps even predict a big upset.

What is Machine Learning and What Makes March Madness a Good Use Case for It?

Machine learning, at its simplest, is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. There are several different types of machine learning, but the most common type is supervised learning, which essentially means predicting a target based on labeled data. The two main types of supervised learning are classification and regression:

  • A classification model puts data into categories or classes based on what it learns from historical data and is typically used to answer yes/no questions.
  • A regression model predicts a continuous numeric value by finding the best fit between predictor values and target values. It is typically used to measure predictor strength, forecast over time, or explore a cause-and-effect relationship.

Real-world examples of classification include predicting whether a customer will leave a subscription service or whether a piece of machinery needs maintenance. With enough historical data, a machine learning model can often outperform the previous methods for predicting these events. Knowing that there is a plethora of college basketball data readily available, I decided that predicting the NCAA Tournament would be a great use case for machine learning.

In the case of predicting individual games for the NCAA Tournament, I used a classification model because I want to predict a categorical outcome: which team will win, Team A or Team B?

How Do You Use Machine Learning to Predict the NCAA Tournament?

The first step when beginning a machine learning project is selecting the data to be used. In this case I used a dataset provided by the data science website Kaggle.com, which contains game-by-game data for every college basketball game since 2003. The dataset includes the standard statistics found in a box score for both the winning team and the losing team. Other datasets could work for this project as well, but I have found game-by-game data to be relatively simple to use while still leading to a very accurate machine learning model.

From those basic measures, I can derive more advanced stats as well. For example, I can use the standard shooting statistics to create True Shooting Percentage, which is a more accurate way of measuring shooting efficiency. For this example, I used a combination of standard and advanced statistics for each team. Once I had statistics for each game in the dataset, I calculated each team’s average for each statistic over the previous 14 games. This provides a snapshot of each team’s current form going into every game, which I will also use for the predictions for the NCAA Tournament games.
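As a rough illustration of these two steps, the sketch below computes True Shooting Percentage with the commonly used formula PTS / (2 × (FGA + 0.44 × FTA)) and then averages a statistic over a team’s previous games. The function names and the sample box-score lines are hypothetical and are not taken from the Kaggle dataset. Excluding the current game from its own average is an assumption about how the snapshot should be built, so that each feature only uses information available before tip-off.

```python
from collections import deque


def true_shooting_pct(points, fga, fta):
    """Points per shooting possession, counting free throws (0.44 is the usual weight)."""
    attempts = 2 * (fga + 0.44 * fta)
    return points / attempts if attempts else 0.0


def rolling_form(values, window=14):
    """Average of the previous `window` games for each game in `values`.

    `values` must be in chronological order for a single team. The current
    game is excluded from its own average; the first game has no history,
    so its entry is None.
    """
    recent = deque(maxlen=window)  # automatically discards games older than the window
    form = []
    for value in values:
        form.append(sum(recent) / len(recent) if recent else None)
        recent.append(value)
    return form


# Hypothetical box-score lines: (points, field goal attempts, free throw attempts)
box_scores = [(80, 60, 20), (65, 58, 10), (72, 55, 25)]
ts_by_game = [true_shooting_pct(*line) for line in box_scores]
print(rolling_form(ts_by_game, window=14))
```

The same `rolling_form` helper can be applied to any per-game statistic, one team at a time.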

In addition to using statistics for each team, I also created an Elo rating for each team. Elo is a method for calculating the relative skill levels of players in zero-sum games; the rating is updated on a game-by-game basis and shows how strong a team is at the time of each game. It considers:

  • How much a team won or lost by,
  • Where the game was played, and
  • What was the opponent’s rating going into each game.
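The three inputs listed above can be combined in a single update step. The sketch below uses the standard Elo expected-score curve; the K-factor, the home-court bonus, and the logarithmic margin-of-victory multiplier are illustrative assumptions, not the values used for the model described in this post.

```python
import math


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner, loser, margin, winner_home=None, k=20.0, home_edge=100.0):
    """Return the new (winner, loser) ratings after one game.

    winner_home: True if the winner played at home, False if the loser did,
    None for a neutral court. k, home_edge, and the margin multiplier are
    assumed values chosen only for illustration.
    """
    if winner_home is None:
        adjustment = 0.0
    else:
        adjustment = home_edge if winner_home else -home_edge
    # Opponent rating and venue both feed the pre-game win expectation.
    expected_win = expected_score(winner + adjustment, loser)
    # Bigger wins move ratings more, with diminishing returns.
    multiplier = math.log(abs(margin) + 1)
    delta = k * multiplier * (1.0 - expected_win)
    return winner + delta, loser - delta


# Two evenly rated teams on a neutral court, decided by 10 points.
print(update_elo(1500.0, 1500.0, margin=10))
```

Because the winner gains exactly what the loser gives up, the total rating in the system stays constant, and an upset or a road win moves the ratings more than an expected home win.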

Combined with the raw statistics, I now have a very good overall view of each team. With this dataset, I can narrow down the features that I will use for my machine learning model. Having too many features in a model can lead to overfitting, where the model is so specific to the training dataset that it does not perform well on new data. One rule of thumb is to avoid features that are too highly correlated, because the model may give too much weight to the information they share. For example, you would not want all three of offensive rebounds, defensive rebounds, and total rebounds. In this case I used offensive and defensive rebounds and that way, in effect, I also have total rebounds. With the dataset ready, it is time to start building a model.
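One simple way to apply the correlation rule of thumb is to keep features in order and skip any feature that is strongly correlated with one already kept. The sketch below does this with a plain Pearson correlation; the 0.9 threshold and the sample rebound numbers are made up for illustration, and this is only one of several possible ways to prune features.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Return feature names to keep, skipping any too correlated with a kept one."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up per-game values for one team.
features = {
    "off_reb": [10, 11, 10, 12, 11],
    "def_reb": [20, 25, 30, 22, 28],
    "total_reb": [30, 36, 40, 34, 39],
}
print(drop_correlated(features))  # ['off_reb', 'def_reb']
```

Here total rebounds tracks defensive rebounds almost perfectly, so it is dropped, while offensive and defensive rebounds are kept.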

How Do You Build a Machine Learning Model for March Madness?

Again, since my goal is to predict which team will win each game (Team A or Team B), I will build a classification model.

You may be wondering what purpose games from 2003 serve in predicting games from 2021. Well, this is where training your model comes into play. You send as much data as possible through your machine learning model to show the algorithm how your features can be used to predict who will win. This can be done because, in this case, I already know the result of each game. Take for example a game between Duke and North Carolina in 2003. Duke won, but why did they win? Perhaps they had better stats leading into that game and had a better team rating. Do this over and over many thousands of times, and you will start to get a pretty good idea of which team attributes contribute most to winning a game over another team.

After you have a trained model, you can test it on a subset of your data that the model did not see during training. This will allow you to experiment with different models, feature sets, and small variations of each model to see which performs best. Each variation can be scored with an appropriate evaluation metric until the best-performing model is selected.
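The train-then-test loop can be sketched in a few lines. The example below is a stand-in, not the model from this post: it fits a small logistic regression with plain gradient descent on synthetic games, holds out the last 20% for testing, and scores the result with log loss, one common metric for win probabilities. The two features, the synthetic data generator, and the learning settings are all assumptions; in practice a library such as scikit-learn would replace the hand-written training code.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(model, x):
    """Predicted probability that Team A wins, given one feature row."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logistic(rows, labels, lr=0.1, epochs=500):
    """Fit logistic regression weights with batch gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            error = predict_proba((weights, bias), x) - y
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def log_loss(model, rows, labels, eps=1e-12):
    """Average negative log-likelihood; lower is better, ln(2) is a coin flip."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = min(max(predict_proba(model, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


# Synthetic games: (rating difference in hundreds of points, scaled shooting
# difference), both as Team A minus Team B. Label 1 means Team A won.
rng = random.Random(42)
games = []
for _ in range(400):
    rating_diff, shooting_diff = rng.uniform(-3, 3), rng.uniform(-1, 1)
    p_win = sigmoid(1.2 * rating_diff + 0.5 * shooting_diff)
    games.append(((rating_diff, shooting_diff), 1 if rng.random() < p_win else 0))

split = int(0.8 * len(games))  # first 80% to train, last 20% held out
train, test = games[:split], games[split:]
train_rows, train_labels = zip(*train)
test_rows, test_labels = zip(*test)

model = train_logistic(train_rows, train_labels)
print("held-out log loss:", round(log_loss(model, test_rows, test_labels), 3))
```

Comparing this held-out score across different models or feature sets is what lets you pick the best-performing variation rather than the one that merely memorized the training games.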

There are many platforms to choose from to build your machine learning model, but I chose to use Python because of its large ecosystem of machine learning libraries as well as its flexibility in creating new features. Those less familiar with programming and machine learning may consider using an AutoML tool such as DataRobot or an open-source, low-code tool such as PyCaret (or caret in R).

Putting Your March Madness Machine Learning Model to the Test

Once the model is built, you can create a dataset to run through your model to get the results of each possible matchup for all 68 tournament teams. Of course, this dataset is different from the training and test datasets, as it will not have a target (win/loss) column. For example, one row of the data would have both Oklahoma and Gonzaga, with each team’s stats and rating. The model will then predict, given both teams’ statistics, which team is more likely to win that game. In this case, my model predicted that Gonzaga would win with a 78% probability. In the actual game, Gonzaga won by 16 points, so not bad!
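Building that prediction dataset amounts to pairing every tournament team with every other team. A minimal sketch, assuming each team’s latest stats and rating are already held in a dictionary: the team names, feature names, and values below are placeholders rather than real statistics.

```python
from itertools import combinations


def matchup_rows(team_features):
    """Build one unlabeled row per possible pairing of tournament teams."""
    rows = []
    for team_a, team_b in combinations(team_features, 2):
        row = {"team_a": team_a, "team_b": team_b}
        # Carry both teams' features side by side; there is no win/loss column.
        row.update({f"a_{name}": value for name, value in team_features[team_a].items()})
        row.update({f"b_{name}": value for name, value in team_features[team_b].items()})
        rows.append(row)
    return rows


# Placeholder field of 68 teams with made-up ratings and shooting numbers.
field = {
    f"Team {i}": {"elo": 1500.0 + i, "true_shooting": 0.50 + i / 1000}
    for i in range(1, 69)
}
rows = matchup_rows(field)
print(len(rows))  # 2278 possible pairings of 68 teams
print(rows[0])
```

Each row can then be passed to the trained model to get a win probability for Team A, and the bracket is filled in by looking up the pairing that actually occurs in each round.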

The machine learning model I created predicted three of the four Final Four teams (Gonzaga, Baylor, and Houston), including both teams in the national championship game. Unfortunately, the model (incorrectly) predicted Gonzaga would beat Baylor in the national championship game, with a 62% probability. I have used different drafts of this machine learning model for the NCAA Tournament since 2018, and this is the first tournament in which it did not correctly predict the national champion. This year my model finished 5th in the Analytics8 office pool; it won outright in 2019 and has finished at least in the 93rd percentile in ESPN’s tournament challenge every year.

One of the lessons learned is that, to win your pools, you do not have to get every upset correct; a perfect bracket is unnecessary and nearly impossible. What is more important is to pick the correct teams to go deep in the tournament. Identifying those teams is no easy task, but relying on data and machine learning as I have shown here can give you a much higher probability of achieving that goal.

While it is fun to pick each game based on gut feeling or which jersey color is best, removing our biases and letting past data drive predictions gives you a better chance of high accuracy. This approach is certainly not going to predict every upset; it predicts what is likeliest to happen. Applied over several years, it can produce clearly better results than conventional methods.

This same methodology applies to countless use cases. Because machine learning automates the work of identifying patterns, there are certain use cases that simply cannot be predicted as well using conventional methods. As I have shown here, machine learning is not something reserved for mathematicians. With some basic technical experience, you can begin building your own machine learning models and see for yourself how powerful the results can be.

Eric Morrell is a consultant based out of our Chicago office. He specializes in analytics projects using a wide variety of tools to help companies get value out of their data. Outside of work he enjoys golfing and using machine learning to help win his fantasy sports leagues.