A team led by Andreas Groll at the Technical University of Dortmund in Germany has combined machine learning and statistical analysis to identify the side they consider most likely to win the 2018 World Cup.
The team simulated the soccer tournament 100,000 times and compared three different modeling approaches, judging each on how well it predicted all matches from the four previous World Cups, 2002 to 2014.
Their paper, published this week, details the technique they used, known as the random-forest approach, a method for analyzing large data sets and predicting future events from them. The method builds on the decision tree: to make a forecast, a decision tree estimates a potential outcome at each branch by reference to a set of training data.
Most decision trees, however, become unreliable in the later stages of the process, where decisions rest on sparse and scattered training data and are distorted as a result, a condition called overfitting. The random-forest process avoids this issue by building the tree many times over, each time from a different set of randomly selected branches.
The outcome produced by this method is an average over these many random decision trees. It therefore sidesteps overfitting while also revealing which factors mattered most in producing the result. Groll and his team used this approach so that as many factors as possible that might determine the outcome of a game could be included.
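The averaging idea described above can be illustrated with a minimal sketch. This is not the researchers' model: the toy data, the feature meanings, and the function names (`build_tree`, `train_forest`, `predict_forest`) are all invented for illustration. It assumes a simplified random forest in which each small regression tree is grown on a bootstrap resample of the training rows and considers only a random subset of features at each split, and the forest's forecast is the average of the trees' forecasts.

```python
import random
from statistics import mean

# Invented toy data, NOT taken from the paper. Each row is
# [ranking-points difference, average-age difference]; the target is goals scored.
ROWS = [[-30, 0.5], [-20, -1.0], [-10, 1.2], [-5, 0.0],
        [5, -0.4], [10, 0.8], [20, -1.5], [35, 0.3]]
GOALS = [0, 0, 1, 1, 1, 2, 2, 3]


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, n_candidates, depth, rng):
    """Grow a small regression tree; each split looks at a random feature subset."""
    if depth == 0 or len(set(targets)) == 1:
        return mean(targets)  # leaf: average outcome of the rows that ended up here
    best = None
    for f in rng.sample(range(len(rows[0])), n_candidates):
        # Dropping the largest value guarantees both sides of the split are non-empty.
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    if best is None:  # the sampled features offered no usable split
        return mean(targets)
    _, f, threshold = best
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in left], [t for _, t in left],
                       n_candidates, depth - 1, rng),
            build_tree([r for r, _ in right], [t for _, t in right],
                       n_candidates, depth - 1, rng))


def predict_tree(node, row):
    """Follow the branches until a leaf (a plain number) is reached."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def train_forest(rows, targets, n_trees, n_candidates, max_depth, seed):
    """Grow each tree on a bootstrap resample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(n), k=n)
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx],
                                 n_candidates, max_depth, rng))
    return forest


def predict_forest(forest, row):
    """The forest's forecast is the average of the individual trees' forecasts."""
    return mean([predict_tree(tree, row) for tree in forest])
```

For example, `predict_forest(train_forest(ROWS, GOALS, n_trees=25, n_candidates=1, max_depth=3, seed=0), [40, 1.0])` returns an averaged goal estimate for a hypothetical match. Because every tree sees slightly different data and features, no single noisy branch dominates the result.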
The researchers' model included obvious factors such as FIFA’s rankings and relevant team statistics, including average age and number of Champions League players. However, the model also went so far as to include less directly related factors such as the countries' populations and GDP, and even the coaches' nationalities.
Best-performing prediction methods
The team then identified the best-performing prediction methods and combined them in order to "improve the predictive power substantially." The paper states: "Finally, this combination of methods is chosen as the final model and based on its estimates, the FIFA World Cup 2018 is simulated repeatedly and winning probabilities are obtained for all teams."
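The repeated-simulation step can be sketched as a simple Monte Carlo loop. Everything here is an assumption made for illustration: the eight team labels, their ability scores, the single-elimination bracket, and the logistic rule in `win_probability` are invented and are not the paper's model or data. The sketch only shows how playing a tournament many times turns per-match probabilities into an overall winning probability for each team.

```python
import math
import random
from collections import Counter

# Hypothetical ability scores, invented for this sketch (higher is stronger).
ABILITY = {"Team A": 1.8, "Team B": 1.5, "Team C": 1.2, "Team D": 1.0,
           "Team E": 0.9, "Team F": 0.7, "Team G": 0.5, "Team H": 0.3}


def win_probability(a, b):
    """Assumed logistic rule: chance that team a beats team b."""
    return 1.0 / (1.0 + math.exp(-(ABILITY[a] - ABILITY[b])))


def simulate_knockout(teams, rng):
    """Play one single-elimination bracket and return the champion."""
    remaining = list(teams)
    while len(remaining) > 1:
        next_round = []
        # Neighbouring teams in the list meet each other in every round.
        for i in range(0, len(remaining), 2):
            a, b = remaining[i], remaining[i + 1]
            next_round.append(a if rng.random() < win_probability(a, b) else b)
        remaining = next_round
    return remaining[0]


def winning_probabilities(teams, n_runs, seed):
    """Estimate each team's title chance as its share of simulated titles."""
    rng = random.Random(seed)
    titles = Counter(simulate_knockout(teams, rng) for _ in range(n_runs))
    return {team: titles[team] / n_runs for team in teams}
```

Calling `winning_probabilities(list(ABILITY), n_runs=100_000, seed=1)` mirrors the 100,000 runs mentioned in the article. Counting how often a team survives each round, rather than only who wins, would give stage-by-stage survival probabilities in the same way.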
The process ultimately picked Spain as the most likely winner, with a 17.8% probability of success and a 73% chance of reaching the quarter-finals. However, the researchers added that if Germany were to clear the group phase of the competition, its chances of reaching the quarter-finals would increase to 58%.
"The model slightly favors Spain before the defending champion Germany. Additionally, we provide survival probabilities for all teams and at all tournament stages as well as the most probable tournament outcome," concluded the paper.
If the predictions hold up at the 2018 World Cup, the study could open up a whole new industry for machine learning to conquer. The new method may even make bookmakers obsolete.