Introduction
The 2023 Rugby World Cup is finally here! Hosted by France, this year’s tournament sees current champions South Africa looking to defend their title. The current bookmakers’ favourites to win the tournament are New Zealand, France, South Africa and Ireland.
This blog explores using probabilistic simulation to understand which teams are likely to win the pool stages and progress to the knock-out rounds. Given the sparsity of available information, we do this by modelling the win-rate (i.e. fixture win percentage) of a given side and using random sampling to simulate fixture outcomes.
Interactive Dashboard available here
Code available here
Why use Simulation?
Simulation can be a powerful tool when working with sparse data. Unlike club level rugby, where leagues run each season with sides playing each other regularly, international rugby matches are played less frequently and with a notable divide between the ‘Tier 1’ and ‘Tier 2’ nations.
The ‘Tier 1’ nations tend to be large national sides which play every year in prestigious competitions such as The Rugby Championship and The Six Nations. The ‘Tier 2’ sides play infrequently, in some cases with semi-professional squads, and rarely have an opportunity to play against ‘Tier 1’ sides.
The Rugby World Cup is played in two stages, firstly pools where teams are assigned by seeding and then a knock-out competition where the top two sides from each pool enter the Quarter-Finals. Outside of the Rugby World Cup, ‘Tier 2’ sides rarely get to play international test fixtures against ‘Tier 1’ sides.
Because of this, predicting likely outcomes in the tournament is difficult: theoretically anyone can win it, but we have very few observations from past matches between sides.
Hierarchical Bayesian Models
Bayesian statistics is a branch of statistics which combines prior beliefs, or information, with observed data to make inference over parameters of interest. It differs from traditional frequentist statistics by updating prior beliefs based on evidence. This makes it well suited to statistical tasks which seek to quantify beliefs or quantify uncertainty in data.
A key feature of Bayesian probability is that parameters of interest are treated as random variables. Thus, they have a distribution which we can draw samples from. The updated beliefs, known as the posterior distribution, can be characterised as:
p(θ|y) ∝ p(y|θ)p(θ)
Where θ represents our parameter of interest and y represents our observations. It is often not feasible to calculate this directly, hence we typically use an approach such as Markov Chain Monte Carlo (MCMC). This approach uses random sampling to iteratively adjust our belief on the shape of p(θ|y) over many samples. Samples taken from the converged distribution give us a direct sample from what we approximate as the posterior distribution.
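To make the idea of MCMC a little more concrete, here is a minimal sketch of a random-walk Metropolis sampler for a single win-rate θ. It is not the sampler used in this project (PyMC handles that for us); the Beta(2, 2) prior, the record of 7 wins from 10 fixtures, the step size and the function names are all assumptions chosen purely for illustration.

```python
import math
import random


def log_posterior(theta, wins, games, a=2.0, b=2.0):
    """Unnormalised log p(theta | y) = log p(y | theta) + log p(theta)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    log_likelihood = wins * math.log(theta) + (games - wins) * math.log(1.0 - theta)
    log_prior = (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)
    return log_likelihood + log_prior


def metropolis(wins, games, n_samples=20000, burn_in=2000, step=0.25, seed=42):
    """Random-walk Metropolis: propose a nearby theta, accept with probability min(1, ratio)."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, wins, games)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, wins, games)
        # Always accept a more probable proposal; otherwise accept with
        # probability equal to the ratio of posterior densities
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:  # discard the early, unconverged draws
            samples.append(theta)
    return samples


# Hypothetical record: 7 wins from 10 fixtures
samples = metropolis(wins=7, games=10)
posterior_mean = sum(samples) / len(samples)
print(f"Estimated posterior mean win-rate: {posterior_mean:.3f}")
```

With a Beta prior and binary outcomes the exact posterior is available in closed form, here Beta(9, 5) with mean 9/14, so the sampled mean can be checked against it. For the hierarchical model no such shortcut exists, which is why sampling is needed.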
A hierarchical Bayesian model has multiple levels: instead of fixing the parameters of the prior as constants, the model treats them as shared parameters with priors of their own. Each lower-level parameter in the hierarchy is drawn from a distribution governed by a shared population parameter.
Mathematically, we can characterise this as:
p(ρ,θ|y) ∝ p(ρ)p(θ|ρ)p(y|θ)
Where ρ represents the common population parameter and θ represents a vector of lower level parameters.
A benefit of using hierarchical models is that inter-dependencies and relative differences between lower levels in the hierarchy are learned directly. As well as this, estimates for one of the lower-level parameters simultaneously update our beliefs about the other members of the lower level. It’s because of this that sports data is particularly well suited to hierarchical models. In the context of modelling the win-rate, the higher level of the hierarchy is an assumed population average win-rate and the lower level is the individual distribution of each team’s win-rate.
One drawback of hierarchical models (within the context of sports analytics) is the problem of shrinkage. This is where teams who are at the extreme (i.e. top or bottom performance) tend to have their estimates shrunk towards the mean. This is a particular problem in this exercise since we have fewer observations for ‘Tier 2’ sides, and in many cases the model appears to overestimate the performance of these sides.
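Shrinkage is easiest to see with a small worked example. The sketch below fixes the population-level Beta parameters, which is a simplification (in the full hierarchical model they are learned from the data), and computes each side’s posterior mean win-rate. The team labels and records are made up for illustration.

```python
# Hypothetical population-level prior: Beta(4, 4), i.e. an average win-rate of 0.5
ALPHA, BETA = 4.0, 4.0

# Made-up records: (wins, fixtures played)
records = {
    "well_observed_strong_side": (16, 20),
    "rarely_observed_strong_side": (4, 5),
    "rarely_observed_weak_side": (1, 5),
}


def posterior_mean(wins, games, alpha=ALPHA, beta=BETA):
    """Mean of the Beta(alpha + wins, beta + losses) posterior."""
    return (alpha + wins) / (alpha + beta + games)


for team, (wins, games) in records.items():
    raw = wins / games
    shrunk = posterior_mean(wins, games)
    print(f"{team}: raw={raw:.2f} shrunk={shrunk:.2f}")
```

The two strong sides have the same raw win-rate of 0.80, but the side with only five fixtures is pulled much further towards the population mean. The weak side with few fixtures is pulled upwards, which is the overestimation effect described above.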
Ultimately, we are using Bayesian analysis here for uncertainty quantification. The more observations we have, the more certain we can be about the true underlying distribution of our random variables. Unfortunately, for sides with fewer observations we are more uncertain about performance. This is where simulation helps to stabilise our model – over many simulations on average the high performing sides consistently win a higher percentage of the fixtures leading to a more reasonable estimate of tournament performance.
That said, this is also the beauty of sport – in theory, anything can happen!
Project Workflow
The workflow for this project can be described as below:
Here we combine two models. Firstly, we estimate the team level win-rate in order to generate a large sample from the posterior distributions. Secondly, we use these samples to iteratively simulate through the tournament. Since we do not have information available on England versus Chile, for example, we instead use the win-rates for each side to simulate the fixture winner. We do this by taking a sample from the distribution for England’s win-rate, and taking a sample from Chile’s win-rate. We then compare the sampled win rates and allocate the winner as the side with the higher win-rate. Put simply, it’s a bit like statistical Top Trumps!
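The ‘Top Trumps’ comparison can be sketched in a few lines of Python. This is an illustration rather than the project code: the function name, the coin-flip tie-break and the Beta distributions standing in for the posterior samples are all assumptions.

```python
import random


def simulate_fixture(team_a, team_b, posterior_samples, rng):
    """Draw one win-rate sample per side; the side with the higher draw wins."""
    draw_a = rng.choice(posterior_samples[team_a])
    draw_b = rng.choice(posterior_samples[team_b])
    if draw_a == draw_b:
        # Break exact ties with a coin flip (an assumption for this sketch)
        return rng.choice([team_a, team_b])
    return team_a if draw_a > draw_b else team_b


rng = random.Random(2023)

# Stand-in posterior samples with made-up parameters; in the project
# these would come from the fitted win-rate model
posterior_samples = {
    "England": [rng.betavariate(12, 8) for _ in range(5000)],
    "Chile": [rng.betavariate(3, 7) for _ in range(5000)],
}

n_sims = 10000
england_wins = sum(
    simulate_fixture("England", "Chile", posterior_samples, rng) == "England"
    for _ in range(n_sims)
)
england_share = england_wins / n_sims
print(f"England win {england_share:.1%} of simulated fixtures")
```

Because each simulated fixture uses a fresh pair of draws, a side with a wide win-rate distribution will occasionally beat a stronger side, and how often that happens depends on how much the two distributions overlap.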
Gathering and Cleansing Data
Data was scraped from web pages and then parsed using the Python package Beautiful Soup (bs4). For ‘Tier 1’ sides we get fixture results for the last 18 months, whereas for ‘Tier 2’ sides we extend this to fixtures going back to the last Rugby World Cup in 2019.
Once data is parsed and cleansed (using regular expressions) we use a combination of the SAS DataStep and FedSQL to pre-process the data. Finally, we wrangle the data such that we have the win-rate from fixture outcomes for both ‘Tier 1’ and ‘Tier 2’ sides, where the win-rate is presented for the ‘home side’. Home advantage is not considered in the model as only the host nation has a true home advantage. Instead, we flip the home and away teams to add more information to the model. For example, below, the row for France v Italy has an equivalent row for Italy v France with a win-rate of 0.
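The home/away flip can be sketched as follows. The project does this step with the SAS DataStep and FedSQL, so this Python version is only an illustration of the idea; the column names are assumptions, and it assumes every fixture has a winner (no draws).

```python
def mirror_fixtures(fixtures):
    """Return the original rows plus a flipped copy of each one."""
    mirrored = []
    for row in fixtures:
        mirrored.append(row)
        mirrored.append({
            "team": row["opponent"],
            "opponent": row["team"],
            # A win for one side is a loss for the other (draws not handled)
            "win": 1 - row["win"],
        })
    return mirrored


# Illustrative row only: a fixture recorded as a win for France
fixtures = [{"team": "France", "opponent": "Italy", "win": 1}]
expanded = mirror_fixtures(fixtures)
for row in expanded:
    print(row)
```

Every fixture then contributes one observation to each side, which matters most for the sides that appear in very few rows to begin with.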
Building our Probability Model
The hierarchical Bayesian model is built using the Python library PyMC. The model prior for win-rate is a Beta random variable with hyperpriors for α and β. The prior is then used as the probability of a Bernoulli random variable with observations being the observed win-rate.
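One way to read this model structure is as a recipe for generating data. The sketch below runs that recipe forwards (a prior predictive simulation) in plain Python rather than PyMC, so it shows the shape of the model and not the fitting step. The shifted Exponential hyperpriors are an assumption, since the choice of hyperpriors is not stated here, and the function name is hypothetical.

```python
import random


def simulate_from_prior(n_teams, fixtures_per_team, rng):
    """Generate win-rates and fixture outcomes from the hierarchical model's priors."""
    # Hyperpriors (assumed): population-level shape parameters shared by all teams
    alpha = 1.0 + rng.expovariate(1.0)
    beta = 1.0 + rng.expovariate(1.0)
    teams = []
    for _ in range(n_teams):
        # Team-level prior: each win-rate is drawn from the shared Beta(alpha, beta)
        win_rate = rng.betavariate(alpha, beta)
        # Likelihood: each fixture is a Bernoulli trial with that win-rate
        outcomes = [1 if rng.random() < win_rate else 0 for _ in range(fixtures_per_team)]
        teams.append({"win_rate": win_rate, "outcomes": outcomes})
    return alpha, beta, teams


rng = random.Random(7)
alpha, beta, teams = simulate_from_prior(n_teams=20, fixtures_per_team=10, rng=rng)
print(f"alpha={alpha:.2f}, beta={beta:.2f}")
print("First team:", round(teams[0]["win_rate"], 2), teams[0]["outcomes"])
```

Fitting the model reverses this direction: given the observed outcomes, the sampler infers plausible values for each team’s win-rate and for the shared α and β.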
The posterior samples show that the Highest Density Interval (HDI) for the win-rate is wider for teams where we have fewer, or lower quality, observations. This is where the model may overestimate the performance of sides like Portugal. Looking at their distribution interval, there are situations where they have a higher win-rate than the bookmakers’ favourites to win.
Improving our Model
As the tournament progresses, we’ll have more information with which we can update our beliefs. I hope to re-run this model as we exit the pool stages. The tournament is approximately 2 months long, so re-running the simulations with more observations (and possibly trimming out some older observations) may lead to a more stable model.
Creating our Simulation Model
The simulation model itself is a humble Python function. For each iteration we simulate the pool stages, rank and select the top two sides, then begin the knock-out stages. For each iteration a simulated tournament is created, returned and appended to a master dataset.
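A stripped-down sketch of such a function is shown below. It is not the project code: the pools and bracket pairings follow the 2023 draw, but ranking pools by fixtures won (ignoring bonus points), the random tie-breaks and the identical Beta(5, 5) stand-in samples for every team are assumptions made to keep the example short.

```python
import random
from collections import Counter
from itertools import combinations

POOLS = {
    "A": ["New Zealand", "France", "Italy", "Uruguay", "Namibia"],
    "B": ["South Africa", "Ireland", "Scotland", "Tonga", "Romania"],
    "C": ["Wales", "Australia", "Fiji", "Georgia", "Portugal"],
    "D": ["England", "Japan", "Argentina", "Samoa", "Chile"],
}


def play(team_a, team_b, samples, rng):
    """Higher sampled win-rate wins; exact ties (vanishingly rare) go to team_a."""
    return team_a if rng.choice(samples[team_a]) >= rng.choice(samples[team_b]) else team_b


def simulate_tournament(samples, rng):
    """Simulate one tournament and return the stage-by-stage results."""
    result = {}
    for pool, teams in POOLS.items():
        wins = Counter({team: 0 for team in teams})
        for a, b in combinations(teams, 2):  # round robin within the pool
            wins[play(a, b, samples, rng)] += 1
        # Shuffle first so that sides with equal records are split at random
        shuffled = rng.sample(teams, len(teams))
        ranked = sorted(shuffled, key=lambda t: wins[t], reverse=True)
        result[f"winner_{pool}"], result[f"runner_up_{pool}"] = ranked[0], ranked[1]
    # Knock-out bracket: pool winners meet runners-up from the paired pool
    result["QF1"] = play(result["winner_C"], result["runner_up_D"], samples, rng)
    result["QF2"] = play(result["winner_B"], result["runner_up_A"], samples, rng)
    result["QF3"] = play(result["winner_D"], result["runner_up_C"], samples, rng)
    result["QF4"] = play(result["winner_A"], result["runner_up_B"], samples, rng)
    sf1_pair = (result["QF1"], result["QF2"])
    sf2_pair = (result["QF3"], result["QF4"])
    result["SF1"] = play(*sf1_pair, samples, rng)
    result["SF2"] = play(*sf2_pair, samples, rng)
    # The two losing semi-finalists play off for third place
    losers = [t for t in sf1_pair + sf2_pair if t not in (result["SF1"], result["SF2"])]
    result["third"] = play(losers[0], losers[1], samples, rng)
    result["champion"] = play(result["SF1"], result["SF2"], samples, rng)
    result["runner_up"] = result["SF2"] if result["champion"] == result["SF1"] else result["SF1"]
    return result


rng = random.Random(1)
# Placeholder samples: every team gets the same distribution, so no side is favoured
samples = {t: [rng.betavariate(5, 5) for _ in range(2000)] for teams in POOLS.values() for t in teams}
master = [simulate_tournament(samples, rng) for _ in range(1000)]
print(Counter(sim["champion"] for sim in master).most_common(3))
```

Swapping the placeholder samples for real posterior draws is the only change needed to make the output meaningful; the loop and bracket logic stay the same.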
Preparing Simulation Data for Visualization
The simulations are recorded in a long format. We firstly do some post-processing to create flags for winners at the pool and knock-out stages. This allows us to visualise the summary statistics on likely winners of each stage.
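As a rough illustration of this post-processing, the sketch below takes long-format records and works out how often each side wins a given stage. The row layout, the placeholder team names and the function name are assumptions, not the project’s actual schema.

```python
from collections import Counter

# Hypothetical long-format records: one row per simulation, stage and winning team
rows = [
    {"sim": 1, "stage": "Final", "team": "Team A"},
    {"sim": 2, "stage": "Final", "team": "Team A"},
    {"sim": 3, "stage": "Final", "team": "Team B"},
    {"sim": 4, "stage": "Final", "team": "Team A"},
    {"sim": 1, "stage": "Semi-Final 1", "team": "Team A"},
    {"sim": 2, "stage": "Semi-Final 1", "team": "Team C"},
]


def stage_shares(rows, stage):
    """Share of simulations in which each team won the given stage."""
    winners = Counter(row["team"] for row in rows if row["stage"] == stage)
    total = sum(winners.values())
    if total == 0:
        return {}  # no simulations recorded for this stage
    return {team: count / total for team, count in winners.most_common()}


print(stage_shares(rows, "Final"))
```

The resulting shares are the kind of figures a bar chart of likely winners would be built from.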
We also prepare the data for a Path Analysis. Visualising the pathways for teams via a Sankey Diagram is a really rich way of understanding the possible tournament outcomes for a given side. To do this we transpose the data then collapse it back into a long format with PROC TRANSPOSE and the DataStep.
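The data behind a Sankey Diagram boils down to counts of how often one stage outcome is followed by another. The project builds this with PROC TRANSPOSE and the DataStep; the Python sketch below only illustrates the idea, with made-up paths and hypothetical stage labels.

```python
from collections import Counter


def sankey_links(paths):
    """Count transitions between consecutive stages across all simulated paths."""
    links = Counter()
    for path in paths:
        for source, target in zip(path, path[1:]):
            links[(source, target)] += 1
    return links


# Made-up paths for one side across three simulated tournaments
paths = [
    ["Pool winner", "Won QF", "Won SF", "Won Final"],
    ["Pool winner", "Won QF", "Lost SF"],
    ["Pool runner-up", "Lost QF"],
]
links = sankey_links(paths)
for (source, target), count in links.items():
    print(f"{source} -> {target}: {count}")
```

Each (source, target) pair becomes one band in the diagram, with the count setting its width.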
Visualizing our Simulations
Finally, we’re able to bring our simulations to life using SAS Visual Analytics. The report is simple, using only the Path Analysis and Bar Chart visualizations. The report is then exported as a Report Package where it is then available as an interactive offline report. I then deployed this onto cloud infrastructure to make it available for others to access.
Analysis
So, can we use these simulations to learn anything about who is likely to win the tournament?
Likely Winners
The simulation model has France as winners ~70% of the time. New Zealand come in second with ~14% of wins. Other strong candidates include Ireland (~9.4%) and Argentina (~6%).
Likely Runner-Up
Whilst France are clear favourites in the model to win, second place is more tightly contested. New Zealand come out top with ~38%, followed by Ireland with ~26% of simulated outcomes.
Likely Third Placed
Surprisingly, Ireland (recently ranked #1 in the world) do not feature in the top 5 for coming 3rd. Argentina win bronze place in ~47% of simulations. Wales (~16%) and Georgia (~17%) are also highly ranked, however this may be where the model is falling victim to uncertainty. Wales have had a dip in performance recently (replacing Wayne Pivac with Warren Gatland less than a year out from the tournament start). Combined with their historically strong performance, this leads to a reasonably wide interval of values for win-rate. Likewise, Georgia (who are a very strong ‘Tier 2’ side) have been very impressive in recent months but still do not have many opportunities to play ‘Tier 1’ sides. This may lead to overconfidence in their win-rate as they do not often play the very top sides, but win many of the fixtures that they do play.
Likely Semi-Finalists
The first Semi-Final is played by the winners of Quarter-Final 1 and Quarter-Final 2. These are contested by the winners of Pools C and B and the runners-up of Pools D and A. The simulations have New Zealand (~37%), Ireland (~29%) and France (~18%) as the most likely winners of Semi-Final 1. This would suggest that the model expects the runner-up of Pool A to have a higher chance of winning Semi-Final 1 than the winner of Pool B, which contains South Africa and Ireland.
South Africa in particular have been surprisingly under-represented in the model. One reason for this is that Romania has been potentially over-represented as likely winner of Pool B. The runner-up of Pool D does not feature in the top 5 teams most likely to win the first Semi-Final.
The second Semi-Final is played by the winners of Quarter-Final 3 and Quarter-Final 4. These are contested by the winners of Pools D and A and the runners-up of Pools C and B. France (~70%) are overwhelming favourites to win Semi-Final 2, which is the scenario where France wins Pool A and plays the runner-up from Pool B. Argentina (~14%) and New Zealand (~10%) are the other sides likely to win the second Semi-Final. This is the scenario where Argentina wins Pool D or New Zealand wins Pool A.
Likely Quarter-Finalists
The further back into the tournament we go, the more we see an over-representation of the likely performance of ‘Tier 2’ sides. In Quarter-Final 1, the favourite side to win is Georgia. That said, this is not an impossible scenario. Australia have had a run of poor form and have taken a very inexperienced side to the World Cup, leaving stars like Quade Cooper and Michael Hooper behind. Wales have also had a run of poor form going into the World Cup. Ignoring Chile as an anomaly (we created a synthetic variable for Chile as we had no data), the four teams most likely to win Quarter-Final 1 are Georgia, Wales, England and Argentina. England are more likely to feature as runner-up of Pool D.
Quarter-Final 2 is played between the winner of Pool B and the runner-up of Pool A. The winner of this round is most likely to be New Zealand (~39%), Ireland (~31%) or France (~16%). This suggests that Ireland are most likely to win Pool B and, if they do, they stand a strong chance of winning this Quarter-Final.
Ireland have never gotten past the Quarter-Finals in a Rugby World Cup and the model suggests winning Pool B gives them the best chance of getting to the Semi-Finals. In the scenario where they finish as runner-up of Pool B they’re given a ~5% chance of reaching the Semi-Finals, given that they’ll have to play the winner of Pool A, which is likely to be France or New Zealand.
The most likely winner of Quarter-Final 3 is Argentina (~63%). This is where the winner of Pool D plays the runner-up of Pool C. Pool C is arguably a weaker pool based on recent form of Wales and Australia suggesting that the winner of Pool D has a very strong chance of making the Semi-Finals.
Finally, Quarter-Final 4 is dominated by the winner of Pool A with France (~72%) and New Zealand (~13%) strong favourites. This suggests that the runner-up of Pool B has a relatively low chance of making the Semi-Finals.
Pool Performance
The model suggests strong favourites to win each pool, with France (~81%), Ireland (~64%), Georgia (~48%) and Argentina (~87%) winning Pools A, B, C and D respectively. The predictions for Pools A and D seem sensible, though I would expect New Zealand to have a higher chance than this. However, Pools B and C both over-represent the performance of ‘Tier 2’ nations. I do think Georgia have a strong chance of doing well in the pools, and Georgia did beat Wales for the first time this year, but there is a high level of uncertainty in this pool. Pool B, on the other hand, vastly overestimates Romania and puts them ahead of South Africa.
There is less certainty in the model on who is likely to be runner-up in the pools. Aside from New Zealand (~60%) in Pool A, the most likely runner-up in every other pool is picked less than 50% of the time. It’s clear that Ireland are favourites to win Pool B, according to the model, with both South Africa and Scotland likely runners-up. Pool C is between Wales and Australia as runner-up and Pool D (again, ignoring Chile) has England (~36%) followed by Samoa (~21%) as runner-up.
Possible Surprises
I’ll be very interested to see how Georgia fare in the pools. Clearly, there is over-representation of ‘Tier 2’ sides, however Georgia have shown a lot of improvement in recent seasons and have had some big victories recently, including wins over Wales, USA and Italy.
Argentina will be strong favourites to win Pool D, and could be a surprise in the knock-out stages this year. The model has them as favourites to come third overall.
Ireland, despite being ranked world number one recently, could have a surprise exit at the Quarter-Finals if they don’t win Pool B given how stiff competition is in Pool A.
Final Predictions
Using the model, the predictions for First, Second and Third are:
Winner – France
Runner-up – New Zealand (though Ireland are a close second)
Third Place – Argentina
Conclusion
We’ve seen how a combination of probabilistic analysis and simulation can be used to set expectations for performance in the Rugby World Cup despite having only sparse observations readily available. The model has picked clear favourites in France, New Zealand and Ireland whilst unrealistically writing off South Africa. We’ve seen that Argentina can expect to do very well, but have less than a 5% chance of winning the tournament overall.
We’ve seen that the model, whilst imperfect, does appear to capture recent form of sides. The model suffers from a lack of information for ‘Tier 2’ sides, often leading to an overestimation of performance.
Once the pool stages are complete, I plan to re-run the model to see how it updates the expected results. I expect that, with many ‘Tier 2’ sides unlikely to win their pool, we will see more realistic estimates for South Africa.
If you enjoyed this…
Check out some of my other blogs which explore sports, simulation or Bayesian analysis:
Selecting the starting line up of the British & Irish Lions using optimization techniques: https://communities.sas.com/t5/SAS-Communities-Library/Using-SAS-Viya-to-Select-the-British-amp-Irish-Lions-Rugby-Team/ta-p/755754
Finding a sunken vessel using Bayesian Search Theory: https://communities.sas.com/t5/SAS-Communities-Library/Finding-a-Needle-in-a-Haystack-Bayesian-Search-Theory-with-SAS/ta-p/886204
Understanding the likelihood of children sharing the same name at school with Bayes Theorem: https://blogs.sas.com/content/hiddeninsights/2022/09/06/what-are-the-odds-will-your-child-have-the-same-name-as-their-classmates/
Did you know?
France rugby use SAS Viya to boost team performance with AI & Analytics? Read more here: https://www.sas.com/en_us/customers/french-rugby-federation.html