For this project I was only able to use open datasets readily available on the internet. Performance statistics were gathered at a player level for the Guinness Six Nations Championship for the last three years (2019-2021). The Six Nations is an annual international competition between the top Northern Hemisphere sides: England, Wales, Scotland, Ireland, France and Italy.
Most players selected for the Lions had played at least one Six Nations, though Sam Simmonds was a notable absentee since he last played for England in 2018 shortly before a serious ACL injury.
As well as international performance stats it was important to bring in domestic league performance. I gathered domestic league data for the current season for the English Gallagher Premiership and the United Rugby Championship (previously the Pro14). These two leagues cover teams from the four home nations, so almost all players were included. One notable exception is Finn Russell, who plays in the French Top 14.
I further augmented this data with some open datasets from sites like Wikipedia for individual player bios (height, weight, position, etc.).
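To make the combination step concrete, here is a minimal Python sketch of joining the three kinds of source (international stats, domestic league stats and player bios) on player name. It is only an illustration under stated assumptions and not the code used in the project: the player names, field names and values are all hypothetical placeholders. It also shows how a player whose league was not collected ends up with gaps in their record.

```python
# Hypothetical records keyed on player name; all field names and values
# are placeholders for illustration, not real statistics.
six_nations = {
    "Player A": {"six_nations_tackles": 40},
    "Player B": {"six_nations_tackles": 25},
}
domestic = {
    "Player A": {"league_metres": 310},
    # Player B plays in a league that was not collected, so has no entry.
}
bios = {
    "Player A": {"height_cm": 188, "position": "Centre"},
    "Player B": {"height_cm": 182, "position": "Fly-half"},
}


def combine_sources(squad, *sources):
    """Left-join each source onto the squad list, keyed on player name."""
    combined = {}
    for name in squad:
        row = {"player": name}
        for source in sources:
            # Players absent from a source simply contribute no fields.
            row.update(source.get(name, {}))
        combined[name] = row
    return combined


def missing_fields(combined, expected):
    """Report which expected fields each player is missing."""
    gaps = {}
    for name, row in combined.items():
        absent = [field for field in expected if field not in row]
        if absent:
            gaps[name] = absent
    return gaps


squad = ["Player A", "Player B"]
players = combine_sources(squad, six_nations, domestic, bios)
gaps = missing_fields(
    players, ["six_nations_tackles", "league_metres", "height_cm"]
)
print(gaps)
```

Keeping every squad member in the combined table, even when a source has nothing for them, makes the gaps visible rather than silently dropping the player.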
Working with public data has its limitations, and I feel these limitations do impact the model decisions to an extent. The limitations here are:
It is also worth noting that the cut of data that was taken from public sources was a static snapshot in May, before the end of the season. Therefore, there are one or two matches worth of data missing.
Rugby is also an incredibly physically demanding game, and because of this players are regularly injured. The model and dataset are based on the original squad that was selected by coach Warren Gatland. The model itself is flexible enough to change the list of players to consider: you only need to edit the CSV source datasets. However, this example was done some time ago now and I am only just writing it up.
It is because of this that some of the new additions to the squad as injury cover (e.g. Marcus Smith) are not in the model options. Likewise, some players who are no longer in contention (e.g. Finn Russell) still get selected.
Even as I write this blog the squad selections are fluctuating, and there are players still being considered for Saturday. Alun Wyn Jones became injured and has amazingly recovered. Whether he recovers enough in time to regain his starting spot will be interesting to see.
You can view the raw datasets here:
Before we go into depth about building the model it is worth doing a brief Exploratory Data Analysis to look at the data we have available.
Figure 2 shows the available squad of 37 players selected for the Lions.
Figure 2 - Squad of Eligible Players for Selection
Many of the 37 players in the squad can cover multiple positions. In rugby the shirt numbers 1-15 are related to specific field positions. Positions 1-8 are ‘the forwards’, who focus on defensive and physically demanding roles to gain field territory. Positions 9-15 are ‘the backs’, the faster players who generally focus on attacking play, trying to use their speed to break through the defensive line and score tries (points).
Positions 1, 2, 3, 9 and 10 are generally regarded as specialist positions. Looking at Figure 3 we can see that these specialist roles only have 3 cover options available. Other positions have multiple players in contention, so we are not only selecting the best players, but the best players at a position level. This is interesting because you need to consider how to allocate a player who may be the best player available for more than one position.
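The position coverage idea can be sketched in a few lines of Python. This is a hedged illustration rather than the analysis behind Figure 3: the players and the shirt numbers they cover are hypothetical placeholders. Each player lists every shirt number they could wear, and we count how many options exist per position.

```python
from collections import Counter

# Hypothetical squad: each player maps to the shirt numbers they can cover.
coverage = {
    "Player A": [1, 3],
    "Player B": [10, 12, 15],
    "Player C": [12, 13],
    "Player D": [10],
}


def position_coverage(coverage):
    """Count how many players can cover each shirt number from 1 to 15."""
    counts = Counter()
    for positions in coverage.values():
        # Use a set so a player is counted once per position.
        counts.update(set(positions))
    return {shirt: counts.get(shirt, 0) for shirt in range(1, 16)}


def multi_position_players(coverage):
    """List players who are in contention for more than one position."""
    return sorted(
        name for name, positions in coverage.items() if len(set(positions)) > 1
    )


cover = position_coverage(coverage)
flexible = multi_position_players(coverage)
print(cover[10], cover[12], flexible)
```

The players returned by multi_position_players are the ones that create the allocation question: picking them in one shirt removes them as an option for another.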
Figure 3 - Position Coverage
Figure 4 shows that there is a reasonably even representation of players across the four home nations. It is interesting to note that England has the largest representation with 11 players, even though the England side had a relatively poor Six Nations.
Figure 4 - Representation by Home Nation
Looking at a club level, Figure 5 shows that most clubs have no more than 4 players, and only 16 clubs have players selected in the squad of 37. It is no surprise that the clubs with the most players have been performing well in their domestic leagues. Saracens have the most players, with 5, which is perhaps a little surprising as the side were relegated last season for persistently breaching the English Premiership’s salary cap. Many spectators felt that the Saracens players selected for England in the 2021 Six Nations did not play at their best, as they were out of practice playing in the top leagues.
Figure 5 - Representation by Club
Looking at the player physical characteristics, in Figure 6, we can see that (as one would expect) professional rugby players are generally quite tall. The heights are more or less normally distributed with the exception of some taller players to the far right – these are the Second Row players who are generally tall because they sit in the middle of the scrum and jump in the lineouts.
Figure 6 - Distribution of player heights
In Figure 7 we can see that there is a clear split in player weights. This is likely the split in weights between the speedy backs and bulky forwards.
Figure 7 - Distribution of player weights
When we look at height and weight together, coloured by position, as shown in Figure 8, there are clearly common attributes within each position. Props are generally shorter and heavier, locks are generally tall, etc. For anyone reading this who follows or has played rugby this won’t be a surprise.
This suggests that height and weight are not good inputs for selecting players, since players competing for the same position share similar attributes. From an optimization perspective it also does not particularly make sense to maximise or minimise either of these attributes.
Figure 8 - Player Attributes by Position
Looking at the distribution of players’ ages, in Figure 9, you can see most players center around roughly 28 years old. The youngest player is Louis Rees-Zammit, who is only 20 years old. He has not got much experience but has had an explosive start to his career and was a prominent starting player for Wales in their winning Six Nations campaign in 2021. The oldest player is Wales captain Alun Wyn Jones. A veteran at 35 years old, he has vast experience and has, for the most part, been able to manage his career injuries well. He did pick up a shoulder injury at the start of the Lions tour but has, amazingly, bounced back in readiness for the first test matches.
There’s a balance to be struck between youth and experience and from an optimization perspective once again it does not make sense to simply minimize or maximise age when selecting players. The approach we take in this project is to minimize the absolute deviation from the mean age of the group. This gives an implicit preference for players who are close to their peak playing career around 28 years old.
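As a rough illustration of the age measure described above, the following Python sketch computes each player's absolute deviation from the group mean age and sums it over a candidate selection. It is one possible reading of the approach, not the model code itself: the player names and ages are hypothetical, and the assumption here is that the mean is taken over the whole available group rather than over the selected players.

```python
from statistics import mean

# Hypothetical ages for illustration only.
ages = {"Player A": 20, "Player B": 28, "Player C": 35, "Player D": 29}


def age_deviations(ages):
    """Absolute deviation of each player's age from the group mean age."""
    group_mean = mean(list(ages.values()))
    return {name: abs(age - group_mean) for name, age in ages.items()}


def total_age_penalty(selection, deviations):
    """Sum of absolute deviations for a selection; lower is preferred."""
    return sum(deviations[name] for name in selection)


deviations = age_deviations(ages)
penalty_extremes = total_age_penalty(["Player A", "Player C"], deviations)
penalty_middle = total_age_penalty(["Player B", "Player D"], deviations)
print(deviations, penalty_extremes, penalty_middle)
```

With these placeholder ages the group mean is 28, so the pair closest to 28 gets the smaller penalty, which is the implicit preference for players near their peak.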
Figure 9 - Player Age Distributions
Finally, let’s take a quick look at player experience. Figure 10 shows that the vast majority of players have zero Lions caps. This makes sense because the tour only runs every four years and it’s tough to get into the team. Of the players who have played in previous tours, most have fewer than four caps, the equivalent of going on just one tour. Alun Wyn Jones is an impressive outlier with 9 caps.
Figure 10 - Distribution of Player Lions Caps
In terms of international caps, most players have at least 25 caps, as shown in Figure 11. There are several newer players in the squad who have not made many international appearances. These players are often referred to as ‘bolters’, and in previous Lions tours bolters have used the tour as an opportunity to make a mark and secure international attention from their home nation. Past famous bolters include the likes of former England captain Martin Johnson and Ireland’s Keith Earls.
Figure 11 - Distribution of Player International Caps
However, in almost all cases players have at least 50 appearances at premiership level. The one exception is Louis Rees-Zammit, who at 20 years of age has only had around 32 appearances with Gloucester, yet has an impressive 9 international caps with Wales. It will be very interesting to see whether his impressive league stats push him up the model selection list despite his relative inexperience.
Figure 12 - Distribution of Player Club Appearances
In the next article we explore the method used in building the model. Read Part 3 of the series here: Using SAS Viya to Select the British & Irish Lions Rugby Team: Part 3