The 2016 Rio Olympic Games are under way, so I wanted to play around with some related data. I have always been impressed by the level of athletic achievement that these people accomplish, and have always seen them as a symbol of hope that we can in fact work and play side-by-side.
Get the Data
I found the London 2012 dataset (available here). The data was compiled by the UK newspaper The Guardian, and the spreadsheet was update in real time during the 2012 Olympics.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
I had to do a significant amount of work to clean the data in preparation for this article. If you’re interested in receiving the cleaned version, please post a comment or message me and I will happily share it with you. Data cleaning ranged from deleting a number of rows where the data was in the wrong column (for example, 41 cases where the Athlete’s name was in the country field, and no country indicated).
Also, and this may have been because I’m on a Mac, when I was merging the multiple spreadsheets, I was getting different data types for the dates. There is also one date that reads “2012/08/10+A1” and this caused problems importing to SAS until I found it and fixed it.
Because of the nature of the data, I felt that doing some graphs would be the best way to highlight some of the interesting things I’ve found.
The first graph I wanted to look at was Record counts. Here’s the task I set up:
Before looking at the graph, here’s the description of the acronyms (taken from Wikipedia):
WR – World Record
OR – Olympic Record
AR – Area (or Continental Record)
NR – National Record (for a specific country)
OB – Olympic Best
PB – Personal Best
SB – Season’s Best
Here is the graph of the records:
I find it extremely impressive the number of Personal and Season’s bests the athletes put forward. I don’t know how this compares to other years, but regardless it once again shows the quality of athleticism these people have.
So let’s take this graph a step further and add Sex to the mix, and see how Women and Men compare.
Here’s the graph:
Female Athletes Surpass Male Counterparts in Records, Personal Bests
The women far outdid the men as far as new records; they received more National, Olympic and World Records, and by far had more Personal Bests than the men. I would be very interested in comparing women and men by sport – the assumption is that women are not as fast / strong as men, but I based on this graph, I’d actually like to see how they compare.
For the next graph, I have to run a quick SQL query to create a new summary table. I’m taking the Sex and Date, and doing a count. As with all summary queries in SQL, I need to have the Group By and Order By statements at the end.
Here’s the output. The dots indicate that it was a team event, such as rowing or basketball; although the teams are male / female, they’re not a single athlete.
Here’s the task as I’ve prepared it; remember the “total” is the new variable I created.
Because I’m using dates as one of my variables, I’m creating a line chart.
You can see the team sports spike early then go down, with a slight bump near the end (assumption is that this was the finals matches for those events). I assume that the spike in the middle for the men / women is an overlap of the finals for the events from the first half, and then the preliminary matches for the second half. With some work on manipulating the data, I’m sure this could be teased out.
The last set of graphs I wanted to look at were the events that had the highest number of records by sex. Here is the query I used to create my table:
This is the task I’ve set up. Because I’m using the Event name (which in some cases is quite long) I’m using a horizontal bar chart – this gives the labels more room without compromising the quality of the graph. You’ll also note that I’ve specified a Where clause; because I want the men and women in separate graphs, I need to split them apart. I could have created two separate tables in my SQL query, but using the Where clause in the task is much more efficient.
The first graph is looking at the Women; 11 records were achieved in the Women’s Heptathlon Hurdles, and 9 were achieved in Women’s 10,000m Final, 100m Preliminary, and the 5000m Round 1.
For the men, 10 records were achieved in the Men’s Decathlon 100m. and 8 in each of the 100m Preliminaries and 100m Round 1.
I don’t distinguish between type of Record, but this could have been done by adding the variable into the SELECT, GROUP BY and ORDER BY statements and then putting it as a Group variable in the task.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.