We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Data Analysis and the Summer Games

by Super Contributor on ‎08-19-2016 02:42 PM (493 Views)

The 2016 Rio Olympic Games are under way, so I wanted to play around with some related data. I have always been impressed by the level of athletic achievement that these people accomplish, and have always seen them as a symbol of hope that we can in fact work and play side-by-side. gamesdata.jpg


Get the Data

I found the London 2012 dataset (available here). The data was compiled by the UK newspaper The Guardian, and the spreadsheet was update in real time during the 2012 Olympics. 


How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.


Getting the data ready

I had to do a significant amount of work to clean the data in preparation for this article. If you’re interested in receiving the cleaned version, please post a comment or message me and I will happily share it with you. Data cleaning ranged from deleting a FreeDataFriday_graphic.jpgnumber of rows where the data was in the wrong column (for example, 41 cases where the Athlete’s name was in the country field, and no country indicated).


Also, and this may have been because I’m on a Mac, when I was merging the multiple spreadsheets, I was getting different data types for the dates. There is also one date that reads “2012/08/10+A1” and this caused problems importing to SAS until I found it and fixed it.


The Results

Because of the nature of the data, I felt that doing some graphs would be the best way to highlight some of the interesting things I’ve found.


The first graph I wanted to look at was Record counts.  Here’s the task I set up:




Before looking at the graph, here’s the description of the acronyms (taken from Wikipedia):

WR – World Record

OR – Olympic Record

AR – Area (or Continental Record)

NR – National Record (for a specific country)

OB – Olympic Best

PB – Personal Best

SB – Season’s Best


Here is the graph of the records:



I find it extremely impressive the number of Personal and Season’s bests the athletes put forward. I don’t know how this compares to other years, but regardless it once again shows the quality of athleticism these people have.


So let’s take this graph a step further and add Sex to the mix, and see how Women and Men compare.




Here’s the graph:




Female Athletes Surpass Male Counterparts in Records, Personal Bests

The women far outdid the men as far as new records; they received more National, Olympic and World Records, and by far had more Personal Bests than the men. I would be very interested in comparing women and men by sport – the assumption is that women are not as fast / strong as men, but I based on this graph, I’d actually like to see how they compare. 


For the next graph, I have to run a quick SQL query to create a new summary table. I’m taking the Sex and Date, and doing a count. As with all summary queries in SQL, I need to have the Group By and Order By statements at the end. 



Here’s the output.  The dots indicate that it was a team event, such as rowing or basketball; although the teams are male / female, they’re not a single athlete.




Here’s the task as I’ve prepared it; remember the “total” is the new variable I created.




Because I’m using dates as one of my variables, I’m creating a line chart. 




You can see the team sports spike early then go down, with a slight bump near the end (assumption is that this was the finals matches for those events).  I assume that the spike in the middle for the men / women is an overlap of the finals for the events from the first half, and then the preliminary matches for the second half.  With some work on manipulating the data, I’m sure this could be teased out.


The last set of graphs I wanted to look at were the events that had the highest number of records by sex.  Here is the query I used to create my table:



This is the task I’ve set up.  Because I’m using the Event name (which in some cases is quite long) I’m using a horizontal bar chart – this gives the labels more room without compromising the quality of the graph.  You’ll also note that I’ve specified a Where clause; because I want the men and women in separate graphs, I need to split them apart.  I could have created two separate tables in my SQL query, but using the Where clause in the task is much more efficient.




The first graph is looking at the Women; 11 records were achieved in the Women’s Heptathlon Hurdles, and 9 were achieved in Women’s 10,000m Final, 100m Preliminary, and the 5000m Round 1.




For the men, 10 records were achieved in the Men’s Decathlon 100m. and 8 in each of the 100m Preliminaries and 100m Round 1. 




I don’t distinguish between type of Record, but this could have been done by adding the variable into the SELECT, GROUP BY and ORDER BY statements and then putting it as a Group variable in the task.


Now it’s your turn!


Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.


Need data for learning?


The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:




We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:




Click Analytics U, then select "Subscribe" from the Options menu.


Happy Learning!




by Super User
on ‎08-19-2016 04:02 PM

Hi, can you send me the data please?

I want to try and recreate some of this as a Jupyter Notebook...I love the concept of the JN and curious to see how it works for demo purposes.

by Super Contributor
on ‎08-19-2016 08:46 PM

Shall do!  Actually I'll throw it into a Dropbox folder and share it here - it's open data, may as well keep it open :-)  WIll have to be tomorrow - need time to make room in by DB account LOL.


I'll give you both the "clean" and "dirty" data sets.  


Let me know how you find the JN.  i've been working (slowly) on an article about it and finding it an intriguing platform.


Chat soon



by Super Contributor
on ‎08-20-2016 08:30 AM

Data (both original data and the one I cleaned and used for the blog) can be found here.  @Reeza let me know if you would like any other datasets, I have tons :-)


Chat soon



Your turn
Sign In!

Want to write an article? Sign in with your profile.

Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.