We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Summarizing your data - the first step in data cleanliness

by Super Contributor on ‎04-21-2017 02:07 PM (1,466 Views)

I've often talked about the importance of summarising and reviewing your data before going too far into your analysis. Spending a couple of hours making sure everything "makes sense" could potentially save you a lot of time in the long run.


As I work in IT, and I have a long family history of mental health issues, I was interested to find a survey that looks at technology workers and their experiences with mental health.  I wanted to see what sorts of interesting things were in this data, so a data summary is the logical first step. FreeDataFriday_graphic.jpg


Get the Data

You can get the data from here; it's a standard CSV file.  I would recommend not opening it in Excel first as the "Number of Employees" column has the values '1-5' and '6-25' which Excel then annoyingly converts to dates.  Directly importing the data to SAS University Edition works fine, so no need to open the file beforehand. 


How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.


Getting the data ready

The data was already in a format that I could use, as it was in standard CSV format. 


The Results

In SAS University Edition there are a couple of tasks that I use -- one is the aptly named "Summarize Data." I normally would do a thorough review of all the data, but for the purposes of this article I'm going to focus on two variables. The first is the Age of the Employee who filled out the survey.  When I run my initial Summarize Data task on the dataset, (only specifying Age in the Analysis Variable section and Employees in the Classification section), I get this:




Right away, we see something bizarre - the mean age for 1-5 Employees is 617,283,983. Obviously, something went very wrong somewhere!


I could pinpoint the errant data a couple of ways, but I went with a quick PROC SQL query to pull out those employees who work for companies with 1 - 5 staff:




I scroll through the output and quickly see the issue:




Now, if this were data where I had additional information, I may do some digging to find out this person's real age; however, for open data like this we don't have that option, so I decide to create a new table dropping those people who put an age less than 17 and over 100:




Now, when I run the query I get a much cleaner result set (and from the looks of it, there were about 8 observations that were dropped):




Interestingly enough, though there is a significant spread (18 - 72), the mean age for each group hovers around 32 years of age, with the standard deviation being fairly stable between 6 and 8 years.


Because this is a survey on mental health, let's throw in the Family History variable as a secondary Classification variable:




Once again, there seems to be consistency within the groups as far as age; within the Employee groups, there also seems to be a consistent 60/40 split between those who do not have and those who have a Family History.


The last analysis I wanted to see was grouping the output by whether or not the employees sought help for their own personal mental health issues. This is in the "Additional Roles" section and is the "Group Analysis By" option.  When I run this, I get:




I realise this may seem a bit excessive, but the next image proves why this is important.  When I run the task looking at Gender, I get a huge variety of responses; most are relatively easy to bucket (M, cis Male, M, etc) but if Gender is a variable of interest, understanding that "Guy-ish" and "Agender" are very different is going to be critical.  If the assumption is that the responses will be the socially typical "Male" / "Female", this data will definitely throw a curve plan in your plans. 




Now it’s your turn!


Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.


Need data for learning?


The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:




We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:




Click Analytics U, then select "Subscribe" from the Options menu.


Happy Learning!



Your turn
Sign In!

Want to write an article? Sign in with your profile.

Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.