I was having an interesting conversation with friends earlier this week and the topic of “data checks you should before you do anything else” came up. I thought it was an appropriate topic for this week’s Free Data Friday post. A well-done data review can save you a lot of frustration, project delays, and back-and-forth emails / phone calls.
Get the Data
I wanted to look at data that was “messy” – missing values, bad data, etc.; one outcome of the increased interest in Open Data is that organizations are now becoming more aware of their data and cleaning it before it is published.
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully.
Getting the data ready
The data was already in a format that I could use, as it was in standard CSV format.
The first thing I do is after importing the data (assuming no errors that indicated a review of the raw data was needed), I run PROC CONTENTS...
PROC CONTENTS DATA=WORK.IMPORT; RUN;
...which produces three tables, but I want to highlight two of them:
The first screen shot shows some rather boring information, but I’ve highlighted the key pieces. Knowing the number of observations lets me decide if, when doing my analyses and data review, if I want to limit my output to a smaller data set. For example, if I want to review the data that is missing Latitude or Longitude, I don’t necessarily need to see all the data – I am just looking for a pattern, and a sample of about 10% should give me that.
The second screenshot is the listing of the variables and their formats; this is something I put into Excel and have open as I’m writing my SAS code. Knowing this information is obviously important, and working with a dataset that has 3000 variables, I can’t remember all of their names and formats.
My next step is to typically run the Characterise Data task, specifically focussing on my Date and Numeric data; in this example, I’m only running the task on Diameter_Breast_Height, Condition_Percent, and Planted_Date:
:Nothing really interesting except when I scroll to the bottom, and see this:
I’m suspicious of the 1899 date (planting a tree in December in Edmonton? I really don’t think so!) and the 2105 date is clearly a typo. It’ll be up to you and the person requesting the analysis to decide how to handle these.
My next task is to look at the missing data; I’m only looking at Categorical Data for this example.
When I run the task I get this:
I have 3.2% of my SPECIES data missing; that may not be an issue, or if that’s the key variable, that may be enough to trigger a review of the data collection process.
I’ve decided in my review that I need more information on the SPECIES variable, so I’m going to go back to the Characterise Data task and run it specifically on this one column.
I’ve truncated the output to highlight the three SPECIES I wanted to focus on. For some reason, Cherry and Plum have brackets around them; that may be to have them sort to the top of the list, or for some reason. The more (potentially) concerning fact is that there are almost 21,000 trees whose species is listed as “X”. Combined with the missing data, that’s almost 10% of the total; even if this column is going to be a secondary part of your analysis, it should be investigated before you get too much further.
The last piece I wanted to show was a Box Plot; these are extremely handy as they show your Mean, Median, Average, and Interquartile Range all on a single output. I’m running my graph comparing the GENUS to the DIAMETER_BREAST_HEIGHT:
Here’s the output:
The first thing I see is that there are some clear outliers that I need to investigate; the diameter of >200cm seems to be a good benchmark overall, and there may be a couple of others I’d want to review (for example, the PICEA at 150cm seems to stick out). I’m also seeing that there are a number of “Unassigned” trees, and those should be pulled out of the data as well.
Context is everything
I should stress that this is not an absolute nor complete list; your data review will depend highly on what you're looking at. Missing data may be fine in genomic data, but not if you're doing housing-market analysis. "Unknown" may be appropriate if you're dealing with a catalogue of astronomical phenomenon, but not when dealing with diagnosis of a patient in healthcare data.
Finally, I’d like to recommend three fairly slim books that can set you up for success with your data analysis. I have a very strong 80/20 rule when it comes to data – I spend 80% of my time reviewing, exploring and understanding the data, so then I spend 20% of the time actually writing the code. These books all embrace that philosopy:
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.