Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
I was having an interesting conversation with friends earlier this week and the topic of “data checks you should before you do anything else” came up. I thought it was an appropriate topic for this week’s Free Data Friday post. A well-done data review can save you a lot of frustration, project delays, and back-and-forth emails / phone calls.
Get the Data
I wanted to look at data that was “messy” – missing values, bad data, etc.; one outcome of the increased interest in Open Data is that organizations are now becoming more aware of their data and cleaning it before it is published.
After a lot of digging around (weather, sports, etc.) I found a dataset that met my requirements. It’s from the City of Edmonton Open Data Portal and is a listing of the city-owned trees.
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the data ready
The data was already in a format that I could use, as it was in standard CSV format.
The Results
The first thing I do is after importing the data (assuming no errors that indicated a review of the raw data was needed), I run PROC CONTENTS...
PROC CONTENTS DATA=WORK.IMPORT; RUN;
...which produces three tables, but I want to highlight two of them:
The first screen shot shows some rather boring information, but I’ve highlighted the key pieces. Knowing the number of observations lets me decide if, when doing my analyses and data review, if I want to limit my output to a smaller data set. For example, if I want to review the data that is missing Latitude or Longitude, I don’t necessarily need to see all the data – I am just looking for a pattern, and a sample of about 10% should give me that.
The second screenshot is the listing of the variables and their formats; this is something I put into Excel and have open as I’m writing my SAS code. Knowing this information is obviously important, and working with a datas et that has 3000 variables, I can’t remember all of their names and formats.
My next step is to typically run the Characterise Data task, specifically focussing on my Date and Numeric data; in this example, I’m only running the task on Diameter_Breast_Height, Condition_Percent, and Planted_Date:
:Nothing really interesting except when I scroll to the bottom, and see this:
I’m suspicious of the 1899 date (planting a tree in December in Edmonton? I really don’t think so!) and the 2105 date is clearly a typo. It’ll be up to you and the person requesting the analysis to decide how to handle these.
My next task is to look at the missing data; I’m only looking at Categorical Data for this example.
When I run the task I get this:
I have 3.2% of my SPECIES data missing; that may not be an issue, or if that’s the key variable, that may be enough to trigger a review of the data collection process.
I’ve decided in my review that I need more information on the SPECIES variable, so I’m going to go back to the Characterise Data task and run it specifically on this one column.
I’ve truncated the output to highlight the three SPECIES I wanted to focus on. For some reason, Cherry and Plum have brackets around them; that may be to have them sort to the top of the list, or for some reason. The more (potentially) concerning fact is that there are almost 21,000 trees whose species is listed as “X”. Combined with the missing data, that’s almost 10% of the total; even if this column is going to be a secondary part of your analysis, it should be investigated before you get too much further.
The last piece I wanted to show was a Box Plot; these are extremely handy as they show your Mean, Median, Average, and Interquartile Range all on a single output. I’m running my graph comparing the GENUS to the DIAMETER_BREAST_HEIGHT:
Here’s the output:
The first thing I see is that there are some clear outliers that I need to investigate; the diameter of >200cm seems to be a good benchmark overall, and there may be a couple of others I’d want to review (for example, the PICEA at 150cm seems to stick out). I’m also seeing that there are a number of “Unassigned” trees, and those should be pulled out of the data as well.
Context is everything
I should stress that this is not an absolute nor complete list; your data review will depend highly on what you're looking at. Missing data may be fine in genomic data, but not if you're doing housing-market analysis. "Unknown" may be appropriate if you're dealing with a catalogue of astronomical phenomenon, but not when dealing with diagnosis of a patient in healthcare data.
Finally, I’d like to recommend three fairly slim books that can set you up for success with your data analysis. I have a very strong 80/20 rule when it comes to data – I spend 80% of my time reviewing, exploring and understanding the data, so then I spend 20% of the time actually writing the code. These books all embrace that philosopy:
Data Management for Researchers by Kristen Briney (non-SAS, but very good and covers a wide range of topics such as managing your files, storing / documenting code, etc).
Data Analysis Plans: A Blueprint for Success Using SAS by Kathleen Joblanski and Mark Guagliardo
A Recipe for Success Using SAS University Edition: How to Plan Your First Analytics Project by Sharon Jones
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
... View more