A friend has a poster that paraphrases a certain animated hunter who relentlessly pursued a certain wascally wabbit. “Be vewy vewy quiet," it reads. "I'm hunting outwiers!”
It speaks to a great point: Care and patience are absolutely critical as you track down outliers.
Today's Free Data Friday post builds on a previous one -- How to Assess Messy Data: Step One - Data Review -- so that a closer examination of those data will prevent you from going down a rabbit hole. Outliers in your data are sneaky and elusive. You must handle them with finesse.
Get the Data
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
The data was already in a format that I could use, as it was in standard CSV format.
Now that we’ve done a review of the data, the next step is to find and figure out what we’re going to do with our outliers. Outliers can be caused in many ways, for many reasons; they are not automatically “bad” and so require some investigation. I tend to have three categories that I classify outliers:
Each of the three levels require some sort of investigation as to the cause; is it a data collection issue, an issue of transferring information from handwritten notes to electronic, or something more complex like a database join problem? I had one memorable experience when I was very new to analysis and I had a Level I outlier. It turned out the data was all correct, but when transferred from one system to another, some dates in like 08-09-18 (September 18, 2008) were read as August 9, 2018 and so threw off all our forecasts. We didn’t realise this until we started doing a row-by-row comparison of the datasets.
So, turning our attention back to our Tress data, let’s see if we can find some outliers. First off, let’s take a look at the age of the trees:
proc sql; create table work.TreesA as select ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge from work.import; select min(TreeAge), max(TreeAge) from work.TreesA; quit;
Here’s the output, showing the Min and Max ages:
So clearly the -88 years is wrong. The 117 year-old tree(s) however may be correct; if you recall from last week’s article however, the date planted was December, 1899. If you have ever been to Edmonton in winter, you know tree planting is not something you’d want to do, and especially without the heavy machinery that would not have been available in 1899.
When I use the Histogram task to see what the distribution looks like, I get this:
So from the data, about 70% of the trees were planted all about 25 years ago. Without knowing the history of Edmonton, I would guess that there was a population boom and so many new houses were built, and therefore new trees planted.
What happens when I remove the outliers from the dataset (yes, I could have used the WHERE clause in the Histogram task, but it’s a personal quirk of mine that I like having separate tables when I’m doing my data review)?
proc sql; create table work.TreesB as select ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge from work.import where ('03FEB2017'd-PLANTED_DATE)/365 <100 and ('03FEB2017'd-PLANTED_DATE)/365>0; quit;
We now have a better sense of the data that can potentially be used in our analyses.
Next, I want to make sure that the geolocation data makes sense; although SAS University Edition doesn’t have mapping capabilities, there are still some interesting things you can do (topics for another time).
Here’s a scatterplot of the data:
In looking at the Latitudes / Longitudes in Google Earth, I realised that the -113.7 was North, so the scatterplot is upside down. Realising that, you can see the path of the river that cuts through the middle of the city (bottom left through to the top right). There are a couple of potential outliers, but nothing that sets off warning bells.
Although it’s time consuming, doing a thorough review of your outliers allows you to have a better understanding of the data. Questioning the users and database administrators about the data that falls into the three categories allows you as an analyst to better understand the business processes that go into the data collection, but also allows you to provide better insights for the people making decisions based on your reports.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.