Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
A friend has a poster that paraphrases a certain animated hunter who relentlessly pursued a certain wascally wabbit. “Be vewy vewy quiet," it reads. "I'm hunting outwiers!”
It speaks to a great point: Care and patience are absolutely critical as you track down outliers.
Today's Free Data Friday post builds on a previous one -- How to Assess Messy Data: Step One - Data Review -- so that a closer examination of those data will prevent you from going down a rabbit hole. Outliers in your data are sneaky and elusive. You must handle them with finesse.
The data was already in a format that I could use, as it was in standard CSV format.
Now that we’ve done a review of the data, the next step is to find and figure out what we’re going to do with our outliers. Outliers can be caused in many ways, for many reasons; they are not automatically “bad” and so require some investigation. I tend to have three categories that I classify outliers:
Each of the three levels require some sort of investigation as to the cause; is it a data collection issue, an issue of transferring information from handwritten notes to electronic, or something more complex like a database join problem? I had one memorable experience when I was very new to analysis and I had a Level I outlier. It turned out the data was all correct, but when transferred from one system to another, some dates in like 08-09-18 (September 18, 2008) were read as August 9, 2018 and so threw off all our forecasts. We didn’t realise this until we started doing a row-by-row comparison of the datasets.
So, turning our attention back to our Tress data, let’s see if we can find some outliers. First off, let’s take a look at the age of the trees:
proc sql; create table work.TreesA as select ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge from work.import; select min(TreeAge), max(TreeAge) from work.TreesA; quit;
Here’s the output, showing the Min and Max ages:
So clearly the -88 years is wrong. The 117 year-old tree(s) however may be correct; if you recall from last week’s article however, the date planted was December, 1899. If you have ever been to Edmonton in winter, you know tree planting is not something you’d want to do, and especially without the heavy machinery that would not have been available in 1899.
When I use the Histogram task to see what the distribution looks like, I get this:
So from the data, about 70% of the trees were planted all about 25 years ago. Without knowing the history of Edmonton, I would guess that there was a population boom and so many new houses were built, and therefore new trees planted.
What happens when I remove the outliers from the dataset (yes, I could have used the WHERE clause in the Histogram task, but it’s a personal quirk of mine that I like having separate tables when I’m doing my data review)?
proc sql; create table work.TreesB as select ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge from work.import where ('03FEB2017'd-PLANTED_DATE)/365 <100 and ('03FEB2017'd-PLANTED_DATE)/365>0; quit;
We now have a better sense of the data that can potentially be used in our analyses.
Next, I want to make sure that the geolocation data makes sense; although SAS University Edition doesn’t have mapping capabilities, there are still some interesting things you can do (topics for another time).
Here’s a scatterplot of the data:
In looking at the Latitudes / Longitudes in Google Earth, I realised that the -113.7 was North, so the scatterplot is upside down. Realising that, you can see the path of the river that cuts through the middle of the city (bottom left through to the top right). There are a couple of potential outliers, but nothing that sets off warning bells.
Although it’s time consuming, doing a thorough review of your outliers allows you to have a better understanding of the data. Questioning the users and database administrators about the data that falls into the three categories allows you as an analyst to better understand the business processes that go into the data collection, but also allows you to provide better insights for the people making decisions based on your reports.
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.