Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option.
A friend has a poster that paraphrases a certain animated hunter who relentlessly pursued a certain wascally wabbit. “Be vewy vewy quiet,” it reads. “I'm hunting outwiers!”
It speaks to a great point: Care and patience are absolutely critical as you track down outliers.
Today's Free Data Friday post builds on a previous one, How to Assess Messy Data: Step One - Data Review, with a closer examination of those data so you don't end up down a rabbit hole. Outliers in your data are sneaky and elusive; you must handle them with finesse.
As with my last post, I'm using data from the City of Edmonton, Alberta: a listing of the city-owned trees.
The data was already in standard CSV format, so it needed no conversion before use.
Now that we’ve done a review of the data, the next step is to find our outliers and figure out what to do with them. Outliers can arise in many ways, for many reasons; they are not automatically “bad” and so require some investigation. I tend to classify outliers into three levels:
Each of the three levels requires some investigation as to the cause: is it a data collection issue, an error in transferring information from handwritten notes to electronic records, or something more complex like a database join problem? I had one memorable experience when I was very new to analysis and hit a Level I outlier. It turned out the data was all correct, but when it was transferred from one system to another, some dates like 08-09-18 (September 18, 2008) were read as August 9, 2018, which threw off all our forecasts. We didn’t realise this until we started doing a row-by-row comparison of the datasets.
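To see how that kind of date mix-up happens, here's a minimal SAS sketch (the dataset and variable names are made up for illustration): the same eight-character string parses to two different dates depending on which informat is applied, and the YEARCUTOFF option governs how the two-digit years are expanded.

```sas
data date_check;
   raw = '08-09-18';
   /* Read year-month-day: the intended September 18, 2008 */
   as_ymd = input(raw, yymmdd8.);
   /* Read the same string month-day-year: August 9, 2018  */
   as_mdy = input(raw, mmddyy8.);
   format as_ymd as_mdy date9.;
run;
```

Printing date_check shows two dates a decade apart from one identical source string, which is exactly the silent error a row-by-row comparison eventually caught.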
So, turning our attention back to our Trees data, let’s see if we can find some outliers. First off, let’s take a look at the age of the trees:
proc sql;
   /* Approximate each tree's age in years as of 03FEB2017.        */
   /* Dividing by 365 ignores leap years, but it's close enough    */
   /* for outlier hunting.                                         */
   create table work.TreesA as
      select ('03FEB2017'd - PLANTED_DATE)/365 as TreeAge
      from work.import;

   /* Quick range check on the calculated ages */
   select min(TreeAge) as MinAge,
          max(TreeAge) as MaxAge
   from work.TreesA;
quit;
Here’s the output, showing the Min and Max ages:
So clearly the -88 years is wrong. The 117-year-old tree(s) might seem plausible; if you recall from last week’s article, though, the date planted was December 1899. If you have ever been to Edmonton in winter, you know tree planting is not something you’d want to do, especially without the heavy machinery that would not have been available in 1899.
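To see which rows are driving those extremes, a quick query like this sketch (reusing the PLANTED_DATE column and cutoff date from above) pulls anything with a negative or implausibly large age:

```sas
proc sql;
   /* List the planted dates behind the suspect ages */
   select PLANTED_DATE format=date9.,
          ('03FEB2017'd - PLANTED_DATE)/365 as TreeAge
   from work.import
   where calculated TreeAge < 0
      or calculated TreeAge > 100;
quit;
```

The CALCULATED keyword lets the WHERE clause refer to the TreeAge alias without repeating the arithmetic.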
When I use the Histogram task to see what the distribution looks like, I get this:
From the data, about 70% of the trees were planted roughly 25 years ago. Without knowing the history of Edmonton, I would guess there was a population boom: many new houses were built, and therefore new trees were planted.
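If you'd rather code the chart than use the Histogram task, something like this PROC SGPLOT sketch produces the same picture:

```sas
proc sgplot data=work.TreesA;
   histogram TreeAge;                   /* binned distribution of tree ages */
   xaxis label="Tree Age (years)";
run;
```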
What happens when I remove the outliers from the dataset (yes, I could have used the WHERE clause in the Histogram task, but it’s a personal quirk of mine that I like having separate tables when I’m doing my data review)?
proc sql;
   /* Keep only plausible ages: greater than 0 and under 100 years */
   create table work.TreesB as
      select ('03FEB2017'd - PLANTED_DATE)/365 as TreeAge
      from work.import
      where calculated TreeAge > 0
        and calculated TreeAge < 100;
quit;
We now have a better sense of the data that can potentially be used in our analyses.
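A quick sanity check on the filtered table confirms the ages now fall in a sensible range; for example:

```sas
proc means data=work.TreesB n min max mean;
   var TreeAge;   /* min should now be above 0, max below 100 */
run;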
Next, I want to make sure that the geolocation data makes sense; although SAS University Edition doesn’t have mapping capabilities, there are still some interesting things you can do (topics for another time).
Here’s a scatterplot of the data:
In looking at the latitudes/longitudes in Google Earth, I realised that the -113.7 values are longitudes (Edmonton sits near 113.5° west), so with those on the wrong axis the scatterplot comes out flipped. Realising that, you can see the path of the river that cuts through the middle of the city (bottom left through to the top right). There are a couple of potential outliers, but nothing that sets off warning bells.
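For reference, the scatterplot itself can be sketched like this (I'm assuming the coordinate columns are named LATITUDE and LONGITUDE; adjust to match the actual import):

```sas
proc sgplot data=work.import;
   scatter x=LONGITUDE y=LATITUDE /
           markerattrs=(size=3);   /* small markers: there are a lot of trees */
run;
```

Putting longitude on x and latitude on y gives the conventional map orientation, which avoids the flipped plot described above.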
Although it’s time-consuming, a thorough review of your outliers gives you a better understanding of the data. Questioning the users and database administrators about the data that falls into the three categories not only helps you as an analyst understand the business processes behind the data collection, but also lets you provide better insights for the people making decisions based on your reports.
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
An interesting outlier I encountered in one project related to age and Sexually Transmitted Disease testing. When I started a new project and inherited its data, I used some of these outlier-detection approaches and discovered a spike of age at testing of 100 years. Further investigation showed that every one of these folks had apparently gotten an STD test on their hundredth birthday.
In fact, the birth date had not been recorded at the time of testing, but one of the data systems required one. So the decision had been made to use the date of the test pushed back 100 years as the birthdate.
@ballardw Clearly, it's a rite of passage for any centenarian.
@ballardw thanks for the real-world example! I have found similar experiences but nothing so dramatic - you definitely "win" the prize haha. have a great weekend and chat soon!!
Chris