BookmarkSubscribeRSS Feed

How to Assess and Finesse Messy Data: Step Two - Outliers

Started ‎02-10-2017 by
Modified ‎08-04-2021 by
Views 2,239

Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:

 

Access Now

Scrutinize your outliers.jpg

 

A friend has a poster that paraphrases a certain animated hunter who relentlessly pursued a certain wascally wabbit. “Be vewy vewy quiet," it reads. "I'm hunting outwiers!”

 

It speaks to a great point: Care and patience are absolutely critical as you track down outliers.

 

Today's Free Data Friday post builds on a previous one -- How to Assess Messy Data: Step One - Data Review -- so that a closer examination of those data will prevent you from going down a rabbit hole. Outliers in your data are sneaky and elusive. You must handle them with finesse.

 

Get the data

As with my last post, How To Assess Messy Data: Step One - Data Review, I’m using data from the City of Edmonton, Alberta, and is a listing of the city-owned trees.

 

Get started with SAS OnDemand for Academics

 
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
 

Get Started

 

Get the data readyFreeDataFriday_graphic.jpg

The data was already in a format that I could use, as it was in standard CSV format.

 

The results

Now that we’ve done a review of the data, the next step is to find and figure out what we’re going to do with our outliers.  Outliers can be caused in many ways, for many reasons; they are not automatically “bad” and so require some investigation.  I tend to have three categories that I classify outliers:

 

  • Level I – Clearly an error (for example, someone’s age coming up as 150 years old)
  • Level II – Likely an error (for example, a couple listed as having 10 kids under 6 years of age)
  • Level III – Potentially an error (for example, Harold J. Smith and Harry Joseph Smith with different addresses)

Each of the three levels require some sort of investigation as to the cause; is it a data collection issue, an issue of transferring information from handwritten notes to electronic, or something more complex like a database join problem? I had one memorable experience when I was very new to analysis and I had a Level I outlier. It turned out the data was all correct, but when transferred from one system to another, some dates in like 08-09-18 (September 18, 2008) were read as August 9, 2018 and so threw off all our forecasts. We didn’t realise this until we started doing a row-by-row comparison of the datasets.

 

So, turning our attention back to our Tress data, let’s see if we can find some outliers. First off, let’s take a look at the age of the trees:

 

 

proc sql;
create table work.TreesA as
select  ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge
from work.import;
select
min(TreeAge),
max(TreeAge) from work.TreesA;
quit;

 

Here’s the output, showing the Min and Max ages:

 

 2.png

 

So clearly the -88 years is wrong.  The 117 year-old tree(s) however may be correct; if you recall from last week’s article however, the date planted was December, 1899.  If you have ever been to Edmonton in winter, you know tree planting is not something you’d want to do, and especially without the heavy machinery that would not have been available in 1899. 

 

When I use the Histogram task to see what the distribution looks like, I get this:

3.png

4.png

So from the data, about 70% of the trees were planted all about 25 years ago.  Without knowing the history of Edmonton, I would guess that there was a population boom and so many new houses were built, and therefore new trees planted.

 

What happens when I remove the outliers from the dataset (yes, I could have used the WHERE clause in the Histogram task, but it’s a personal quirk of mine that I like having separate tables when I’m doing my data review)?

 

proc sql;
create table work.TreesB as
select  ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge
from work.import
where ('03FEB2017'd-PLANTED_DATE)/365 <100
and ('03FEB2017'd-PLANTED_DATE)/365>0;
quit;

 

 5.png

We now have a better sense of the data that can potentially be used in our analyses.

 

Next, I want to make sure that the geolocation data makes sense; although SAS University Edition doesn’t have mapping capabilities, there are still some interesting things you can do (topics for another time).

 

Here’s a scatterplot of the data:

 

6.png

 

In looking at the Latitudes / Longitudes in Google Earth, I realised that the -113.7 was North, so the scatterplot is upside down.  Realising that, you can see the path of the river that cuts through the middle of the city (bottom left through to the top right).  There are a couple of potential outliers, but nothing that sets off warning bells.

 

 7.png

 

Although it’s time consuming, doing a thorough review of your outliers allows you to have a better understanding of the data.  Questioning the users and database administrators about the data that falls into the three categories allows you as an analyst to better understand the business processes that go into the data collection, but also allows you to provide better insights for the people making decisions based on your reports.   

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Comments

An interesting outlier I had in one project related to age and Sexually Transmitted Disease testing. When I started working with a new project and inherited data I used some of these outlier detection approaches. I discovered that we had a spike of age at testing of 100 years. Further investigation showed that every one of these folks had gotten an STD test on their hundredth birthday.

 

Actually the actual birth date had not been recorded at the time of testing but one of the data systems required one. So the decision had been made to use the date of the test pushed back 100 years as the birthdate.

@ballardw Clearly, it's a rite of passage for any centenarian.

@ballardw thanks for the real-world example!  I have found similar experiences but nothing so dramatic - you definitely "win" the prize haha.  have a great weekend and chat soon!!

 

Chris

Version history
Last update:
‎08-04-2021 10:22 AM
Updated by:

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags