We’re smarter together. Learn from this collection of community knowledge and add your expertise.

How to Assess and Finesse Messy Data: Step Two - Outliers

by Regular Contributor on ‎02-10-2017 09:40 AM (863 Views)

Scrutinize your outliers.jpgA friend has a poster that paraphrases a certain animated hunter who relentlessly pursued a certain wascally wabbit. “Be vewy vewy quiet," it reads. "I'm hunting outwiers!”

 

It speaks to a great point: Care and patience are absolutely critical as you track down outliers.

 

Today's Free Data Friday post builds on a previous one -- How to Assess Messy Data: Step One - Data Review -- so that a closer examination of those data will prevent you from going down a rabbit hole. Outliers in your data are sneaky and elusive. You must handle them with finesse.  

 

Get the Data

As with my last post, How To Assess Messy Data: Step One - Data Review, I’m using data from the City of Edmonton, Alberta, and is a listing of the city-owned trees.

 

How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.

 

Getting the data readyFreeDataFriday_graphic.jpg

The data was already in a format that I could use, as it was in standard CSV format.

 

The Results

Now that we’ve done a review of the data, the next step is to find and figure out what we’re going to do with our outliers.  Outliers can be caused in many ways, for many reasons; they are not automatically “bad” and so require some investigation.  I tend to have three categories that I classify outliers:

 

  • Level I – Clearly an error (for example, someone’s age coming up as 150 years old)
  • Level II – Likely an error (for example, a couple listed as having 10 kids under 6 years of age)
  • Level III – Potentially an error (for example, Harold J. Smith and Harry Joseph Smith with different addresses)

Each of the three levels require some sort of investigation as to the cause; is it a data collection issue, an issue of transferring information from handwritten notes to electronic, or something more complex like a database join problem?  I had one memorable experience when I was very new to analysis and I had a Level I outlier.  It turned out the data was all correct, but when transferred from one system to another, some dates in like 08-09-18 (September 18, 2008) were read as August 9, 2018 and so threw off all our forecasts.  We didn’t realise this until we started doing a row-by-row comparison of the datasets.

 

So, turning our attention back to our Tress data, let’s see if we can find some outliers.  First off, let’s take a look at the age of the trees:

 

 

proc sql;
create table work.TreesA as
select  ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge
from work.import;
select
min(TreeAge),
max(TreeAge) from work.TreesA;
quit;

 

Here’s the output, showing the Min and Max ages:

 2.png

 

So clearly the -88 years is wrong.  The 117 year-old tree(s) however may be correct; if you recall from last week’s article however, the date planted was December, 1899.  If you have ever been to Edmonton in winter, you know tree planting is not something you’d want to do, and especially without the heavy machinery that would not have been available in 1899. 

 

When I use the Histogram task to see what the distribution looks like, I get this:

3.png

4.png

So from the data, about 70% of the trees were planted all about 25 years ago.  Without knowing the history of Edmonton, I would guess that there was a population boom and so many new houses were built, and therefore new trees planted.

 

What happens when I remove the outliers from the dataset (yes, I could have used the WHERE clause in the Histogram task, but it’s a personal quirk of mine that I like having separate tables when I’m doing my data review)?

 

proc sql;
create table work.TreesB as
select  ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge
from work.import
where ('03FEB2017'd-PLANTED_DATE)/365 <100
and ('03FEB2017'd-PLANTED_DATE)/365>0;
quit;

 

 5.png

We now have a better sense of the data that can potentially be used in our analyses.

 

Next, I want to make sure that the geolocation data makes sense; although SAS University Edition doesn’t have mapping capabilities, there are still some interesting things you can do (topics for another time).

 

Here’s a scatterplot of the data:

 

6.png

 

In looking at the Latitudes / Longitudes in Google Earth, I realised that the -113.7 was North, so the scatterplot is upside down.  Realising that, you can see the path of the river that cuts through the middle of the city (bottom left through to the top right).  There are a couple of potential outliers, but nothing that sets off warning bells.

 

 7.png

 

Although it’s time consuming, doing a thorough review of your outliers allows you to have a better understanding of the data.  Questioning the users and database administrators about the data that falls into the three categories allows you as an analyst to better understand the business processes that go into the data collection, but also allows you to provide better insights for the people making decisions based on your reports.   

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

 8.png

 

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

 9.png

 

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

 

 

Comments
by Super User
on ‎02-10-2017 12:43 PM

An interesting outlier I had in one project related to age and Sexually Transmitted Disease testing. When I started working with a new project and inherited data I used some of these outlier detection approaches. I discovered that we had a spike of age at testing of 100 years. Further investigation showed that every one of these folks had gotten an STD test on their hundredth birthday.

 

Actually the actual birth date had not been recorded at the time of testing but one of the data systems required one. So the decision had been made to use the date of the test pushed back 100 years as the birthdate.

by Community Manager
on ‎02-10-2017 01:07 PM

@ballardw Clearly, it's a rite of passage for any centenarian.

by Regular Contributor
on ‎02-10-2017 08:09 PM

@ballardw thanks for the real-world example!  I have found similar experiences but nothing so dramatic - you definitely "win" the prize haha.  have a great weekend and chat soon!!

 

Chris

Your turn
Sign In!

Want to write an article? Sign in with your profile.