BookmarkSubscribeRSS Feed
inquistive
Quartz | Level 8

Hi Experts,

 

Suppose I have a couple of data sets like the following(this is just a sample data):

data have1;
format dates MMDDYY10.;
input dates DATE9.;
cards;
14jan2001
5apr2801
6apr2011
5jul2511
6jul2011
14aug2511
RUN;

data have2;
format dates MMDDYY10.;
input dates DATE9.;
cards;
14jan2021
5apr2911
6apr3201
5jul2011
6jul2912
14aug2501
RUN;

HAVE1:

inquistive_0-1655931427890.png

HAVE2:

inquistive_1-1655931460330.png

2011 and 2021 are the only valid dates, let's suppose. Somebody keyed the wrong dates and the database rules did not catch them while processing to store them.

Please consider the real data is too large and the unusual dates (like  2501, 3201, 2911,2801, 2511 in the above datasets are the outliers )  are buried under millions of rows. I tried with proc means, proc univariate, proc frequency and even  proc sgplot to see how they behave. Proc means was the easiest and closest to desired outcome. I used proc means after  getting unique rows( sorted them and applied nodupkey), tabulated with proc tabulate. Butt it only gave limited data/rows(I omitted median, mean etc. intentionally)-one each for min and max and that was not enough to capture all the unusual dates. I wanted all the outliers to be captured but it didn't. 

Is there any better procedure to do this ? Please share any useful URLs and links that are helpful for an ordinary (not advanced) SAS programmer.

I appreciate your help on this.  

6 REPLIES 6
PaigeMiller
Diamond | Level 26

I don't view this as a statistical problem that can be solved with statistical tools.

 

I view this as a definition/logical problem. You determine what rules you want to set up to catch the "unusual" date, and then you can write code to find these unusual dates.

--
Paige Miller
Quentin
Super User

If you're checking for gross errors in year, I would use PROC FREQ:

 

proc freq data=have;
  tables mydate;
  format mydate year4.;
run;

Nothing beats seeing the values.

 

If you know in advance what the valid years are, then you can use a DATA step, e.g.

data oops;
  set have;
  if year(mydate) NOT IN (2021,2022);
run;
The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
SASKiwi
PROC Star

Start by defining a valid date range. What is the earliest possible date? What is the latest possible date? Let's assume your earliest possible date is 01 Jan 2010 and your latest date is yesterday, then you can flag invalid dates like this:

if date < '01Jan2010'd or date >= today() then Invalid_Date_Flag = 'Y';  
ballardw
Super User

I have spent a LOT of time catching unusual data.

There are many different types some are easier to spot than others.

 

Simple: Is the value one of a list of values, think a typical survey question with responses of A, B, C or D or a 1 to 5 scale). Extensions of this County or City names in an area or that your company has branches in.

These may be addressed at the time data is read with custom informats telling you when invalid values occur.

 

Slightly more complex: Known (or expected) range of values.       2010 le year(datevariable) le 2021 for example.

Instrument or measured data often falls into this category.

 

More interesting are things like seasonal differences. Air temperature of 120 F in summer might not be excessive but unexpected in winter.

 

The "statistical" part of these might be finding historical ranges to set triggers.

 

The real headaches come from referential data. Is this persons name associated with this account actually different/same from a different account? Is this street name valid in SomeCityName, USA, how about the number if the street is okay.

 

A good starting reference:  https://www.amazon.com/Codys-Cleaning-Techniques-Using-Third/dp/1629607967 which actually uses SAS as the software.

 

Quentin
Super User
Just want to second the recommendation of Ron Cody's book. It will teach you a ton about data cleaning, and also a ton about SAS programming in general. I think this book should be required reading for SAS programmers.
The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.
Patrick
Opal | Level 21

If you have millions of rows then some "statistical" approach could work - like all rows with a year that exists in less than 0.5% of all rows. That's something you could capture using Prof Freq or even a SAS data step.

 

Or if this is some regular process where you have history data that's already validated then you could also compare distribution of current to history. For this I don't have the necessary stats background but I'm sure someone could provide the guidance if that's your situation. Using history data would have the advantage to also capture situation where all the data is wrong.
And of course using other information you've got like expected ranges always helps.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 874 views
  • 13 likes
  • 6 in conversation