Hi Experts,
Suppose I have a couple of data sets like the following(this is just a sample data):
data have1;
format dates MMDDYY10.;
input dates DATE9.;
cards;
14jan2001
5apr2801
6apr2011
5jul2511
6jul2011
14aug2511
RUN;
data have2;
format dates MMDDYY10.;
input dates DATE9.;
cards;
14jan2021
5apr2911
6apr3201
5jul2011
6jul2912
14aug2501
RUN;
HAVE1:
HAVE2:
2011 and 2021 are the only valid dates, let's suppose. Somebody keyed the wrong dates and the database rules did not catch them while processing to store them.
Please consider the real data is too large and the unusual dates (like 2501, 3201, 2911,2801, 2511 in the above datasets are the outliers ) are buried under millions of rows. I tried with proc means, proc univariate, proc frequency and even proc sgplot to see how they behave. Proc means was the easiest and closest to desired outcome. I used proc means after getting unique rows( sorted them and applied nodupkey), tabulated with proc tabulate. Butt it only gave limited data/rows(I omitted median, mean etc. intentionally)-one each for min and max and that was not enough to capture all the unusual dates. I wanted all the outliers to be captured but it didn't.
Is there any better procedure to do this ? Please share any useful URLs and links that are helpful for an ordinary (not advanced) SAS programmer.
I appreciate your help on this.
I don't view this as a statistical problem that can be solved with statistical tools.
I view this as a definition/logical problem. You determine what rules you want to set up to catch the "unusual" date, and then you can write code to find these unusual dates.
If you're checking for gross errors in year, I would use PROC FREQ:
proc freq data=have;
tables mydate;
format mydate year4.;
run;
Nothing beats seeing the values.
If you know in advance what the valid years are, then you can use a DATA step, e.g.
data oops;
set have;
if year(mydate) NOT IN (2021,2022);
run;
Start by defining a valid date range. What is the earliest possible date? What is the latest possible date? Let's assume your earliest possible date is 01 Jan 2010 and your latest date is yesterday, then you can flag invalid dates like this:
if date < '01Jan2010'd or date >= today() then Invalid_Date_Flag = 'Y';
I have spent a LOT of time catching unusual data.
There are many different types some are easier to spot than others.
Simple: Is the value one of a list of values, think a typical survey question with responses of A, B, C or D or a 1 to 5 scale). Extensions of this County or City names in an area or that your company has branches in.
These may be addressed at the time data is read with custom informats telling you when invalid values occur.
Slightly more complex: Known (or expected) range of values. 2010 le year(datevariable) le 2021 for example.
Instrument or measured data often falls into this category.
More interesting are things like seasonal differences. Air temperature of 120 F in summer might not be excessive but unexpected in winter.
The "statistical" part of these might be finding historical ranges to set triggers.
The real headaches come from referential data. Is this persons name associated with this account actually different/same from a different account? Is this street name valid in SomeCityName, USA, how about the number if the street is okay.
A good starting reference: https://www.amazon.com/Codys-Cleaning-Techniques-Using-Third/dp/1629607967 which actually uses SAS as the software.
If you have millions of rows then some "statistical" approach could work - like all rows with a year that exists in less than 0.5% of all rows. That's something you could capture using Prof Freq or even a SAS data step.
Or if this is some regular process where you have history data that's already validated then you could also compare distribution of current to history. For this I don't have the necessary stats background but I'm sure someone could provide the guidance if that's your situation. Using history data would have the advantage to also capture situation where all the data is wrong.
And of course using other information you've got like expected ranges always helps.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.