DATA Step, Macro, Functions and more

Logic Problem!!

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 101
Accepted Solution

Logic Problem!!

Hey All,

 

I am trying to figure out some elegant logic that will work for the following problem.  I have dates that come in the following three forms:

 

YYYYMMDD (20160112)

YYYYMM (201601)

YYMMDD(160112)

 

I need logic that can identify the different date forms, and then parse out the respective year, month and day.  For the YYYYMM format, the day will be automatically set to the first of the month.  I have some logic that works right now for this, but it is very hacky and will not work after the year 2020 (it involves taking advantage of the fact that SAS dates only go back to 1582.... if this is an indicator of how hacky my way is) 

 

 

I need the logic to work for all dates starting from 1900 going forward indefinitely into the future.  

 

Thanks for any help you can offer!


Accepted Solutions
Solution
‎01-12-2016 05:57 PM
Trusted Advisor
Posts: 1,118

Re: Logic Problem!!

Posted in reply to mahler_ji

As @Rick_SAS and others have pointed out very clearly, any kind of "logic" based on the date string in the filename alone will fail as soon as you have two identical date strings with different meanings. However, if you restrict the set of potential dates as far as possible (exclude Sundays, holidays, ... whatever) you may be able to exclude at least some ambiguous cases. The following code could be helpful to get a complete overview of the ambiguous cases. It starts with all possible date strings in the range 01JAN1975 - 31DEC2074. Feel free to adapt this date range as you like and insert selection or exclusion criteria as appropriate.

/* Create all possible date strings in a range */

data alldates;
do d='01JAN1975'd to '31DEC2074'd; /* adapt the date range as you like */
  dc=put(d, yymmddn8.); f='1'; output; /* YYYYMMDD */
  if day(d)=1 then do;
    dc=put(d, yymmn6.); f='2'; output; /* YYYYMM   */
  end;
  dc=put(d, yymmddn6.); f='3'; output; /* YYMMDD   */
end;
format d date9.;
run;

/* Select the ambiguous cases, determine max. number of interpretations */

proc sql;
create table ambig0 as
select *, count(*) as c
from alldates
group by dc
having c>1
order by dc, d;

select max(c) into :mc
from ambig0;
quit;

/* Enhance and restructure dataset of ambiguous cases */

data ambig;
do until(last.dc);
  set ambig0;
  by dc;
  array date[&mc];
  length fcomb $%eval(2*&mc-1);
  fcomb=catx(',',fcomb,f);
  i=sum(i,1);
  date[i]=d;
end;
drop d f i;
format date: date9.;
run;

As you can see, with the above date range there are 144 ambiguous cases, all involving two possible interpretations, the second and third format and the year 2020. Now you can concentrate on these cases:

  • How likely is it to receive financial data from the years 2001 - 2012 only in 2020?
  • Wouldn't a dataset labeled, say, "June 2012" contain differently structured/aggregated data than one labeled "06 Dec 2020" anyway (both "labels" showing as '201206', of course)?
  • If the file contains date values, does the filename matter at all?
  • ...

View solution in original post


All Replies
Super User
Posts: 19,850

Re: Logic Problem!!

Posted in reply to mahler_ji

I'm assuming you don't have another field that helps you to identify which of the three cases your dealing with?

 

Can you post some sample data and expected output?

 

 

Respected Advisor
Posts: 4,930

Re: Logic Problem!!

Posted in reply to mahler_ji

Indefinitely? Like after year 2099?

PG
Frequent Contributor
Posts: 101

Re: Logic Problem!!

I would be happy if it made it to 2099.  I have posted below what my current logic is, and you can see why I am going to currently fail at 2020.

 

If you can help me get to 2099, that would be amazing. 

Super User
Posts: 11,343

Re: Logic Problem!!

Posted in reply to mahler_ji

Does the field always start with the year? If not, good luck.

Is there a begining year value associated with your data?

Your comment about the 1582 makes be queestion how you know 160112 isn't in the form YYYYMM.

 

And you really should show what you currently have so we don't propose the same solution you are using.

Frequent Contributor
Posts: 101

Re: Logic Problem!!

Hey @ballardw,

 

I'm sorry, I should have been more specific.  The three date formats that I posted are the only three that are present:

 

YYYYMMDD

YYYYMM
YYMMDD

 

The problem that you mention, about descerning between 160115 is precisely how I ran into this problem.  I will include my current logic below:

 

%let inMonthIndex = %eval(&InYearIndex. + 4);
%let inDayIndex = %eval(&InYearIndex. + 6);


data file_list;
set file_list;
format QueryDate DATE9.;
informat QueryDate DATE9.;
format DataDate DATE9.;
informat DataDate DATE9.;
QueryDate = Date+&InQueryDateLag.;
MonthNum = substr(filename,&InMonthIndex.,2)*1;
DayNum = substr(filename,&InDayIndex.,2)*1;

/*HOKEY WAY TO DEAL WITH FILES THAT DON'T HAVE A MONTH AND DAY in FILENAME*/
if MonthNum < 1 or MonthNum > 12 then Month = "01"; else Month = substr(filename,&InMonthIndex.,2) ;
if DayNum < 1 or DayNum > 31 then Day = "01"; else Day = substr(filename,&InDayIndex.,2);

DataDate = mdy(Month,Day,substr(filename,&InYearIndex.,4));
is6digit = 0;
if DataDate = . or year(datadate) < 1915 then do;
DataDate = mdy( substr(filename,&InMonthIndex.-2,2),substr(filename,&InDayIndex.-2,2),substr(filename,&InYearIndex.,2) );
is6digit = 1;
end;
drop DayNum MonthNum Day month;
run;

 

Obviously I define &YearIndex earlier, (this takes care of filenames that begin with a prefix, like JM20160112, at which point the index would be 3)  This is an input into the macro.

 

You can see that my logic where I infer that the date is YYMMDD when the date is less than 1915, which is how I get rid of thinking that the date is 1601.  However, this will break when the clock strikes the year 2020.

 

Let me know if I can provide more information.

 

Contributor
Posts: 56

Re: Logic Problem!!

[ Edited ]
Posted in reply to mahler_ji

Is your data in character form? If not, I would say convert it as such to a character.

 

Then, you can take advantage of the ANYDTDTE format on an input statement.

 

For example,

 

data have;
  input date_raw $;
datalines;
20160112
  201601
  160112
;
run;

data want;
  set have;
  date_n = input(date_raw,anydtdte8.);
  format date_n mmddyy10.;
run;

I think that will handle Case 1 and Case 3.

 

For Case 2, I would expect a missing value to be returned for date_n (SAS can't get full information from just the year and month). 

 

For those missing rows, you could just concatenate on a "01" (or whatever day of the month you wanted to represent these as), and re-run against the ANYDTDTE format.

 

data want2;
  set want;
  if missing(date_n) then date_n = input(strip(date_raw)||"01",anydtdte8.);
run;

Or, you can combine these steps:

 

data want;
  set have;
  date_n = coalesce(input(date_raw,anydtdte8.),input(strip(date_raw)||"01",anydtdte8.));
  format date_n mmddyy10.;
run;

The coalesce function will return the first non-missing value of the two input statements. The first argument will only be missing when SAS doesn't have year, month, and day that it can find.

 

A few final notes: this format depends on your LOCALE, e.g. is 16/01/12 YMD , MDY, etc. Check your system option for DATESTYLE and make sure it is set to YMD (to fit your examples).

 

Also, ask yourself: 16/01/12. Assuming we know this is YEAR/MONTH/DAY, is the year 1916? 2016? 2116? You may need to explore one further piece to this (YEARCUTOFF= option).

 

So given only what you have, it is not as straight forward and elegant as you probably hope... And it by no means is "indefinitely" into the future.

 

 

 

 

 

 

SAS Super FREQ
Posts: 3,755

Re: Logic Problem!!

Posted in reply to mahler_ji

In general, this is an impossible problem because no logic will ever be able to determine whether

191211

is NOV 1912 or DEC 11, 1919.

Similarly, you can't tell if

200611

means NOV 2006 or May 11, 2020  (or maybe 1920?).

 

If you know that your YEAR will always be bigger than 1920 and less than 2020, there might be hope.

Frequent Contributor
Posts: 101

Re: Logic Problem!!

This is exactly the problem that I am having.  Once we get to 2020, my logic falls apart.  The logic that I have is wrapped up in very complex macros, but it looks like this:

 

%let inMonthIndex = %eval(&InYearIndex. + 4);
%let inDayIndex = %eval(&InYearIndex. + 6);


data file_list;
set file_list;
format QueryDate DATE9.;
informat QueryDate DATE9.;
format DataDate DATE9.;
informat DataDate DATE9.;
QueryDate = Date+&InQueryDateLag.;
MonthNum = substr(filename,&InMonthIndex.,2)*1;
DayNum = substr(filename,&InDayIndex.,2)*1;

/*HOKEY WAY TO DEAL WITH FILES THAT DON'T HAVE A MONTH AND DAY in FILENAME*/
if MonthNum < 1 or MonthNum > 12 then Month = "01"; else Month = substr(filename,&InMonthIndex.,2) ;
if DayNum < 1 or DayNum > 31 then Day = "01"; else Day = substr(filename,&InDayIndex.,2);

DataDate = mdy(Month,Day,substr(filename,&InYearIndex.,4));
is6digit = 0;
if DataDate = . or year(datadate) < 1913 then do;
DataDate = mdy( substr(filename,&InMonthIndex.-2,2),substr(filename,&InDayIndex.-2,2),substr(filename,&InYearIndex.,2) );
is6digit = 1;
end;
drop DayNum MonthNum Day month;
run;

 

Obviously I assign the macro variable YearIndex earlier in the macro (this is to deal with situations where there are prefixes in front of the year, such as JM20010101, in which case the year index would be 3)

 

You can see that I intentionally create a null variable for dates coming before 2016, and then use the hard coded limit of 1900 to stop sas from thinking that I am talking about something in the middle ages.

 

That is why, when 2020 hits, this logic will fail.

 

 

 

SAS Super FREQ
Posts: 3,755

Re: Logic Problem!!

Posted in reply to mahler_ji

Why don't you describe the problem you are trying to solve? There might be better approaches. For example, reading the timestamp on the file itself rather than trying to deduce it from a file name.

Frequent Contributor
Posts: 101

Re: Logic Problem!!

I am sorry if I am not being clear, but I am dealing with financial data where the timestamp on the file rarely ever matches up with the date that actually corresponds to the data.  Thus, I am having to use the filename.

 

The problem that I have is, I have a bunch of different sources of data, all use different filenames, each contain the date in one of the three formats I mentioned.  Only those three formats are present.

 

I just need to find a way to deduce what form the date is in.  

Super User
Posts: 19,850

Re: Logic Problem!!

Posted in reply to mahler_ji

Can you identify the date format type based on the source? All of the sources use three different formats? 

Frequent Contributor
Posts: 101

Re: Logic Problem!!

[ Edited ]

Hey @Reeza,

 

Yes, I could do that, but this is a macro that I want to be flexible enough to be used for all data sources that we currently have, as well as those that we will have in the future.  So I cannot hard code in the specific identifying characteristics of each dataset in order to parse the date.  This macro is widely used in our code structure, so it needs to be very flexible in handling filenames, as long as the date is in one of those three forms.

 

I am sorry, I know this is hard, but i have been thinking about it for a long time, so I thought I would bring it to you all for some help.

 

 

Super User
Posts: 19,850

Re: Logic Problem!!

Posted in reply to mahler_ji

As mentioned by others, I don't think you'll find a robust solution enough to handle all your scenarios. I would assume that there's something else in the file name that would indicate the format expected which should clarify the rest of your code.

 

 

Super User
Posts: 11,343

Re: Logic Problem!!

Posted in reply to mahler_ji

Are there dates in the content of the file that are better formed? Since you plan on putting this into a macro you could read any of the input data sets into a temporary set and then pull date information form the content to use in naming the output set in a consistent manner.

 

Also depending upon your particular ACTUAL data needs, the year problem might be set by used of the YEARCUTOFF system option. Your current non-specified issue with 2020 could be arising from the current default setting of yearcutoff=1920, which means any 2 digit year from 20 to 99 is assumed to be 1920-1999 with 00 to 19 as 2000 to 2019. A value of Yearcutoff=1950 means 50-99 would be 1950-1999 and 00-49 would be treated as 2000 to 2049.

If you have a suspicion about the data sets to be processed you can fine tune that for some datasets and then change for the next set. Add code like

options yearcutoff=1900;

to process all 2 digit years as 19xx.

and then change to another value when processing later files.

 

And I would not spend a LOT of time trying to make any of this specifically useful for more than about 50 years into the future as there will be other complications to arise by then. Maybe by then we get the folks in the financial (and other sectors) to stop using 2 digit years entirely....

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 18 replies
  • 610 views
  • 9 likes
  • 7 in conversation