Re: Trying to condense date timeframes in a flatfile

djohn051 · Posted 10-11-2017 09:44 PM

I currently have a set of 16 start dates and 16 end dates, per observation. For the sake of this example let's pretend that it is 3 repeating variables, a start time when a pay period began and an end time, when a pay period ended:

start_1 end_1 start_2 end_2 start_3 end_3

What I am interested in doing is collapsing end dates and the subsequent start date when they are less than 30 days apart:

start_1: Jan 1 end_1: Jan 15 start_2: Jan 31 end_2: Feb 16 start_3: May 1 end_3: May 12

This would become:

new_start_1: Jan 1 new_end_1: Feb 16 new_start_2: May 1 new_end_2: May 12

But, simultaneously, there could be another observation that looks like this, where all of the distances are > or equal to 30:

start_1: Jan 1 end_1: Jan 15 start_2: Feb 31 end_2: Mar 16 start_3: May 1 end_3: May 12

This would essentially stay the same, but to maintain uniformity would need to be transferred to the new variable set:

new_start_1: Jan 1 new_end_1: Jan 15 new_start_2: Feb 31 new_end_2: Mar 16 new_start_3: May 1 3 new_end_3: May 12

Some additional points:

- This data was a created flatfile, so the 16th set of start and end variables is for the person that had the most pay periods. Some individuals have only start_1 and end_1 filled and the remaining are missing.

- All of these dates ascend chronologically.

What do you recommend as my most efficient way to do this?

Kurt_Bremser · Posted 10-12-2017 02:41 AM

You should start by transposing into a long format, so you have individual records with only one start and end. Then it's only a sequential scan and using the lag() function to compare with the previous record. Keeping data in a long format removes all requirements for knowing the maximum number of periods, and it frees all the space used by missing values.

I even guess that your data was in a long format somewhere along the way.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

djohn051 · Posted 10-12-2017 06:51 AM

That sounds much more feasible. Could you help me with writing the formula for that? Essentially my data would now be three columns: (id) (start) (end) - what I would want to do, within each id, is check if the (observation n+1 start date) - (end date) is <30 or GE 30.

If it is GE 30 then (start date) (end date) line is kept and the process is rerun for (observation n+1 start date) by subtracting (observation n+2 start date) - (observation n+1 end date).

If it is <30, then (observation start date) and (observation end date n+1) replaces (observation end date). Of course it would need to check each subsequent (start date n+#) - (end date) until it does find one GE 30 and then all data in between would be removed and, for example, your new line might be (observation start date) with the end date that came from (observation end date n+5).

How would you do this with the lag function?

Thank you!

Kurt_Bremser · Posted 10-12-2017 07:26 AM

This is the code that I would use:

data want;
set have (rename=(start=_start end=_end)); /* rename start and end to get them out of the way */
by id;
retain start; /* keep new start variable across observations */
format start end date9.;
oldend = lag(_end); /* do this before everything else, so lag() is never called conditionally */
if first.id
then start = _start; /* initialize */
else do;
  if _start - oldend ge 30
  then do;
    end = oldend; /* take end from previous record */
    output;
    start = _start; /* initialize again */
  end;
end;
if last.id
then do; /* cleanup at end of id group */
  end = _end;
  output;
end;
drop _start _end oldend; /* get rid of old and intermediate variables */
run;

Edit: changed the comparison to ge instead of just >

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

djohn051 · Posted 10-12-2017 08:59 AM

Thank you very much. I will give this a try!

Trying to condense date timeframes in a flatfile