10-11-2017 09:44 PM
I currently have a set of 16 start dates and 16 end dates, per observation. For the sake of this example let's pretend that it is 3 repeating variables, a start time when a pay period began and an end time, when a pay period ended:
What I am interested in doing is collapsing end dates and the subsequent start date when they are less than 30 days apart:
start_1: Jan 1 end_1: Jan 15 start_2: Jan 31 end_2: Feb 16 start_3: May 1 end_3: May 12
This would become:
new_start_1: Jan 1 new_end_1: Feb 16 new_start_2: May 1 new_end_2: May 12
But, simultaneously, there could be another observation that looks like this, where all of the distances are > or equal to 30:
start_1: Jan 1 end_1: Jan 15 start_2: Feb 31 end_2: Mar 16 start_3: May 1 end_3: May 12
This would essentially stay the same, but to maintain uniformity would need to be transferred to the new variable set:
new_start_1: Jan 1 new_end_1: Jan 15 new_start_2: Feb 31 new_end_2: Mar 16 new_start_3: May 1 3 new_end_3: May 12
Some additional points:
- This data was a created flatfile, so the 16th set of start and end variables is for the person that had the most pay periods. Some individuals have only start_1 and end_1 filled and the remaining are missing.
- All of these dates ascend chronologically.
What do you recommend as my most efficient way to do this?
10-12-2017 02:41 AM
You should start by transposing into a long format, so you have individual records with only one start and end. Then it's only a sequential scan and using the lag() function to compare with the previous record. Keeping data in a long format removes all requirements for knowing the maximum number of periods, and it frees all the space used by missing values.
I even guess that your data was in a long format somewhere along the way.
10-12-2017 06:51 AM
10-12-2017 07:26 AM - edited 10-12-2017 07:31 AM
This is the code that I would use:
data want; set have (rename=(start=_start end=_end)); /* rename start and end to get them out of the way */ by id; retain start; /* keep new start variable across observations */ format start end date9.; oldend = lag(_end); /* do this before everything else, so lag() is never called conditionally */ if first.id then start = _start; /* initialize */ else do; if _start - oldend ge 30 then do; end = oldend; /* take end from previous record */ output; start = _start; /* initialize again */ end; end; if last.id then do; /* cleanup at end of id group */ end = _end; output; end; drop _start _end oldend; /* get rid of old and intermediate variables */ run;
Edit: changed the comparison to ge instead of just >