I need to consolidate observations based on consecutive validity ranges.
Here is a reproducible minimal example:
data have;
infile datalines4 delimiter=",";
input id $ start :date9. end :date9. var1 var2 var3;
format start end date9.;
datalines4;
ABC1,01DEC2020,21FEB2021,1,0,1
ABC1,22FEB2021,10MAR2021,1,0,1
ABC1,11MAR2021,31DEC9999,1,1,1
DEF2,11DEC2020,10JAN2021,0,0,0
DEF2,11JAN2021,19JAN2021,1,1,1
DEF2,20JAN2021,02FEB2021,1,1,1
DEF2,03FEB2021,10MAR2025,1,0,1
;;;;
quit;
For each id, if the validity dates are consecutive (i.e. end_date + 1day
of the n-1
observation is equal to the start_date
of the n
observation) and if they share the same value for all three variables (var1, var2 and var3), I want to consolidate those multiple observations into a single one.
What I would like to have:
Any kind of approach (array/hash/sql) would be appreciated !
data HAVE;
infile datalines4 delimiter=",";
input ID $ START :date9. END :date9. VAR1 VAR2 VAR3;
format START END date9.;
datalines4;
ABC1,01DEC2020,21FEB2021,1,0,1
ABC1,22FEB2021,10MAR2021,1,0,1
ABC1,11MAR2021,31DEC9999,1,1,1
DEF2,11DEC2020,10JAN2021,0,0,0
DEF2,11JAN2021,19JAN2021,1,1,1
DEF2,20JAN2021,01FEB2021,1,1,1
DEF2,02FEB2021,02FEB2021,1,1,1
DEF2,03FEB2021,10MAR2025,1,0,1
;;;;
run;
data temp;
set have;
do date=start to end;
output;
end;
drop start end;
format date date9.;
run;
proc sort data=temp out=temp1 nodupkey;
by id VAR1 VAR2 VAR3 date;
run;
data temp2;
set temp1;
by id VAR1 VAR2 VAR3;
if first.var3 or dif(date) ne 1 then group+1;
run;
proc summary data=temp2;
by id VAR1 VAR2 VAR3;
var date;
output out=want min=start max=end;
run;
Like this?
data HAVE;
infile datalines4 delimiter=",";
input ID $ START :date9. END :date9. VAR1 VAR2 VAR3;
format START END date9.;
datalines4;
ABC1,01DEC2020,21FEB2021,1,0,1
ABC1,22FEB2021,10MAR2021,1,0,1
ABC1,11MAR2021,31DEC9999,1,1,1
DEF2,11DEC2020,10JAN2021,0,0,0
DEF2,11JAN2021,19JAN2021,1,1,1
DEF2,20JAN2021,01FEB2021,1,1,1
DEF2,02FEB2021,02FEB2021,1,1,1
DEF2,03FEB2021,10MAR2025,1,0,1
run;
data WANT;
set HAVE end=LASTOBS;
if ^LASTOBS then set HAVE
(firstobs=2
rename=(ID=_ID START=_START END=_END VAR1=_VAR1 VAR2=_VAR2 VAR3=_VAR3)) ;
_SAMEVALUES= ( ID=_ID & VAR1=_VAR1 & VAR2=_VAR2 & VAR3=_VAR3);
if LASTOBS | ^_SAMEVALUES then do;
START=coalesce(_S,START);
output;
_S=.;
end;
else if ^_S then _S=START;
retain _S;
drop _:;
run;
ID | START | END | VAR1 | VAR2 | VAR3 |
---|---|---|---|---|---|
ABC1 | 01DEC2020 | 10MAR2021 | 1 | 0 | 1 |
ABC1 | 11MAR2021 | 31DEC9999 | 1 | 1 | 1 |
DEF2 | 11DEC2020 | 10JAN2021 | 0 | 0 | 0 |
DEF2 | 11JAN2021 | 02FEB2021 | 1 | 1 | 1 |
DEF2 | 03FEB2021 | 10MAR2025 | 1 | 0 | 1 |
data HAVE;
infile datalines4 delimiter=",";
input ID $ START :date9. END :date9. VAR1 VAR2 VAR3;
format START END date9.;
datalines4;
ABC1,01DEC2020,21FEB2021,1,0,1
ABC1,22FEB2021,10MAR2021,1,0,1
ABC1,11MAR2021,31DEC9999,1,1,1
DEF2,11DEC2020,10JAN2021,0,0,0
DEF2,11JAN2021,19JAN2021,1,1,1
DEF2,20JAN2021,01FEB2021,1,1,1
DEF2,02FEB2021,02FEB2021,1,1,1
DEF2,03FEB2021,10MAR2025,1,0,1
;;;;
run;
data temp;
set have;
do date=start to end;
output;
end;
drop start end;
format date date9.;
run;
proc sort data=temp out=temp1 nodupkey;
by id VAR1 VAR2 VAR3 date;
run;
data temp2;
set temp1;
by id VAR1 VAR2 VAR3;
if first.var3 or dif(date) ne 1 then group+1;
run;
proc summary data=temp2;
by id VAR1 VAR2 VAR3;
var date;
output out=want min=start max=end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.