Hi guys,
suppose to have the following:
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 02JAN2015 02JAN2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
0002 12JUN2017 18JUN2017 vacation
...;
I would like to remove records having the same (!) "absence reason" when the period falls into a larger one. For example for 0001, the row 02JAN2015 - 02JAN2015 should be removed because already in 01JAN2015 - 06FEB2015. The same for 0002 relative to 12JUN2017 - 18JUN2017. It is clearly an error in the file I received to perform the analysis.
Desired output:
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
...;
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason $20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 02JAN2015 02JAN2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
0002 12JUN2017 18JUN2017 vacation
;
data temp;
set db1;
do date=start to end;
output;
end;
drop start end;
format date date9.;
run;
proc sort data=temp out=temp2 nodupkey;
by id date Absence_reason;
run;
data temp3;
set temp2;
if id ne lag(id) or date ne lag(date)+1 or Absence_reason ne lag(Absence_reason) then group+1;
run;
proc sql;
create table want as
select group,max(id) as id,min(date) as start format=date9.,max(date) as end format=date9.,max(Absence_reason) as Absence_reason
from temp3
group by group;
quit;
Is the problem always that there are two consecutive lines and one needs to be removed? Could there ever be three consecutive or non-consecutive lines where some need to be removed?
I don't understand the result for ID 0002, they two records overlap, why are they not combined?
I don't understand your questions. Data periods should not be combined, but only removed when they are sub periods of a greater one(s) by the same absence reason. The records relative to 0002 overlap and this is the reason why I would like to remove the shortest one. Unfortunately it is a mistake made by the person that annotated the absence for the worker 0002.
Is it always two consecutive records where one needs to be removed? Could there ever be three records (consecutive or non-consecutive) where records need to be removed?
When there is partial overlap such as ID 0002, you want to keep the longest and remove the shortest?
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason $20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation
0001 02JAN2015 02JAN2015 vacation
0001 13APR2015 31DEC2015 sick leave
0002 01JAN2017 12JUL2017 vacation
0002 12JUN2017 18JUN2017 vacation
;
data temp;
set db1;
do date=start to end;
output;
end;
drop start end;
format date date9.;
run;
proc sort data=temp out=temp2 nodupkey;
by id date Absence_reason;
run;
data temp3;
set temp2;
if id ne lag(id) or date ne lag(date)+1 or Absence_reason ne lag(Absence_reason) then group+1;
run;
proc sql;
create table want as
select group,max(id) as id,min(date) as start format=date9.,max(date) as end format=date9.,max(Absence_reason) as Absence_reason
from temp3
group by group;
quit;
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.