BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
NewUsrStat
Lapis Lazuli | Level 10

Hi guys, 

suppose to have the following: 

 

 

data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation 
0001 02JAN2015 02JAN2015 vacation 
0001 13APR2015 31DEC2015 sick leave 
0002 01JAN2017 12JUL2017 vacation 
0002 12JUN2017 18JUN2017 vacation 
...;

 

I would like to remove records having the same (!)  "absence reason" when the period falls into a larger one. For example for 0001, the row 02JAN2015 -  02JAN2015 should be removed because already in 01JAN2015 - 06FEB2015. The same for 0002 relative to 12JUN2017  -  18JUN2017. It is clearly an error in the file I received to perform the analysis. 

 

Desired output: 

 

 

data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason :$20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation 
0001 13APR2015 31DEC2015 sick leave 
0002 01JAN2017 12JUL2017 vacation 
...;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason $20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation 
0001 02JAN2015 02JAN2015 vacation 
0001 13APR2015 31DEC2015 sick leave 
0002 01JAN2017 12JUL2017 vacation 
0002 12JUN2017 18JUN2017 vacation 
;
data temp;
 set db1;
do date=start to end;
 output;
end;
drop start end;
format date date9.;
run;
proc sort data=temp out=temp2 nodupkey;
by id date Absence_reason;
run;
data temp3;
 set temp2;
if id ne lag(id) or date ne lag(date)+1 or Absence_reason ne lag(Absence_reason) then group+1;
run;
proc sql;
create table want as
select group,max(id) as id,min(date) as start format=date9.,max(date) as end format=date9.,max(Absence_reason) as Absence_reason
 from temp3
  group by group;
quit;

View solution in original post

6 REPLIES 6
PaigeMiller
Diamond | Level 26

Is the problem always that there are two consecutive lines and one needs to be removed? Could there ever be three consecutive or non-consecutive lines where some need to be removed?

 

I don't understand the result for ID 0002, they two records overlap, why are they not combined?

--
Paige Miller
NewUsrStat
Lapis Lazuli | Level 10

I don't understand your questions. Data periods should not be combined, but only removed when they are sub periods of a greater one(s) by the same absence reason. The records relative to 0002 overlap and this is the reason why I would like to remove the shortest one. Unfortunately it is a mistake made by the person that annotated the absence for the worker 0002.

PaigeMiller
Diamond | Level 26

Is it always two consecutive records where one needs to be removed? Could there ever be three records (consecutive or non-consecutive) where records need to be removed?

 

When there is partial overlap such as ID 0002, you want to keep the longest and remove the shortest?

--
Paige Miller
NewUsrStat
Lapis Lazuli | Level 10
So: "Could there ever be 3 records....?" Yes.
"When there is partial overlap....?" Yes, the longest is the informative one I would like to retain.
Ksharp
Super User
data DB1;
input ID :$20. (Start End)(:date9.) Absence_reason $20.;
format Start End date9.;
cards;
0001 01JAN2015 06FEB2015 vacation 
0001 02JAN2015 02JAN2015 vacation 
0001 13APR2015 31DEC2015 sick leave 
0002 01JAN2017 12JUL2017 vacation 
0002 12JUN2017 18JUN2017 vacation 
;
data temp;
 set db1;
do date=start to end;
 output;
end;
drop start end;
format date date9.;
run;
proc sort data=temp out=temp2 nodupkey;
by id date Absence_reason;
run;
data temp3;
 set temp2;
if id ne lag(id) or date ne lag(date)+1 or Absence_reason ne lag(Absence_reason) then group+1;
run;
proc sql;
create table want as
select group,max(id) as id,min(date) as start format=date9.,max(date) as end format=date9.,max(Absence_reason) as Absence_reason
 from temp3
  group by group;
quit;
NewUsrStat
Lapis Lazuli | Level 10
Thank you very much! It works perfectly!