To the clever SAS users of the world, I need your help. The solution might come to me on Monday but nothing was coming to mind by the end of a long week.
My problem: I have HR data with records showing an employees’ duration in a given position. Typically a person is assigned the role for a long or indefinite period, but when the person is away, another employee might be assigned the position for a short period.
| HAVE: | WANT: | |||||||
| start_date | end_date | position | employee | start_date | end_date | position | employee | |
| 1-Sep-14 | 31-Dec-99 | 348 | 574 | 1-Sep-14 | 20-Mar-16 | 348 | 574 | |
| 21-Mar-16 | 6-Jun-16 | 348 | 635 | 21-Mar-16 | 6-Jun-16 | 348 | 635 | |
| 7-Jun-16 | 31-Dec-99 | 348 | 574 | |||||
| 1-Aug-13 | 31-Dec-99 | 554 | 963 | 1-Aug-13 | 20-Jul-14 | 554 | 963 | |
| 21-Jul-14 | 8-Jul-16 | 554 | 580 | 21-Jul-14 | 1-Nov-15 | 554 | 580 | |
| 2-Nov-15 | 27-Nov-15 | 554 | 78 | 2-Nov-15 | 27-Nov-15 | 554 | 78 | |
| 28-Nov-15 | 8-Jul-16 | 554 | 580 | |||||
| 9-Jul-16 | 31-Dec-99 | 554 | 963 | |||||
| 3-Sep-09 | 4-Apr-15 | 742 | 830 | 3-Sep-09 | 4-Apr-15 | 742 | 830 | |
| 21-May-15 | 24-Jun-15 | 742 | 38 | 21-May-15 | 10-Jun-15 | 742 | 38 | |
| 11-Jun-15 | 31-Dec-99 | 742 | 313 | 11-Jun-15 | 31-Dec-99 | 742 | 313 | |
| 23-Nov-15 | 11-Dec-15 | 751 | 44 | 23-Nov-15 | 11-Dec-15 | 751 | 44 | |
| 12-Dec-15 | 31-Dec-99 | 751 | 125 | 12-Dec-15 | 7-Apr-16 | 751 | 125 | |
| 8-Apr-16 | 22-Apr-16 | 751 | 44 | 8-Apr-16 | 22-Apr-16 | 751 | 44 | |
| 23-Apr-16 | 31-Dec-99 | 751 | 125 | |||||
| 10-Mar-16 | 23-Mar-16 | 231 | 67 | 10-Mar-16 | 23-Mar-16 | 231 | 67 | |
| 24-Mar-16 | 6-Apr-16 | 231 | 723 | 24-Mar-16 | 6-Apr-16 | 231 | 723 | |
| 7-Apr-16 | 31-Dec-99 | 231 | 32 | 7-Apr-16 | 20-Apr-16 | 231 | 839 | |
| 7-Apr-16 | 20-Apr-16 | 231 | 839 | 21-Apr-16 | 31-Dec-99 | 231 | 32 | |
I want to convert these overlapping records into a single time series so it’s clear who was/is in a position on a given date.
The situations can become complicated. And ambiguous. The example data illustrates the appropriate assumptions.
I didn't think it would be that difficult but I soon realised first. last. processing probably won't do. Using the LAG function looked impractical, and dangerous, when mixed with too many IF-THEN-ELSE statements. Look-ahead processing with extra SET statements might be the way. And maybe a little array processing could be handy. Perhaps Proc FCMP and a recursive routine. I can even imagine a SQL solution - from an SQL guru.
Is anyone up to the challange? I'm taking a break.
TIA
That is really not easy. There are too many scenarios you need to consider about.
I listed all the scenarios, you need to pick up which one you should keep.
I love this question.
NOTE: I convert 31-Dec-99 into today() , you can convert it back if you need to.
data have;
input start_date : date9. end : $20.   position    employee ;
if upcase(end)='31-DEC-99' then end_date=today();
 else end_date=input(end,date9.);
drop end;
format start_date  end_date date9.;
cards;
1-Sep-14    31-Dec-99   348 574                
21-Mar-16   6-Jun-16    348 635                
1-Aug-13    31-Dec-99   554 963                
21-Jul-14   8-Jul-16    554 580                
2-Nov-15    27-Nov-15   554 78                 
3-Sep-09    4-Apr-15    742 830                
21-May-15   24-Jun-15   742 38                 
11-Jun-15   31-Dec-99   742 313                
23-Nov-15   11-Dec-15   751 44                 
12-Dec-15   31-Dec-99   751 125                
8-Apr-16    22-Apr-16   751 44                 
10-Mar-16   23-Mar-16   231 67                 
24-Mar-16   6-Apr-16    231 723                
7-Apr-16    31-Dec-99   231 32                 
7-Apr-16    20-Apr-16   231 839
;
run;
data temp;
 set have;
 do date=start_date to end_date;
  output;
 end;
 keep date position    employee;
 format date date9.;
run;
proc freq data=have noprint;
 tables employee/out=employee(keep=employee);
run;
data _null_;
 set employee end=last;
 if _n_=1 then call execute('data all;merge ');
 call execute(catt('temp(where=(emp_',employee,'=',employee,
 ') rename=(employee=emp_',employee,'))'));
 if last then call execute(';by position date;run;');
run;
proc transpose data=all(keep=emp_: obs=0) out=vname;
run;
options missing=' ';
data _null_;
 set vname end=last;
 if _n_=1 then 
 call execute('data all;length employees $ 20000;retain employees;set all;if '); 
 call execute(catt(_name_,' ne lag(',_name_,')'));
 if last then 
 call execute(' then do;group+1;employees=catx("|",of emp_:);end;run;');
  else call execute(' or ');
run;
data want;
 set all;
 by group;
 retain start_date;
 if first.group then start_date=date;
 if last.group then do;
  end_date=date;output;
 end;
 format start_date end_date date9.;
 drop emp_: date;
run;
proc print data=want noobs;
run;
  
					
				
			
			
				
			
			
			
				
			
			
			
			
			
		Treat it as medicine cabinet problem.
Take each record and create a record for every day.
Later employees take priority, when duplicated.
Then use first/last to create your want dataset.
A remark to @Reeza's answer - you'll need to sort derived data
by: position start_date before selecting the last.start_date
That is really not easy. There are too many scenarios you need to consider about.
I listed all the scenarios, you need to pick up which one you should keep.
I love this question.
NOTE: I convert 31-Dec-99 into today() , you can convert it back if you need to.
data have;
input start_date : date9. end : $20.   position    employee ;
if upcase(end)='31-DEC-99' then end_date=today();
 else end_date=input(end,date9.);
drop end;
format start_date  end_date date9.;
cards;
1-Sep-14    31-Dec-99   348 574                
21-Mar-16   6-Jun-16    348 635                
1-Aug-13    31-Dec-99   554 963                
21-Jul-14   8-Jul-16    554 580                
2-Nov-15    27-Nov-15   554 78                 
3-Sep-09    4-Apr-15    742 830                
21-May-15   24-Jun-15   742 38                 
11-Jun-15   31-Dec-99   742 313                
23-Nov-15   11-Dec-15   751 44                 
12-Dec-15   31-Dec-99   751 125                
8-Apr-16    22-Apr-16   751 44                 
10-Mar-16   23-Mar-16   231 67                 
24-Mar-16   6-Apr-16    231 723                
7-Apr-16    31-Dec-99   231 32                 
7-Apr-16    20-Apr-16   231 839
;
run;
data temp;
 set have;
 do date=start_date to end_date;
  output;
 end;
 keep date position    employee;
 format date date9.;
run;
proc freq data=have noprint;
 tables employee/out=employee(keep=employee);
run;
data _null_;
 set employee end=last;
 if _n_=1 then call execute('data all;merge ');
 call execute(catt('temp(where=(emp_',employee,'=',employee,
 ') rename=(employee=emp_',employee,'))'));
 if last then call execute(';by position date;run;');
run;
proc transpose data=all(keep=emp_: obs=0) out=vname;
run;
options missing=' ';
data _null_;
 set vname end=last;
 if _n_=1 then 
 call execute('data all;length employees $ 20000;retain employees;set all;if '); 
 call execute(catt(_name_,' ne lag(',_name_,')'));
 if last then 
 call execute(' then do;group+1;employees=catx("|",of emp_:);end;run;');
  else call execute(' or ');
run;
data want;
 set all;
 by group;
 retain start_date;
 if first.group then start_date=date;
 if last.group then do;
  end_date=date;output;
 end;
 format start_date end_date date9.;
 drop emp_: date;
run;
proc print data=want noobs;
run;
  
					
				
			
			
				
			
			
			
			
			
			
			
		Thanks @Reeza, @Shmuel and @Ksharp.
It should have been obvious (to me) that the obvious solution just needed to break down the data to individual days.
@Ksharp I appreciate the effort you put into your solution, so you have to get the Accepted Solution. (Glad you enjoyed the challange  )
 )
I figured though, that I could assume the latest start_date takes precedence so could manage with a slightly simpler solution.
Thanks again everyone.
data have;
input start_date : date9. end : $20.   position    employee ;
if upcase(end)='31-DEC-99' then end_date=today();
else end_date=input(end,date9.);
drop end;
format start_date  end_date date9.;
cards;
1-Sep-15    31-Dec-99   348 574               
21-Mar-16   6-Jun-16    348 635               
1-Aug-13    31-Dec-99   554 963               
21-Jul-14   8-Jul-16    554 580               
2-Nov-15    27-Nov-15   554 78                
3-Sep-14    4-Apr-15    742 830               
9-Apr-15    28-Apr-15   742 830               
21-May-15   24-Jun-15   742 38                
11-Jun-15   31-Dec-99   742 313               
23-Nov-15   11-Dec-15   751 44                
12-Dec-15   31-Dec-99   751 125               
8-Apr-16    22-Apr-16   751 44                
10-Mar-16   23-Mar-16   231 67                
24-Mar-16   6-Apr-16    231 723               
7-Apr-16    31-Dec-99   231 32                 
7-Apr-16    20-Apr-16   231 839
;
run;
data temp;
set have;
do date=start_date to end_date;
  output;
end;
keep  position  date  employee start_date end_date;
format date date9.;
run;
proc sort ;
   by position date descending start_date end_date;
run; 
data temp2;
   set temp;
   by position date descending start_date ;
   if first.date;
run;
 
data want;
  keep position start_dt end_dt emp;
  set temp2;
  by position  ;
  retain position emp last_emp last_date start_dt end_dt;
  format start_dt end_dt date9.;
  if first.position then
    do;
     last_emp = employee;
     start_dt= date;
     last_date = date;
     if last.position then
       do;
        emp = employee;
        end_dt = date;
        output;
       end;
     else
       return; 
     end;
 
  if employee = last_emp  and  date = last_date+1 then
     last_date = date;
  else
    link output_rec;
 
  if last.position then
     link output_rec;
return;
 
output_rec:
     emp = last_emp;
     end_dt = last_date;
     output;
     last_emp = employee;
     last_date = date;
     start_dt = date;
return;
 
run;
 
Just to display an alternative checked code, maybe easier to understand.
On my platform 31_dec_99 it is interpreted as 31DEC1999 therefore today() was prefered .
For my covenient i used ddmmyy10. format instead date9. - you may change it.
data have;
input start_date : date9. end_date : date9. 
 position employee ;
if end_date = '31DEC99'd then end_date=today(); 
format start_date end_date date9.;
cards;
1-Sep-14 31-Dec-99 348 574 
21-Mar-16 6-Jun-16 348 635 
1-Aug-13 31-Dec-99 554 963 
21-Jul-14 8-Jul-16 554 580 
2-Nov-15 27-Nov-15 554 78 
3-Sep-09 4-Apr-15 742 830 
21-May-15 24-Jun-15 742 38 
11-Jun-15 31-Dec-99 742 313 
23-Nov-15 11-Dec-15 751 44 
12-Dec-15 31-Dec-99 751 125 
8-Apr-16 22-Apr-16 751 44 
10-Mar-16 23-Mar-16 231 67 
24-Mar-16 6-Apr-16 231 723 
7-Apr-16 31-Dec-99 231 32 
7-Apr-16 20-Apr-16 231 839
;
run;
proc sort data=have out=temp1;
  by position start_date employee;
run;
data temp2(keep=position date priority employee);
 set temp1;
  by position start_date employee; 
       retain priority;
      if first.position then priority = 0;
      if first.employee then priority+1;
      do date=start_date to end_date; output; end;
      format date ddmmyy10.;
run;
proc sort data=temp2 out=temp3;
  by position date priority ;
run;
data temp4(drop=priority);
  set temp3;
   by position notsorted date;
       if last.date;
run;
data want(drop=date);
  set temp4;
   by position notsorted employee;
        retain start_date;
        format start_date end_date ddmmyy10.;
        if first.employee then start_date = date;
        if last.employee then do;
           end_date = date;
           output;
       end;
run; 
Sory for being not free to develop it earlier.
Anyhow it was a chalenge.
Thanks - it's certainly neater.
Slight amendment needed though, to handle when an employee has separate spells in a position, without someone else in between.
I've made small amendments accordingly, highlighted below.
data have;
input start_date : date9. end_date : date9. 
 position employee ;
if end_date = '31DEC99'd then end_date=today(); 
format start_date end_date date9.;
cards;
1-Sep-14 31-Dec-99 348 574 
21-Mar-16 6-Jun-16 348 635 
1-Aug-13 31-Dec-99 554 963 
21-Jul-14 8-Jul-16 554 580 
2-Nov-15 27-Nov-15 554 78 
3-Sep-09 4-Apr-15 742 830 
21-May-15 24-Jun-15 742 830 
11-Jun-15 31-Dec-99 742 313 
23-Nov-15 11-Dec-15 751 44 
12-Dec-15 31-Dec-99 751 125 
8-Apr-16 22-Apr-16 751 44 
10-Mar-16 23-Mar-16 231 67 
24-Mar-16 6-Apr-16 231 723 
7-Apr-16 31-Dec-99 231 32 
7-Apr-16 20-Apr-16 231 839
;
run;
proc sort data=have out=temp1;
  by position start_date employee;
run;
data temp2(keep=position date priority employee);
 set temp1;
  by position start_date employee; 
       retain priority;
      if first.position then priority = 0;
      if first.employee then priority+1;
      do date=start_date to end_date; output; end;
      format date ddmmyy10.;
run;
proc sort data=temp2 out=temp3;
  by position date priority ;
run;
data temp4 /* (drop=priority) */;
  set temp3;
   by position notsorted date;
   if last.date;
run;
data want(drop=date priority);
  set temp4;
   by position notsorted employee priority;
        retain start_date;
        format start_date end_date ddmmyy10.;
        if first.priority then start_date = date;
        if last.priority then do;
           end_date = date;
           output;
       end;
run;
Thanks again.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
