To the clever SAS users of the world, I need your help. The solution might come to me on Monday but nothing was coming to mind by the end of a long week.
My problem: I have HR data with records showing an employees’ duration in a given position. Typically a person is assigned the role for a long or indefinite period, but when the person is away, another employee might be assigned the position for a short period.
HAVE: | WANT: | |||||||
start_date | end_date | position | employee | start_date | end_date | position | employee | |
1-Sep-14 | 31-Dec-99 | 348 | 574 | 1-Sep-14 | 20-Mar-16 | 348 | 574 | |
21-Mar-16 | 6-Jun-16 | 348 | 635 | 21-Mar-16 | 6-Jun-16 | 348 | 635 | |
7-Jun-16 | 31-Dec-99 | 348 | 574 | |||||
1-Aug-13 | 31-Dec-99 | 554 | 963 | 1-Aug-13 | 20-Jul-14 | 554 | 963 | |
21-Jul-14 | 8-Jul-16 | 554 | 580 | 21-Jul-14 | 1-Nov-15 | 554 | 580 | |
2-Nov-15 | 27-Nov-15 | 554 | 78 | 2-Nov-15 | 27-Nov-15 | 554 | 78 | |
28-Nov-15 | 8-Jul-16 | 554 | 580 | |||||
9-Jul-16 | 31-Dec-99 | 554 | 963 | |||||
3-Sep-09 | 4-Apr-15 | 742 | 830 | 3-Sep-09 | 4-Apr-15 | 742 | 830 | |
21-May-15 | 24-Jun-15 | 742 | 38 | 21-May-15 | 10-Jun-15 | 742 | 38 | |
11-Jun-15 | 31-Dec-99 | 742 | 313 | 11-Jun-15 | 31-Dec-99 | 742 | 313 | |
23-Nov-15 | 11-Dec-15 | 751 | 44 | 23-Nov-15 | 11-Dec-15 | 751 | 44 | |
12-Dec-15 | 31-Dec-99 | 751 | 125 | 12-Dec-15 | 7-Apr-16 | 751 | 125 | |
8-Apr-16 | 22-Apr-16 | 751 | 44 | 8-Apr-16 | 22-Apr-16 | 751 | 44 | |
23-Apr-16 | 31-Dec-99 | 751 | 125 | |||||
10-Mar-16 | 23-Mar-16 | 231 | 67 | 10-Mar-16 | 23-Mar-16 | 231 | 67 | |
24-Mar-16 | 6-Apr-16 | 231 | 723 | 24-Mar-16 | 6-Apr-16 | 231 | 723 | |
7-Apr-16 | 31-Dec-99 | 231 | 32 | 7-Apr-16 | 20-Apr-16 | 231 | 839 | |
7-Apr-16 | 20-Apr-16 | 231 | 839 | 21-Apr-16 | 31-Dec-99 | 231 | 32 |
I want to convert these overlapping records into a single time series so it’s clear who was/is in a position on a given date.
The situations can become complicated. And ambiguous. The example data illustrates the appropriate assumptions.
I didn't think it would be that difficult but I soon realised first. last. processing probably won't do. Using the LAG function looked impractical, and dangerous, when mixed with too many IF-THEN-ELSE statements. Look-ahead processing with extra SET statements might be the way. And maybe a little array processing could be handy. Perhaps Proc FCMP and a recursive routine. I can even imagine a SQL solution - from an SQL guru.
Is anyone up to the challange? I'm taking a break.
TIA
That is really not easy. There are too many scenarios you need to consider about. I listed all the scenarios, you need to pick up which one you should keep. I love this question. NOTE: I convert 31-Dec-99 into today() , you can convert it back if you need to. data have; input start_date : date9. end : $20. position employee ; if upcase(end)='31-DEC-99' then end_date=today(); else end_date=input(end,date9.); drop end; format start_date end_date date9.; cards; 1-Sep-14 31-Dec-99 348 574 21-Mar-16 6-Jun-16 348 635 1-Aug-13 31-Dec-99 554 963 21-Jul-14 8-Jul-16 554 580 2-Nov-15 27-Nov-15 554 78 3-Sep-09 4-Apr-15 742 830 21-May-15 24-Jun-15 742 38 11-Jun-15 31-Dec-99 742 313 23-Nov-15 11-Dec-15 751 44 12-Dec-15 31-Dec-99 751 125 8-Apr-16 22-Apr-16 751 44 10-Mar-16 23-Mar-16 231 67 24-Mar-16 6-Apr-16 231 723 7-Apr-16 31-Dec-99 231 32 7-Apr-16 20-Apr-16 231 839 ; run; data temp; set have; do date=start_date to end_date; output; end; keep date position employee; format date date9.; run; proc freq data=have noprint; tables employee/out=employee(keep=employee); run; data _null_; set employee end=last; if _n_=1 then call execute('data all;merge '); call execute(catt('temp(where=(emp_',employee,'=',employee, ') rename=(employee=emp_',employee,'))')); if last then call execute(';by position date;run;'); run; proc transpose data=all(keep=emp_: obs=0) out=vname; run; options missing=' '; data _null_; set vname end=last; if _n_=1 then call execute('data all;length employees $ 20000;retain employees;set all;if '); call execute(catt(_name_,' ne lag(',_name_,')')); if last then call execute(' then do;group+1;employees=catx("|",of emp_:);end;run;'); else call execute(' or '); run; data want; set all; by group; retain start_date; if first.group then start_date=date; if last.group then do; end_date=date;output; end; format start_date end_date date9.; drop emp_: date; run; proc print data=want noobs; run;
Treat it as medicine cabinet problem.
Take each record and create a record for every day.
Later employees take priority, when duplicated.
Then use first/last to create your want dataset.
A remark to @Reeza's answer - you'll need to sort derived data
by: position start_date before selecting the last.start_date
That is really not easy. There are too many scenarios you need to consider about. I listed all the scenarios, you need to pick up which one you should keep. I love this question. NOTE: I convert 31-Dec-99 into today() , you can convert it back if you need to. data have; input start_date : date9. end : $20. position employee ; if upcase(end)='31-DEC-99' then end_date=today(); else end_date=input(end,date9.); drop end; format start_date end_date date9.; cards; 1-Sep-14 31-Dec-99 348 574 21-Mar-16 6-Jun-16 348 635 1-Aug-13 31-Dec-99 554 963 21-Jul-14 8-Jul-16 554 580 2-Nov-15 27-Nov-15 554 78 3-Sep-09 4-Apr-15 742 830 21-May-15 24-Jun-15 742 38 11-Jun-15 31-Dec-99 742 313 23-Nov-15 11-Dec-15 751 44 12-Dec-15 31-Dec-99 751 125 8-Apr-16 22-Apr-16 751 44 10-Mar-16 23-Mar-16 231 67 24-Mar-16 6-Apr-16 231 723 7-Apr-16 31-Dec-99 231 32 7-Apr-16 20-Apr-16 231 839 ; run; data temp; set have; do date=start_date to end_date; output; end; keep date position employee; format date date9.; run; proc freq data=have noprint; tables employee/out=employee(keep=employee); run; data _null_; set employee end=last; if _n_=1 then call execute('data all;merge '); call execute(catt('temp(where=(emp_',employee,'=',employee, ') rename=(employee=emp_',employee,'))')); if last then call execute(';by position date;run;'); run; proc transpose data=all(keep=emp_: obs=0) out=vname; run; options missing=' '; data _null_; set vname end=last; if _n_=1 then call execute('data all;length employees $ 20000;retain employees;set all;if '); call execute(catt(_name_,' ne lag(',_name_,')')); if last then call execute(' then do;group+1;employees=catx("|",of emp_:);end;run;'); else call execute(' or '); run; data want; set all; by group; retain start_date; if first.group then start_date=date; if last.group then do; end_date=date;output; end; format start_date end_date date9.; drop emp_: date; run; proc print data=want noobs; run;
Thanks @Reeza, @Shmuel and @Ksharp.
It should have been obvious (to me) that the obvious solution just needed to break down the data to individual days.
@Ksharp I appreciate the effort you put into your solution, so you have to get the Accepted Solution. (Glad you enjoyed the challange )
I figured though, that I could assume the latest start_date takes precedence so could manage with a slightly simpler solution.
Thanks again everyone.
data have;
input start_date : date9. end : $20. position employee ;
if upcase(end)='31-DEC-99' then end_date=today();
else end_date=input(end,date9.);
drop end;
format start_date end_date date9.;
cards;
1-Sep-15 31-Dec-99 348 574
21-Mar-16 6-Jun-16 348 635
1-Aug-13 31-Dec-99 554 963
21-Jul-14 8-Jul-16 554 580
2-Nov-15 27-Nov-15 554 78
3-Sep-14 4-Apr-15 742 830
9-Apr-15 28-Apr-15 742 830
21-May-15 24-Jun-15 742 38
11-Jun-15 31-Dec-99 742 313
23-Nov-15 11-Dec-15 751 44
12-Dec-15 31-Dec-99 751 125
8-Apr-16 22-Apr-16 751 44
10-Mar-16 23-Mar-16 231 67
24-Mar-16 6-Apr-16 231 723
7-Apr-16 31-Dec-99 231 32
7-Apr-16 20-Apr-16 231 839
;
run;
data temp;
set have;
do date=start_date to end_date;
output;
end;
keep position date employee start_date end_date;
format date date9.;
run;
proc sort ;
by position date descending start_date end_date;
run;
data temp2;
set temp;
by position date descending start_date ;
if first.date;
run;
data want;
keep position start_dt end_dt emp;
set temp2;
by position ;
retain position emp last_emp last_date start_dt end_dt;
format start_dt end_dt date9.;
if first.position then
do;
last_emp = employee;
start_dt= date;
last_date = date;
if last.position then
do;
emp = employee;
end_dt = date;
output;
end;
else
return;
end;
if employee = last_emp and date = last_date+1 then
last_date = date;
else
link output_rec;
if last.position then
link output_rec;
return;
output_rec:
emp = last_emp;
end_dt = last_date;
output;
last_emp = employee;
last_date = date;
start_dt = date;
return;
run;
Just to display an alternative checked code, maybe easier to understand.
On my platform 31_dec_99 it is interpreted as 31DEC1999 therefore today() was prefered .
For my covenient i used ddmmyy10. format instead date9. - you may change it.
data have;
input start_date : date9. end_date : date9.
position employee ;
if end_date = '31DEC99'd then end_date=today();
format start_date end_date date9.;
cards;
1-Sep-14 31-Dec-99 348 574
21-Mar-16 6-Jun-16 348 635
1-Aug-13 31-Dec-99 554 963
21-Jul-14 8-Jul-16 554 580
2-Nov-15 27-Nov-15 554 78
3-Sep-09 4-Apr-15 742 830
21-May-15 24-Jun-15 742 38
11-Jun-15 31-Dec-99 742 313
23-Nov-15 11-Dec-15 751 44
12-Dec-15 31-Dec-99 751 125
8-Apr-16 22-Apr-16 751 44
10-Mar-16 23-Mar-16 231 67
24-Mar-16 6-Apr-16 231 723
7-Apr-16 31-Dec-99 231 32
7-Apr-16 20-Apr-16 231 839
;
run;
proc sort data=have out=temp1;
by position start_date employee;
run;
data temp2(keep=position date priority employee);
set temp1;
by position start_date employee;
retain priority;
if first.position then priority = 0;
if first.employee then priority+1;
do date=start_date to end_date; output; end;
format date ddmmyy10.;
run;
proc sort data=temp2 out=temp3;
by position date priority ;
run;
data temp4(drop=priority);
set temp3;
by position notsorted date;
if last.date;
run;
data want(drop=date);
set temp4;
by position notsorted employee;
retain start_date;
format start_date end_date ddmmyy10.;
if first.employee then start_date = date;
if last.employee then do;
end_date = date;
output;
end;
run;
Sory for being not free to develop it earlier.
Anyhow it was a chalenge.
Thanks - it's certainly neater.
Slight amendment needed though, to handle when an employee has separate spells in a position, without someone else in between.
I've made small amendments accordingly, highlighted below.
data have;
input start_date : date9. end_date : date9.
position employee ;
if end_date = '31DEC99'd then end_date=today();
format start_date end_date date9.;
cards;
1-Sep-14 31-Dec-99 348 574
21-Mar-16 6-Jun-16 348 635
1-Aug-13 31-Dec-99 554 963
21-Jul-14 8-Jul-16 554 580
2-Nov-15 27-Nov-15 554 78
3-Sep-09 4-Apr-15 742 830
21-May-15 24-Jun-15 742 830
11-Jun-15 31-Dec-99 742 313
23-Nov-15 11-Dec-15 751 44
12-Dec-15 31-Dec-99 751 125
8-Apr-16 22-Apr-16 751 44
10-Mar-16 23-Mar-16 231 67
24-Mar-16 6-Apr-16 231 723
7-Apr-16 31-Dec-99 231 32
7-Apr-16 20-Apr-16 231 839
;
run;
proc sort data=have out=temp1;
by position start_date employee;
run;
data temp2(keep=position date priority employee);
set temp1;
by position start_date employee;
retain priority;
if first.position then priority = 0;
if first.employee then priority+1;
do date=start_date to end_date; output; end;
format date ddmmyy10.;
run;
proc sort data=temp2 out=temp3;
by position date priority ;
run;
data temp4 /* (drop=priority) */;
set temp3;
by position notsorted date;
if last.date;
run;
data want(drop=date priority);
set temp4;
by position notsorted employee priority;
retain start_date;
format start_date end_date ddmmyy10.;
if first.priority then start_date = date;
if last.priority then do;
end_date = date;
output;
end;
run;
Thanks again.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.