To the clever SAS users of the world, I need your help. The solution might come to me on Monday but nothing was coming to mind by the end of a long week.
My problem: I have HR data with records showing an employees’ duration in a given position. Typically a person is assigned the role for a long or indefinite period, but when the person is away, another employee might be assigned the position for a short period.
HAVE: | WANT: | |||||||
start_date | end_date | position | employee | start_date | end_date | position | employee | |
1-Sep-14 | 31-Dec-99 | 348 | 574 | 1-Sep-14 | 20-Mar-16 | 348 | 574 | |
21-Mar-16 | 6-Jun-16 | 348 | 635 | 21-Mar-16 | 6-Jun-16 | 348 | 635 | |
7-Jun-16 | 31-Dec-99 | 348 | 574 | |||||
1-Aug-13 | 31-Dec-99 | 554 | 963 | 1-Aug-13 | 20-Jul-14 | 554 | 963 | |
21-Jul-14 | 8-Jul-16 | 554 | 580 | 21-Jul-14 | 1-Nov-15 | 554 | 580 | |
2-Nov-15 | 27-Nov-15 | 554 | 78 | 2-Nov-15 | 27-Nov-15 | 554 | 78 | |
28-Nov-15 | 8-Jul-16 | 554 | 580 | |||||
9-Jul-16 | 31-Dec-99 | 554 | 963 | |||||
3-Sep-09 | 4-Apr-15 | 742 | 830 | 3-Sep-09 | 4-Apr-15 | 742 | 830 | |
21-May-15 | 24-Jun-15 | 742 | 38 | 21-May-15 | 10-Jun-15 | 742 | 38 | |
11-Jun-15 | 31-Dec-99 | 742 | 313 | 11-Jun-15 | 31-Dec-99 | 742 | 313 | |
23-Nov-15 | 11-Dec-15 | 751 | 44 | 23-Nov-15 | 11-Dec-15 | 751 | 44 | |
12-Dec-15 | 31-Dec-99 | 751 | 125 | 12-Dec-15 | 7-Apr-16 | 751 | 125 | |
8-Apr-16 | 22-Apr-16 | 751 | 44 | 8-Apr-16 | 22-Apr-16 | 751 | 44 | |
23-Apr-16 | 31-Dec-99 | 751 | 125 | |||||
10-Mar-16 | 23-Mar-16 | 231 | 67 | 10-Mar-16 | 23-Mar-16 | 231 | 67 | |
24-Mar-16 | 6-Apr-16 | 231 | 723 | 24-Mar-16 | 6-Apr-16 | 231 | 723 | |
7-Apr-16 | 31-Dec-99 | 231 | 32 | 7-Apr-16 | 20-Apr-16 | 231 | 839 | |
7-Apr-16 | 20-Apr-16 | 231 | 839 | 21-Apr-16 | 31-Dec-99 | 231 | 32 |
I want to convert these overlapping records into a single time series so it’s clear who was/is in a position on a given date.
The situations can become complicated. And ambiguous. The example data illustrates the appropriate assumptions.
I didn't think it would be that difficult but I soon realised first. last. processing probably won't do. Using the LAG function looked impractical, and dangerous, when mixed with too many IF-THEN-ELSE statements. Look-ahead processing with extra SET statements might be the way. And maybe a little array processing could be handy. Perhaps Proc FCMP and a recursive routine. I can even imagine a SQL solution - from an SQL guru.
Is anyone up to the challange? I'm taking a break.
TIA
That is really not easy. There are too many scenarios you need to consider about. I listed all the scenarios, you need to pick up which one you should keep. I love this question. NOTE: I convert 31-Dec-99 into today() , you can convert it back if you need to. data have; input start_date : date9. end : $20. position employee ; if upcase(end)='31-DEC-99' then end_date=today(); else end_date=input(end,date9.); drop end; format start_date end_date date9.; cards; 1-Sep-14 31-Dec-99 348 574 21-Mar-16 6-Jun-16 348 635 1-Aug-13 31-Dec-99 554 963 21-Jul-14 8-Jul-16 554 580 2-Nov-15 27-Nov-15 554 78 3-Sep-09 4-Apr-15 742 830 21-May-15 24-Jun-15 742 38 11-Jun-15 31-Dec-99 742 313 23-Nov-15 11-Dec-15 751 44 12-Dec-15 31-Dec-99 751 125 8-Apr-16 22-Apr-16 751 44 10-Mar-16 23-Mar-16 231 67 24-Mar-16 6-Apr-16 231 723 7-Apr-16 31-Dec-99 231 32 7-Apr-16 20-Apr-16 231 839 ; run; data temp; set have; do date=start_date to end_date; output; end; keep date position employee; format date date9.; run; proc freq data=have noprint; tables employee/out=employee(keep=employee); run; data _null_; set employee end=last; if _n_=1 then call execute('data all;merge '); call execute(catt('temp(where=(emp_',employee,'=',employee, ') rename=(employee=emp_',employee,'))')); if last then call execute(';by position date;run;'); run; proc transpose data=all(keep=emp_: obs=0) out=vname; run; options missing=' '; data _null_; set vname end=last; if _n_=1 then call execute('data all;length employees $ 20000;retain employees;set all;if '); call execute(catt(_name_,' ne lag(',_name_,')')); if last then call execute(' then do;group+1;employees=catx("|",of emp_:);end;run;'); else call execute(' or '); run; data want; set all; by group; retain start_date; if first.group then start_date=date; if last.group then do; end_date=date;output; end; format start_date end_date date9.; drop emp_: date; run; proc print data=want noobs; run;
Treat it as medicine cabinet problem.
Take each record and create a record for every day.
Later employees take priority, when duplicated.
Then use first/last to create your want dataset.
A remark to @Reeza's answer - you'll need to sort derived data
by: position start_date before selecting the last.start_date
That is really not easy. There are too many scenarios you need to consider about. I listed all the scenarios, you need to pick up which one you should keep. I love this question. NOTE: I convert 31-Dec-99 into today() , you can convert it back if you need to. data have; input start_date : date9. end : $20. position employee ; if upcase(end)='31-DEC-99' then end_date=today(); else end_date=input(end,date9.); drop end; format start_date end_date date9.; cards; 1-Sep-14 31-Dec-99 348 574 21-Mar-16 6-Jun-16 348 635 1-Aug-13 31-Dec-99 554 963 21-Jul-14 8-Jul-16 554 580 2-Nov-15 27-Nov-15 554 78 3-Sep-09 4-Apr-15 742 830 21-May-15 24-Jun-15 742 38 11-Jun-15 31-Dec-99 742 313 23-Nov-15 11-Dec-15 751 44 12-Dec-15 31-Dec-99 751 125 8-Apr-16 22-Apr-16 751 44 10-Mar-16 23-Mar-16 231 67 24-Mar-16 6-Apr-16 231 723 7-Apr-16 31-Dec-99 231 32 7-Apr-16 20-Apr-16 231 839 ; run; data temp; set have; do date=start_date to end_date; output; end; keep date position employee; format date date9.; run; proc freq data=have noprint; tables employee/out=employee(keep=employee); run; data _null_; set employee end=last; if _n_=1 then call execute('data all;merge '); call execute(catt('temp(where=(emp_',employee,'=',employee, ') rename=(employee=emp_',employee,'))')); if last then call execute(';by position date;run;'); run; proc transpose data=all(keep=emp_: obs=0) out=vname; run; options missing=' '; data _null_; set vname end=last; if _n_=1 then call execute('data all;length employees $ 20000;retain employees;set all;if '); call execute(catt(_name_,' ne lag(',_name_,')')); if last then call execute(' then do;group+1;employees=catx("|",of emp_:);end;run;'); else call execute(' or '); run; data want; set all; by group; retain start_date; if first.group then start_date=date; if last.group then do; end_date=date;output; end; format start_date end_date date9.; drop emp_: date; run; proc print data=want noobs; run;
Thanks @Reeza, @Shmuel and @Ksharp.
It should have been obvious (to me) that the obvious solution just needed to break down the data to individual days.
@Ksharp I appreciate the effort you put into your solution, so you have to get the Accepted Solution. (Glad you enjoyed the challange )
I figured though, that I could assume the latest start_date takes precedence so could manage with a slightly simpler solution.
Thanks again everyone.
data have;
input start_date : date9. end : $20. position employee ;
if upcase(end)='31-DEC-99' then end_date=today();
else end_date=input(end,date9.);
drop end;
format start_date end_date date9.;
cards;
1-Sep-15 31-Dec-99 348 574
21-Mar-16 6-Jun-16 348 635
1-Aug-13 31-Dec-99 554 963
21-Jul-14 8-Jul-16 554 580
2-Nov-15 27-Nov-15 554 78
3-Sep-14 4-Apr-15 742 830
9-Apr-15 28-Apr-15 742 830
21-May-15 24-Jun-15 742 38
11-Jun-15 31-Dec-99 742 313
23-Nov-15 11-Dec-15 751 44
12-Dec-15 31-Dec-99 751 125
8-Apr-16 22-Apr-16 751 44
10-Mar-16 23-Mar-16 231 67
24-Mar-16 6-Apr-16 231 723
7-Apr-16 31-Dec-99 231 32
7-Apr-16 20-Apr-16 231 839
;
run;
data temp;
set have;
do date=start_date to end_date;
output;
end;
keep position date employee start_date end_date;
format date date9.;
run;
proc sort ;
by position date descending start_date end_date;
run;
data temp2;
set temp;
by position date descending start_date ;
if first.date;
run;
data want;
keep position start_dt end_dt emp;
set temp2;
by position ;
retain position emp last_emp last_date start_dt end_dt;
format start_dt end_dt date9.;
if first.position then
do;
last_emp = employee;
start_dt= date;
last_date = date;
if last.position then
do;
emp = employee;
end_dt = date;
output;
end;
else
return;
end;
if employee = last_emp and date = last_date+1 then
last_date = date;
else
link output_rec;
if last.position then
link output_rec;
return;
output_rec:
emp = last_emp;
end_dt = last_date;
output;
last_emp = employee;
last_date = date;
start_dt = date;
return;
run;
Just to display an alternative checked code, maybe easier to understand.
On my platform 31_dec_99 it is interpreted as 31DEC1999 therefore today() was prefered .
For my covenient i used ddmmyy10. format instead date9. - you may change it.
data have;
input start_date : date9. end_date : date9.
position employee ;
if end_date = '31DEC99'd then end_date=today();
format start_date end_date date9.;
cards;
1-Sep-14 31-Dec-99 348 574
21-Mar-16 6-Jun-16 348 635
1-Aug-13 31-Dec-99 554 963
21-Jul-14 8-Jul-16 554 580
2-Nov-15 27-Nov-15 554 78
3-Sep-09 4-Apr-15 742 830
21-May-15 24-Jun-15 742 38
11-Jun-15 31-Dec-99 742 313
23-Nov-15 11-Dec-15 751 44
12-Dec-15 31-Dec-99 751 125
8-Apr-16 22-Apr-16 751 44
10-Mar-16 23-Mar-16 231 67
24-Mar-16 6-Apr-16 231 723
7-Apr-16 31-Dec-99 231 32
7-Apr-16 20-Apr-16 231 839
;
run;
proc sort data=have out=temp1;
by position start_date employee;
run;
data temp2(keep=position date priority employee);
set temp1;
by position start_date employee;
retain priority;
if first.position then priority = 0;
if first.employee then priority+1;
do date=start_date to end_date; output; end;
format date ddmmyy10.;
run;
proc sort data=temp2 out=temp3;
by position date priority ;
run;
data temp4(drop=priority);
set temp3;
by position notsorted date;
if last.date;
run;
data want(drop=date);
set temp4;
by position notsorted employee;
retain start_date;
format start_date end_date ddmmyy10.;
if first.employee then start_date = date;
if last.employee then do;
end_date = date;
output;
end;
run;
Sory for being not free to develop it earlier.
Anyhow it was a chalenge.
Thanks - it's certainly neater.
Slight amendment needed though, to handle when an employee has separate spells in a position, without someone else in between.
I've made small amendments accordingly, highlighted below.
data have;
input start_date : date9. end_date : date9.
position employee ;
if end_date = '31DEC99'd then end_date=today();
format start_date end_date date9.;
cards;
1-Sep-14 31-Dec-99 348 574
21-Mar-16 6-Jun-16 348 635
1-Aug-13 31-Dec-99 554 963
21-Jul-14 8-Jul-16 554 580
2-Nov-15 27-Nov-15 554 78
3-Sep-09 4-Apr-15 742 830
21-May-15 24-Jun-15 742 830
11-Jun-15 31-Dec-99 742 313
23-Nov-15 11-Dec-15 751 44
12-Dec-15 31-Dec-99 751 125
8-Apr-16 22-Apr-16 751 44
10-Mar-16 23-Mar-16 231 67
24-Mar-16 6-Apr-16 231 723
7-Apr-16 31-Dec-99 231 32
7-Apr-16 20-Apr-16 231 839
;
run;
proc sort data=have out=temp1;
by position start_date employee;
run;
data temp2(keep=position date priority employee);
set temp1;
by position start_date employee;
retain priority;
if first.position then priority = 0;
if first.employee then priority+1;
do date=start_date to end_date; output; end;
format date ddmmyy10.;
run;
proc sort data=temp2 out=temp3;
by position date priority ;
run;
data temp4 /* (drop=priority) */;
set temp3;
by position notsorted date;
if last.date;
run;
data want(drop=date priority);
set temp4;
by position notsorted employee priority;
retain start_date;
format start_date end_date ddmmyy10.;
if first.priority then start_date = date;
if last.priority then do;
end_date = date;
output;
end;
run;
Thanks again.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.