Hello, I am trying to count the number of company-specific holidays an employee has worked between two different dates.
There is a Holiday table which lists the company holidays, starting from 2013 and going through 2025. So, there is 12-25-2013, 12-25-2014, etc. Then there are the holidays that change date every year, such as Thanksgiving and the day after Thanksgiving, etc. I cannot use the SAS holiday function because it includes days my company does not take off.
HOLIDAY_ID HOLIDAY_DATE
1 12/25/2013
2 01/01/2014
3 05/27/2014
There is a table with employees working different time spans. There are more columns involved, but essentially, it looks like this:
EMP_ID WORK_START WORK_END_DATE
1 12/24/2013 05/26/2014
2 12/24/2013 06/01/2014
3 01/06/2014 03/23/2014
Basically, for each employee (EMP_ID) I need the number of rows in the Holiday table equal to the number of holidays between WORK_START and WORK_END DATE. Aka "Want" table looks like
EMP_ID WORK_START WORK_END_DATE HOLIDAY_COUNT
1 12/24/2013 05/26/2014 2
2 12/24/2013 06/01/2014 3
3 01/06/2014 03/23/2014 0
The employee work table is pretty large.
I'm sure this is something that's been solved before and is probably simple, but I cannot find a solution online. Thank you!
Like this?
proc sql;
select EMP_ID, WORK_START, WORK_END_DATE, count(HOLIDAY_DATE) as HOLIDAY_COUNT
from EMPLOYEE left join HOLIDAYS
on HOLIDAY_DATE between WORK_START and WORK_END_DATE
@calger wrote:
Won't this result in cross join that takes forever to execute? Is there a
better way?
Well the criteria don't include equijoins or other proximity conditions, so a cartesian join is unavoidable in this case.
This does not mean it takes forever. What are the volumes?
It takes 50 seconds for 1m times 800 records on my old PC.
data EMP;
do ID=1 to 1e6;
START=ranuni(1)*20000;
END =START+ranuni(1)*5000;
output;
end;
run;
data HOL;
do DAY=1 to 25000;
if ranuni(1)>.97 then output;
end;
run;
proc sql _method;
create table WANT as
select ID, START, END, count(DAY)
from EMP left join HOL
on DAY between START and END
group by ID, START, END;
quit;
The good thing about using an array is you can stop the cartesian match before it is fully complete, when the holidays come after the end date. For example this runs faster than the proc SQL.
data _null_;
call symputx('num_holidays', NUM_HOLIDAYS);
if 0 then set HOL nobs=NUM_HOLIDAYS;
run;
data WANT2;
set EMP;
array HOLIDAYS[&num_holidays.] _temporary_;
if _N_ = 1 then do I = 1 to NUM_HOLIDAYS;
set HOL nobs= NUM_HOLIDAYS;
HOLIDAYS[I] = DAY;
end;
do I = 1 to dim(HOLIDAYS) until (HOLIDAYS[I] > END);
HOLIDAY_COUNT = sum(HOLIDAY_COUNT, START <= HOLIDAYS[I] <= END );
end;
keep ID START END HOLIDAY_COUNT;
run;
Another solution: using an array as lookup:
data _null_;
call symputx('num_holidays', num_holidays);
if 0 then set work.holidays nobs=num_holidays;
run;
data want;
set work.emps;
length holiday_count 8;
array holidays[&num_holidays.] _temporary_;
if _n_ = 1 then do;
do i = 1 to num_holidays;
set work.holidays nobs= num_holidays;
holidays[i] = holiday_date;
end;
end;
holiday_count = 0;
do i = 1 to dim(holidays);
holiday_count = holiday_count + (work_start <= holidays[i] <= work_end);
end;
drop i holiday_date holiday_id;
run;
data holiday;
input HOLIDAY_ID HOLIDAY_DATE : mmddyy12.;
format HOLIDAY_DATE mmddyy10. ;
cards;
1 12/25/2013
2 01/01/2014
3 05/27/2014
;
data have;
input EMP_ID WORK_START : mmddyy12. WORK_END_DATE : mmddyy12.;
format WORK_START WORK_END_DATE mmddyy10. ;
cards;
1 12/24/2013 05/26/2014
2 12/24/2013 06/01/2014
3 01/06/2014 03/23/2014
;
data want;
if _n_=1 then do;
if 0 then set holiday(keep=HOLIDAY_DATE);
declare hash h(dataset:'holiday(keep=HOLIDAY_DATE)');
h.definekey('HOLIDAY_DATE');
h.definedone();
end;
set have;
HOLIDAY_COUNT=0;
do i=WORK_START to WORK_END_DATE;
if h.check(key:i)=0 then HOLIDAY_COUNT+1;
end;
drop i HOLIDAY_DATE;
run;
And another solution: A data-null-step is used to create a data step that does all the work.
data work.holidays;
input holiday_id holiday_date mmddyy10.;
datalines;
1 12/25/2013
2 01/01/2014
3 05/27/2014
;
data work.emps;
length emp_id 8 work_start work_end 8;
informat work_: mmddyy10.;
format work_: date9.;
input emp_id work_start work_end;
datalines;
1 12/24/2013 05/26/2014
2 12/24/2013 06/01/2014
3 01/06/2014 03/23/2014
;
data _null_;
set work.holidays end= jobDone;
length dummy $ 100;
if _n_ = 1 then do;
call execute('data work.want; set work.emps;');
call execute('holiday_count =');
end;
dummy = '';
if _n_ > 1 then do;
dummy = '+';
end;
dummy = catx(' ', dummy, '(work_start <=', holiday_date, '<= work_end)');
call execute(dummy);
put ' ' dummy;
if jobDone then do;
call execute('; run;');
end;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.