Hi, first time posting here! 🙂
I don't know if this is the right location for the post...
I have a dataset of 34 mio. records on a monthly basis.
For every record there is ID, payment, startdate and enddate. There are 3 mio. unique ID's. Startdates usually go from e.g. 1/1 to 31/1, 5/1 to 5/30 and the like and cover for what period the individual was paid. So an individual who was employed all year will have 12 records, one for every month and the dates will all in all cover the whole year. Some individuals can have more than 12 because they had jobs on the side or multiple employments - and there may be overlaps. Some have less than 12 because they were unemployed for some time or for other reasons, but may still have overlaps.
I would like to know for how many days in a year every indivdual was in employment.
So it is about counting the days of every individual.
My first solution was to create a dummy for every day of the year for every record.
I made a macro in the form of:
%macro days;
%do i=1 %to 365;
%let day=%eval(’1jan12’d + &i.)
data want;
set have;
if startdate le day le enddate then v&i.=1; else v&i.=0;
run;
%end;
%mend;
%days;
But this is very time consuming and inefficient. Going through 34 mio. records creating 365 new variables would be something like 2-3 hours if not more.
Any ideas?
A simple and quick solution involves an array:
data have;
length d1 d2 $10;
input ID Payment d1 d2;
startDate = input(translate(d1,"-","/"), mmddyy10.);
endDate = input(translate(d2,"-","/"), mmddyy10.);
format startDate endDate yymmdd10.;
drop d1 d2;
datalines;
1 100 1/1-2012 1/31-2012
1 100 2/1-2012 2/29-2012
1 100 3/1-2012 3/31-2012
1 120 4/1-2012 4/30-2012
1 120 5/1-2012 5/31-2012
1 80 6/1-2012 6/30-2012
1 90 7/1-2012 7/31-2012
1 100 8/1-2012 8/31-2012
1 100 9/1-2012 9/30-2012
1 100 10/1-2012 10/31-2012
1 100 11/1-2012 11/30-2012
1 200 12/1-2012 12/31-2012
2 50 3/3-2012 3/31-2012
2 55 4/2-2012 4/30-2012
2 55 5/1-2012 5/31-2012
2 25 8/1-2012 8/31-2012
3 17 5/1-2012 5/17-2012
3 13 5/5-2012 5/10-2012
3 15 5/15-2012 5/28-2012 <- ADDED
3 18 5/20-2012 5/28-2012
3 20 6/1-2012 6/22-2012
3 55 6/17-2012 6/26-2012
3 10 6/28-2012 6/30-2012
;
proc sql noprint;
select min(startDate), max(endDate) into :fromDate, :toDate
from have;
run;
data want;
array d{&fromDate. : &toDate.} _temporary_;
set have; by id;
if first.id then call missing(of d{*});
do i = startDate to endDate;
d{i} = 1;
end;
if last.id then do;
workedDays = sum(of d{*});
output;
end;
keep id workedDays;
run;
proc print data=want noobs; run;
To count dates, on each record the INTCK function could be used — or simple substraction
days_worked=end_date–start_date+1;
Then add up all the days worked by each individual (by year if desired) using PROC SUMMARY.
Hi PaigeMiller,
Thank you for your answer!
The days_worked=end_date–start_date+1 I already did. That was in fact my first and earliest go actually and I have the code giving me the variable of the number of days in the period for every observation, but problem comes up for individuals who have worked in overlapping periods. Say, someone worked 2/1 to 2/10, but also worked 2/5 to 2/15. That would give 10 + 10 days, but he really only worked for 15 days that month. And that is what I am interested in: how many days of the month in question is he actually employed.
But maybe I could return to this approach if I could somehow collapse the two records with days overlapping to one that says 1/2 to 15/2. So for every ID (individuals) there exist overlapping time periods the earliest start day and latest end date will be the output. That would mean maybe going from 34 mio. to 30 mio. and then simply just counting the days.
Hi LinusH,
Thank you for your answer.
First solution sounds like something that is workable. So if I understand it correctly you suggest finding all workperiods of consecutive days. An individual with 12 records may end up with 3 records that consist of exactly the days covering the year he was in employment indicated by a startdate and a enddate for that period?
Sounds a bit tricky to code.
Second solution sound like the macro approach, but without a macro, right? But that would give me 365 outputs? Not sure with using SQL to summing up for each ID, would that be faster than having SAS do it?
ID Payment Startdate Enddate 1 100 1/1-2012 1/31-2012 1 100 2/1-2012 2/29-2012 1 100 3/1-2012 3/31-2012 1 120 4/1-2012 4/30-2012 1 120 5/1-2012 5/31-2012 1 80 6/1-2012 6/30-2012 1 90 7/1-2012 7/31-2012 1 100 8/1-2012 8/31-2012 1 100 9/1-2012 9/30-2012 1 100 10/1-2012 10/31-2012 1 100 11/1-2012 11/30-2012 1 200 12/1-2012 12/31-2012 2 50 3/3-2012 3/31-2012 2 55 4/2-2012 4/30-2012 2 55 5/1-2012 5/31-2012 2 25 8/1-2012 8/31-2012 3 17 5/1-2012 5/17-2012 3 13 5/5-2012 5/10-2012 3 18 5/20-2012 5/28-2012 3 20 6/1-2012 6/22-2012 3 55 6/17-2012 6/26-2012
3 10 6/28-2012 6/30-2012
Just to give you an idea of the data. 34 mio. records and main problem - so far - is the overlapping periods.
So either...
1. some way of counting the days but taking care of overlapping days of employment. So ID 3 would have 26 days in may and not 31 days, 29 days in june and not 35...
Or
2. some way of collapsing records with consecutive days into on observation either on a) month basis or b) year basis: So ID 1 would be unleft or have a period running from 1/1-2012 to 12/31-2012. ID 3 then:
3 17 5/1-2012 5/17-2012 3 18 5/20-2012 5/28-2012 3 20 6/1-2012 6/26-2012 3 10 6/28-2012 6/30-2012
The easiest way is to run a tally of overlappnig periods to see if you are employed or not.
So you need to read the data sorted by event date.
data HAVE;
input ID START : mmddyy. END : mmddyy.;
format START END date9.;
cards;
3 5/1/2012 5/17/2012
3 5/5/2012 5/10/2012
3 5/15/2012 5/28/2012 <= I added another overlap here
3 6/1/2012 6/22/2012
3 6/17/2012 6/26/2012
3 6/28/2012 6/30/2012
run;
data SORTED(index=(A=(ID DATE)));
set HAVE;
by ID;
DATE=START; PERIODS=+1; output;
DATE=END ; PERIODS=-1; output;
run;
data WANT;
retain DAY_START TOTAL_DAYS 0;
set SORTED;
by ID DATE;
format DAY_START DATE date9.;
*putlog ID DATE date9. ' ' STATUS; * uncomment to see intermediate data;
*drop START END PERIODS STATUS; * uncomment to see intermediate data;
STATUS+PERIODS;
if DAY_START=0 and STATUS=1 then DAY_START=DATE;
if STATUS=0 then do;
TOTAL_DAYS+DATE-DAY_START+1;
DAY_START=0;
end;
if last.ID then do;
output; * comment out to see intermediate data;
DAY_START =0;
TOTAL_DAYS=0;
end;
run;
Hth.
A simple and quick solution involves an array:
data have;
length d1 d2 $10;
input ID Payment d1 d2;
startDate = input(translate(d1,"-","/"), mmddyy10.);
endDate = input(translate(d2,"-","/"), mmddyy10.);
format startDate endDate yymmdd10.;
drop d1 d2;
datalines;
1 100 1/1-2012 1/31-2012
1 100 2/1-2012 2/29-2012
1 100 3/1-2012 3/31-2012
1 120 4/1-2012 4/30-2012
1 120 5/1-2012 5/31-2012
1 80 6/1-2012 6/30-2012
1 90 7/1-2012 7/31-2012
1 100 8/1-2012 8/31-2012
1 100 9/1-2012 9/30-2012
1 100 10/1-2012 10/31-2012
1 100 11/1-2012 11/30-2012
1 200 12/1-2012 12/31-2012
2 50 3/3-2012 3/31-2012
2 55 4/2-2012 4/30-2012
2 55 5/1-2012 5/31-2012
2 25 8/1-2012 8/31-2012
3 17 5/1-2012 5/17-2012
3 13 5/5-2012 5/10-2012
3 15 5/15-2012 5/28-2012 <- ADDED
3 18 5/20-2012 5/28-2012
3 20 6/1-2012 6/22-2012
3 55 6/17-2012 6/26-2012
3 10 6/28-2012 6/30-2012
;
proc sql noprint;
select min(startDate), max(endDate) into :fromDate, :toDate
from have;
run;
data want;
array d{&fromDate. : &toDate.} _temporary_;
set have; by id;
if first.id then call missing(of d{*});
do i = startDate to endDate;
d{i} = 1;
end;
if last.id then do;
workedDays = sum(of d{*});
output;
end;
keep id workedDays;
run;
proc print data=want noobs; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.