Solved: How many days of employment for every ID

C_andrade · Posted 10-31-2015 11:50 AM

Hi, first time posting here! 🙂

I don't know if this is the right location for the post...

I have a dataset of 34 mio. records on a monthly basis.

For every record there is ID, payment, startdate and enddate. There are 3 mio. unique ID's. Startdates usually go from e.g. 1/1 to 31/1, 5/1 to 5/30 and the like and cover for what period the individual was paid. So an individual who was employed all year will have 12 records, one for every month and the dates will all in all cover the whole year. Some individuals can have more than 12 because they had jobs on the side or multiple employments - and there may be overlaps. Some have less than 12 because they were unemployed for some time or for other reasons, but may still have overlaps.

I would like to know for how many days in a year every indivdual was in employment.

So it is about counting the days of every individual.

My first solution was to create a dummy for every day of the year for every record.

I made a macro in the form of:

%macro days;

%do i=1 %to 365;
%let day=%eval(’1jan12’d + &i.)

data want;
set have;
if startdate le day le enddate then v&i.=1; else v&i.=0;
run;

%end;
%mend;
%days;

But this is very time consuming and inefficient. Going through 34 mio. records creating 365 new variables would be something like 2-3 hours if not more.

Any ideas?

PGStats · Posted 11-01-2015 11:07 PM

A simple and quick solution involves an array:

data have;
length d1 d2 $10;
input ID Payment d1 d2;
startDate = input(translate(d1,"-","/"), mmddyy10.);
endDate = input(translate(d2,"-","/"), mmddyy10.);
format startDate endDate yymmdd10.;
drop d1 d2;
datalines;
1	  100		1/1-2012	1/31-2012
1	  100		2/1-2012	2/29-2012
1	  100		3/1-2012	3/31-2012
1	  120		4/1-2012	4/30-2012
1	  120		5/1-2012	5/31-2012
1	  80		6/1-2012	6/30-2012
1	  90		7/1-2012	7/31-2012
1	  100		8/1-2012	8/31-2012
1	  100		9/1-2012	9/30-2012
1	  100		10/1-2012      10/31-2012
1	  100		11/1-2012      11/30-2012
1	  200		12/1-2012      12/31-2012
2	  50	        3/3-2012	3/31-2012
2	  55		4/2-2012        4/30-2012
2	  55		5/1-2012	5/31-2012
2	  25		8/1-2012        8/31-2012
3	  17	        5/1-2012	5/17-2012
3	  13		5/5-2012	5/10-2012
3	  15		5/15-2012	5/28-2012      <- ADDED
3	  18	 	5/20-2012       5/28-2012
3	  20	        6/1-2012	6/22-2012
3         55            6/17-2012       6/26-2012
3         10            6/28-2012       6/30-2012 
;

proc sql noprint;
select min(startDate), max(endDate) into :fromDate, :toDate
from have;
run;

data want;
array d{&fromDate. : &toDate.} _temporary_;
set have; by id;
if first.id then call missing(of d{*});
do i = startDate to endDate;
    d{i} = 1;
    end;
if last.id then do;
    workedDays = sum(of d{*});
    output;
    end;
keep id workedDays;
run;

proc print data=want noobs; run;

PG

View solution in original post

PaigeMiller · Posted 10-31-2015 01:52 PM

To count dates, on each record the INTCK function could be used — or simple substraction

days_worked=end_date–start_date+1;

Then add up all the days worked by each individual (by year if desired) using PROC SUMMARY.

--
Paige Miller

C_andrade · Posted 11-01-2015 03:07 AM

Hi PaigeMiller,

Thank you for your answer!

The days_worked=end_date–start_date+1 I already did. That was in fact my first and earliest go actually and I have the code giving me the variable of the number of days in the period for every observation, but problem comes up for individuals who have worked in overlapping periods. Say, someone worked 2/1 to 2/10, but also worked 2/5 to 2/15. That would give 10 + 10 days, but he really only worked for 15 days that month. And that is what I am interested in: how many days of the month in question is he actually employed.

But maybe I could return to this approach if I could somehow collapse the two records with days overlapping to one that says 1/2 to 15/2. So for every ID (individuals) there exist overlapping time periods the earliest start day and latest end date will be the output. That would mean maybe going from 34 mio. to 30 mio. and then simply just counting the days.

LinusH · Posted 10-31-2015 02:00 PM

Yes, the macro approach is quite ineffective.
Even if I can't see any easy solution, there are other ways.
Have the data sorted by id, start and end date. Use a data step w by and first. A
and last. processing. Retain previous observations dates and adjust the current observation start date, and conditionally output. The to remove any overlapping. Then you can use SQL to sum up all dates for each id.

Another option which is probably easier to code, but less efficient is to use the data step to explicit output one observation for each date on the original observation. The again use SQL to sum up for each id, but now you must use the count (distinct date) construct.

Data never sleeps

C_andrade · Posted 11-01-2015 03:26 AM

Hi LinusH,

Thank you for your answer.

First solution sounds like something that is workable. So if I understand it correctly you suggest finding all workperiods of consecutive days. An individual with 12 records may end up with 3 records that consist of exactly the days covering the year he was in employment indicated by a startdate and a enddate for that period?

Sounds a bit tricky to code.

Second solution sound like the macro approach, but without a macro, right? But that would give me 365 outputs? Not sure with using SQL to summing up for each ID, would that be faster than having SAS do it?

C_andrade · Posted 11-01-2015 04:32 AM


ID	Payment		Startdate	Enddate
1	  100		1/1-2012	1/31-2012
1	  100		2/1-2012	2/29-2012
1	  100		3/1-2012	3/31-2012
1	  120		4/1-2012	4/30-2012
1	  120		5/1-2012	5/31-2012
1	  80		6/1-2012	6/30-2012
1	  90		7/1-2012	7/31-2012
1	  100		8/1-2012	8/31-2012
1	  100		9/1-2012	9/30-2012
1	  100		10/1-2012      10/31-2012
1	  100		11/1-2012      11/30-2012
1	  200		12/1-2012      12/31-2012

2	  50	        3/3-2012	3/31-2012
2	  55		4/2-2012        4/30-2012
2	  55		5/1-2012	5/31-2012
2	  25		8/1-2012        8/31-2012

3	  17	        5/1-2012	5/17-2012
3	  13		5/5-2012	5/10-2012
3	  18	 	5/20-2012       5/28-2012
3	  20	        6/1-2012	6/22-2012
3         55            6/17-2012       6/26-2012
3         10            6/28-2012       6/30-2012

Just to give you an idea of the data. 34 mio. records and main problem - so far - is the overlapping periods.

So either...

1. some way of counting the days but taking care of overlapping days of employment. So ID 3 would have 26 days in may and not 31 days, 29 days in june and not 35...

Or

2. some way of collapsing records with consecutive days into on observation either on a) month basis or b) year basis: So ID 1 would be unleft or have a period running from 1/1-2012 to 12/31-2012. ID 3 then:

3	  17	        5/1-2012	5/17-2012
3	  18	 	5/20-2012       5/28-2012
3	  20	        6/1-2012	6/26-2012
3         10            6/28-2012       6/30-2012

ChrisNZ · Posted 11-01-2015 09:35 PM

The easiest way is to run a tally of overlappnig periods to see if you are employed or not.

So you need to read the data sorted by event date.

data HAVE;
  input ID  START : mmddyy.   END : mmddyy.;
  format START END date9.;
  cards;
3	  5/1/2012	5/17/2012
3	  5/5/2012	5/10/2012
3	 	5/15/2012 5/28/2012    <= I added another overlap here
3	  6/1/2012	6/22/2012
3   6/17/2012 6/26/2012
3   6/28/2012 6/30/2012 
run;
data SORTED(index=(A=(ID DATE)));
  set HAVE;
  by ID;
  DATE=START; PERIODS=+1; output;
  DATE=END  ; PERIODS=-1; output;
run;
data WANT;
  retain DAY_START TOTAL_DAYS 0;
  set SORTED;
  by ID DATE;      
  format DAY_START DATE date9.;
      *putlog ID DATE  date9. ' ' STATUS;   * uncomment to see intermediate data;
      *drop START END PERIODS STATUS;       * uncomment to see intermediate data;
  STATUS+PERIODS;
  if DAY_START=0 and STATUS=1 then DAY_START=DATE;
  if STATUS=0 then do;
    TOTAL_DAYS+DATE-DAY_START+1;
    DAY_START=0;
  end;
  if last.ID then do;
    output;                                 * comment out to see intermediate data;
    DAY_START =0;
    TOTAL_DAYS=0;
  end;
run;

Hth.

High-Performance SAS Coding - Third Edition

PGStats · Posted 11-01-2015 11:07 PM

A simple and quick solution involves an array:

data have;
length d1 d2 $10;
input ID Payment d1 d2;
startDate = input(translate(d1,"-","/"), mmddyy10.);
endDate = input(translate(d2,"-","/"), mmddyy10.);
format startDate endDate yymmdd10.;
drop d1 d2;
datalines;
1	  100		1/1-2012	1/31-2012
1	  100		2/1-2012	2/29-2012
1	  100		3/1-2012	3/31-2012
1	  120		4/1-2012	4/30-2012
1	  120		5/1-2012	5/31-2012
1	  80		6/1-2012	6/30-2012
1	  90		7/1-2012	7/31-2012
1	  100		8/1-2012	8/31-2012
1	  100		9/1-2012	9/30-2012
1	  100		10/1-2012      10/31-2012
1	  100		11/1-2012      11/30-2012
1	  200		12/1-2012      12/31-2012
2	  50	        3/3-2012	3/31-2012
2	  55		4/2-2012        4/30-2012
2	  55		5/1-2012	5/31-2012
2	  25		8/1-2012        8/31-2012
3	  17	        5/1-2012	5/17-2012
3	  13		5/5-2012	5/10-2012
3	  15		5/15-2012	5/28-2012      <- ADDED
3	  18	 	5/20-2012       5/28-2012
3	  20	        6/1-2012	6/22-2012
3         55            6/17-2012       6/26-2012
3         10            6/28-2012       6/30-2012 
;

proc sql noprint;
select min(startDate), max(endDate) into :fromDate, :toDate
from have;
run;

data want;
array d{&fromDate. : &toDate.} _temporary_;
set have; by id;
if first.id then call missing(of d{*});
do i = startDate to endDate;
    d{i} = 1;
    end;
if last.id then do;
    workedDays = sum(of d{*});
    output;
    end;
keep id workedDays;
run;

proc print data=want noobs; run;

PG

ChrisNZ · Posted 11-01-2015 11:45 PM

Nice!

High-Performance SAS Coding - Third Edition

C_andrade · Posted 11-02-2015 04:26 AM

Wow! So short and simple and amazingly fast! Thanks

How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

Re: How many days of employment for every ID

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away