BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
C_andrade
Calcite | Level 5

Hi, first time posting here! 🙂

 

I don't know if this is the right location for the post... 

 

I have a dataset of 34 mio. records on a monthly basis.

 

For every record there is ID, payment, startdate and enddate. There are 3 mio. unique ID's. Startdates usually go from e.g. 1/1 to 31/1, 5/1 to 5/30 and the like and cover for what period the individual was paid. So an individual who was employed all year will have 12 records, one for every month and the dates will all in all cover the whole year. Some individuals can have more than 12 because they had jobs on the side or multiple employments - and there may be overlaps. Some have less than 12 because they were unemployed for some time or for other reasons, but may still have overlaps.

 

I would like to know for how many days in a year every indivdual was in employment.

 

So it is about counting the days of every individual.

 

My first solution was to create a dummy for every day of the year for every record.

 

I made a macro in the form of:

 

%macro days;

%do i=1 %to 365;
%let day=%eval(’1jan12’d + &i.)

data want;
set have;
if startdate le day le enddate then v&i.=1; else v&i.=0;
run;

%end;
%mend;
%days;

 But this is very time consuming and inefficient. Going through 34 mio. records creating 365 new variables would be something like 2-3 hours if not more. 

 

Any ideas?

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

A simple and quick solution involves an array:

 

data have;
length d1 d2 $10;
input ID Payment d1 d2;
startDate = input(translate(d1,"-","/"), mmddyy10.);
endDate = input(translate(d2,"-","/"), mmddyy10.);
format startDate endDate yymmdd10.;
drop d1 d2;
datalines;
1	  100		1/1-2012	1/31-2012
1	  100		2/1-2012	2/29-2012
1	  100		3/1-2012	3/31-2012
1	  120		4/1-2012	4/30-2012
1	  120		5/1-2012	5/31-2012
1	  80		6/1-2012	6/30-2012
1	  90		7/1-2012	7/31-2012
1	  100		8/1-2012	8/31-2012
1	  100		9/1-2012	9/30-2012
1	  100		10/1-2012      10/31-2012
1	  100		11/1-2012      11/30-2012
1	  200		12/1-2012      12/31-2012
2	  50	        3/3-2012	3/31-2012
2	  55		4/2-2012        4/30-2012
2	  55		5/1-2012	5/31-2012
2	  25		8/1-2012        8/31-2012
3	  17	        5/1-2012	5/17-2012
3	  13		5/5-2012	5/10-2012
3	  15		5/15-2012	5/28-2012      <- ADDED
3	  18	 	5/20-2012       5/28-2012
3	  20	        6/1-2012	6/22-2012
3         55            6/17-2012       6/26-2012
3         10            6/28-2012       6/30-2012 
;

proc sql noprint;
select min(startDate), max(endDate) into :fromDate, :toDate
from have;
run;

data want;
array d{&fromDate. : &toDate.} _temporary_;
set have; by id;
if first.id then call missing(of d{*});
do i = startDate to endDate;
    d{i} = 1;
    end;
if last.id then do;
    workedDays = sum(of d{*});
    output;
    end;
keep id workedDays;
run;

proc print data=want noobs; run;

 

PG

View solution in original post

9 REPLIES 9
PaigeMiller
Diamond | Level 26

To count dates, on each record the INTCK function could be used — or simple substraction

days_worked=end_date–start_date+1;

 

Then add up all the days worked by each individual (by year if desired) using PROC SUMMARY.

--
Paige Miller
C_andrade
Calcite | Level 5

Hi PaigeMiller,

 

Thank you for your answer!

 

The days_worked=end_date–start_date+1 I already did. That was in fact my first and earliest go actually and I have the code giving me the variable of the number of days in the period for every observation, but problem comes up for individuals who have worked in overlapping periods. Say, someone worked 2/1 to 2/10, but also worked 2/5 to 2/15. That would give 10 + 10 days, but he really only worked for 15 days that month. And that is what I am interested in: how many days of the month in question is he actually employed.

 

But maybe I could return to this approach if I could somehow collapse the two records with days overlapping to one that says 1/2 to 15/2. So for every ID (individuals) there exist overlapping time periods the earliest start day and latest end date will be the output. That would mean maybe going from 34 mio. to 30 mio. and then simply just counting the days.

LinusH
Tourmaline | Level 20
Yes, the macro approach is quite ineffective.
Even if I can't see any easy solution, there are other ways.
Have the data sorted by id, start and end date. Use a data step w by and first. A
and last. processing. Retain previous observations dates and adjust the current observation start date, and conditionally output. The to remove any overlapping. Then you can use SQL to sum up all dates for each id.

Another option which is probably easier to code, but less efficient is to use the data step to explicit output one observation for each date on the original observation. The again use SQL to sum up for each id, but now you must use the count (distinct date) construct.
Data never sleeps
C_andrade
Calcite | Level 5

Hi LinusH,

 

Thank you for your answer.

 

First solution sounds like something that is workable. So if I understand it correctly you suggest finding all workperiods of consecutive days. An individual with 12 records may end up with 3 records that consist of exactly the days covering the year he was in employment indicated by a startdate and a enddate for that period?

 

Sounds a bit tricky to code. 

 

Second solution sound like the macro approach, but without a macro, right? But that would give me 365 outputs? Not sure with using SQL to summing up for each ID, would that be faster than having SAS do it?

C_andrade
Calcite | Level 5

 


ID Payment Startdate Enddate 1 100 1/1-2012 1/31-2012 1 100 2/1-2012 2/29-2012 1 100 3/1-2012 3/31-2012 1 120 4/1-2012 4/30-2012 1 120 5/1-2012 5/31-2012 1 80 6/1-2012 6/30-2012 1 90 7/1-2012 7/31-2012 1 100 8/1-2012 8/31-2012 1 100 9/1-2012 9/30-2012 1 100 10/1-2012 10/31-2012 1 100 11/1-2012 11/30-2012 1 200 12/1-2012 12/31-2012 2 50 3/3-2012 3/31-2012 2 55 4/2-2012 4/30-2012 2 55 5/1-2012 5/31-2012 2 25 8/1-2012 8/31-2012 3 17 5/1-2012 5/17-2012 3 13 5/5-2012 5/10-2012 3 18 5/20-2012 5/28-2012 3 20 6/1-2012 6/22-2012 3 55 6/17-2012 6/26-2012
3 10 6/28-2012 6/30-2012

Just to give you an idea of the data. 34 mio. records and main problem - so far - is the overlapping periods.

 

So either...

1. some way of counting the days but taking care of overlapping days of employment. So ID 3 would have 26 days in may and not 31 days, 29 days in june and not 35...

 

Or 

 

2. some way of collapsing records with consecutive days into on observation either on a) month basis or b) year basis: So ID 1 would be unleft or have a period running from 1/1-2012 to 12/31-2012. ID 3 then:

 

3	  17	        5/1-2012	5/17-2012
3	  18	 	5/20-2012       5/28-2012
3	  20	        6/1-2012	6/26-2012
3         10            6/28-2012       6/30-2012 

 

 

 

ChrisNZ
Tourmaline | Level 20

The easiest way is to run a tally of overlappnig periods to see if you are employed or not.

So you need to read the data sorted by event date.

 

data HAVE;
  input ID  START : mmddyy.   END : mmddyy.;
  format START END date9.;
  cards;
3	  5/1/2012	5/17/2012
3	  5/5/2012	5/10/2012
3	 	5/15/2012 5/28/2012    <= I added another overlap here
3	  6/1/2012	6/22/2012
3   6/17/2012 6/26/2012
3   6/28/2012 6/30/2012 
run;
data SORTED(index=(A=(ID DATE)));
  set HAVE;
  by ID;
  DATE=START; PERIODS=+1; output;
  DATE=END  ; PERIODS=-1; output;
run;
data WANT;
  retain DAY_START TOTAL_DAYS 0;
  set SORTED;
  by ID DATE;      
  format DAY_START DATE date9.;
      *putlog ID DATE  date9. ' ' STATUS;   * uncomment to see intermediate data;
      *drop START END PERIODS STATUS;       * uncomment to see intermediate data;
  STATUS+PERIODS;
  if DAY_START=0 and STATUS=1 then DAY_START=DATE;
  if STATUS=0 then do;
    TOTAL_DAYS+DATE-DAY_START+1;
    DAY_START=0;
  end;
  if last.ID then do;
    output;                                 * comment out to see intermediate data;
    DAY_START =0;
TOTAL_DAYS=0; end; run;

 

Hth.

PGStats
Opal | Level 21

A simple and quick solution involves an array:

 

data have;
length d1 d2 $10;
input ID Payment d1 d2;
startDate = input(translate(d1,"-","/"), mmddyy10.);
endDate = input(translate(d2,"-","/"), mmddyy10.);
format startDate endDate yymmdd10.;
drop d1 d2;
datalines;
1	  100		1/1-2012	1/31-2012
1	  100		2/1-2012	2/29-2012
1	  100		3/1-2012	3/31-2012
1	  120		4/1-2012	4/30-2012
1	  120		5/1-2012	5/31-2012
1	  80		6/1-2012	6/30-2012
1	  90		7/1-2012	7/31-2012
1	  100		8/1-2012	8/31-2012
1	  100		9/1-2012	9/30-2012
1	  100		10/1-2012      10/31-2012
1	  100		11/1-2012      11/30-2012
1	  200		12/1-2012      12/31-2012
2	  50	        3/3-2012	3/31-2012
2	  55		4/2-2012        4/30-2012
2	  55		5/1-2012	5/31-2012
2	  25		8/1-2012        8/31-2012
3	  17	        5/1-2012	5/17-2012
3	  13		5/5-2012	5/10-2012
3	  15		5/15-2012	5/28-2012      <- ADDED
3	  18	 	5/20-2012       5/28-2012
3	  20	        6/1-2012	6/22-2012
3         55            6/17-2012       6/26-2012
3         10            6/28-2012       6/30-2012 
;

proc sql noprint;
select min(startDate), max(endDate) into :fromDate, :toDate
from have;
run;

data want;
array d{&fromDate. : &toDate.} _temporary_;
set have; by id;
if first.id then call missing(of d{*});
do i = startDate to endDate;
    d{i} = 1;
    end;
if last.id then do;
    workedDays = sum(of d{*});
    output;
    end;
keep id workedDays;
run;

proc print data=want noobs; run;

 

PG
C_andrade
Calcite | Level 5
Wow! So short and simple and amazingly fast! Thanks

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1881 views
  • 2 likes
  • 5 in conversation