Not applicable
Posts: 1

# Counting number of records active on a given day

Hi,

I have a table with a list of records that are linked to a group ID. Each record has a start date and an end date. Each record is active during this date range:

Group IDDays Within Range for Group IDRecordStart DateEnd Date
A5115
A5215
B10315
B104610

I am trying to calculate the average number of records that are active for a group ID on a given date e.g for group ID A this would be 2 (as both records are active on every day within the group ID range). For B this would be 1 as all days within the range have an active record, and neither one of the records overlap.

Here is the approach that I have taken:

1 - Create a table listing all the dates within the total range of all group IDs

data alldates ;

do Date = '03jan2010'd to '03jan2015'd ;

output ;

end ;

format Date date9. ;

run ;

2 - Joining this with the record table. I matched on the group ID and where the date is within the Start Date and End date for each record. This then created a table listing all of the active days for each record i.e. a record active for 5 days will have 5 rows in this table

proc sql;

create table record_active_day_summary as

select *

from alldates as a

inner join

record_info as b

on  a.group=b.group

where a.Date between b.Start_Date and b.End_Date

quit;

3 - Summarize this table by the average number of times each day appears for each group ID. This gives the final output that I'm looking for.

proc sql;

create table group_summary as

Select distinct Group

count(Date)/count(distinct Date) as Avg_Records_Per_Day

from record_active_day_summary

quit;

Does anybody have an recommendations on how I can speed this up? Ideally I would like to figure out a way to cut out  Step 2 - this is an intermediary  creating a huge data set for me (around 95 million rows).

Super User
Posts: 6,785

## Re: Counting number of records active on a given day

There's probably something better than this available, but here's an improved approach.  The effectiveness will depend on the date ranges that actually appear in your data, for each group_ID.

Summarize the record data first:

proc summary data=record_info;

by group_ID;

var start_date end_date;

output out=date_ranges (keep=group_ID range_begins range_ends)

min(start_date) = range_begins  max(end_date) = range_ends;

run;

There are a few ways to proceed from here, but you won't have to process the full five years for every group_ID.

Good luck.

Posts: 3,167

## Re: Counting number of records active on a given day

If all you care about is the average record-day on GroupID level, then It seems to me the key is to figure out the total number of Effective Days for each Groupid, taking consideration of gaps, overlaps etc. The following code is supposed to have better performance by leveraging Hash table:

data have;

input GroupID\$  record     Start_Date End_Date;

format Start_Date    End_Date date9.;

cards;

A    1    1    5

A    2    1    5

B    3    1    5

B    4    6    10

C    1    1    8

C    2    1    5

C    3    5    10

C    4    14   18

;

data want;

dcl hash h(ordered:'a');

h.definekey('dt');

h.definedone();

do until (last.groupid);

set have;

by groupid notsorted;

do dt=start_date to end_date;

rc=h.ref();

end;

tot_rec_day+intck('day', start_date, end_date)+1;

end;

avg_record=tot_rec_day/h.num_items;

output;

tot_rec_day=.;

keep groupid tot_rec_day avg_record;

run;

If you already have the accurate effective days for each groupid, as being presented, the following step may be skipped:

tot_rec_day+intck('day', start_date, end_date)+1;