BookmarkSubscribeRSS Feed
Jing333
Calcite | Level 5

Hi,

I have a table with a list of records that are linked to a group ID. Each record has a start date and an end date. Each record is active during this date range:

Group IDDays Within Range for Group IDRecordStart DateEnd Date
A5115
A5215
B10315
B104610

I am trying to calculate the average number of records that are active for a group ID on a given date e.g for group ID A this would be 2 (as both records are active on every day within the group ID range). For B this would be 1 as all days within the range have an active record, and neither one of the records overlap.

Here is the approach that I have taken:

1 - Create a table listing all the dates within the total range of all group IDs

data alldates ;

  do Date = '03jan2010'd to '03jan2015'd ;

    output ;

  end ;

  format Date date9. ;

run ;

2 - Joining this with the record table. I matched on the group ID and where the date is within the Start Date and End date for each record. This then created a table listing all of the active days for each record i.e. a record active for 5 days will have 5 rows in this table

proc sql;

  create table record_active_day_summary as

  select *

  from alldates as a

  inner join

  record_info as b

  on  a.group=b.group

  where a.Date between b.Start_Date and b.End_Date

  quit;

3 - Summarize this table by the average number of times each day appears for each group ID. This gives the final output that I'm looking for.

proc sql;

  create table group_summary as

  Select distinct Group

  count(Date)/count(distinct Date) as Avg_Records_Per_Day

  from record_active_day_summary

  quit;

Does anybody have an recommendations on how I can speed this up? Ideally I would like to figure out a way to cut out  Step 2 - this is an intermediary  creating a huge data set for me (around 95 million rows).

Thanks in advance!

2 REPLIES 2
Astounding
PROC Star

There's probably something better than this available, but here's an improved approach.  The effectiveness will depend on the date ranges that actually appear in your data, for each group_ID.

Summarize the record data first:

proc summary data=record_info;

   by group_ID;

   var start_date end_date;

   output out=date_ranges (keep=group_ID range_begins range_ends)

              min(start_date) = range_begins  max(end_date) = range_ends;

run;

There are a few ways to proceed from here, but you won't have to process the full five years for every group_ID.

Good luck.

Haikuo
Onyx | Level 15

If all you care about is the average record-day on GroupID level, then It seems to me the key is to figure out the total number of Effective Days for each Groupid, taking consideration of gaps, overlaps etc. The following code is supposed to have better performance by leveraging Hash table:

data have;

         input GroupID$  record     Start_Date End_Date;

     format Start_Date    End_Date date9.;

     cards;

A    1    1    5

A    2    1    5

B    3    1    5

B    4    6    10

C    1    1    8

C    2    1    5

C    3    5    10

C    4    14   18

;

data want;

     dcl hash h(ordered:'a');

     h.definekey('dt');

     h.definedone();

     do until (last.groupid);

           set have;

           by groupid notsorted;

           do dt=start_date to end_date;

                rc=h.ref();

           end;

           tot_rec_day+intck('day', start_date, end_date)+1;

     end;

     avg_record=tot_rec_day/h.num_items;

     output;

     tot_rec_day=.;

     keep groupid tot_rec_day avg_record;

run;


If you already have the accurate effective days for each groupid, as being presented, the following step may be skipped:

        tot_rec_day+intck('day', start_date, end_date)+1;

instead, just go ahead use the range number came with your incoming data.

Message was edited by: haikuo bian

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 1188 views
  • 0 likes
  • 3 in conversation