Solved: Insert missing date records and have values carried from before or aft...

Solph · Posted 01-14-2021 11:19 PM

Hi,

I have a data with individuals assessed, randomly, between start date and end date. Since it's random, it can be a few days after start date or days before the end date. As as I result, I have missing dates and assessment values I need to fill.

So I need to 1) insert the missing dates (between start and end) and 2) retain the values from the previous records, then backward fill if the missing is from the start date to the first assessment date (see ID 2).

I googled to find solutions but didn't find a sample code that I can directly apply. So much appreciated for any help, at least for inserting the missing date.

BTW, I have two HAVE data below, the first one is a simple one with just two IDs and the looks of WANT data is based on HAVE1. But my actual data has actually two primary keys, e.g. hospital then individual ID (HAVE2 data). So much appreciated if your help is based on data HAVE2.

data have1; format ID 8.0 startdt assessdt enddt yymmdd10.;
input @1 ID @3 startdt yymmdd10. @14 assessdt yymmdd10. @25 enddt yymmdd10. @36 var1 @38 var2;
datalines;
1 2018/01/01 2018/01/01 2018/01/05 1 2 
1 2018/01/01 2018/01/05 2018/01/05 4 5 
2 2019/12/10 2019/12/11 2019/12/15 1 3
2 2019/12/10 2019/12/13 2019/12/15 4 4
;
proc print noobs; run;

data have2; format HospID ID 8.0 startdt assessdt enddt yymmdd10.;
input @1 HospID @5 ID @7 startdt yymmdd10. @18 assessdt yymmdd10. @29 enddt yymmdd10. @40 var1 @42 var2;
datalines;
101 1 2018/01/01 2018/01/01 2018/01/05 1 2 
101 1 2018/01/01 2018/01/05 2018/01/05 4 5 
102 1 2018/01/01 2018/01/01 2018/01/05 1 2 
102 1 2018/01/01 2018/01/05 2018/01/05 4 5 
101 2 2019/12/10 2019/12/11 2019/12/15 1 3
101 2 2019/12/10 2019/12/13 2019/12/15 4 4
103 2 2019/12/10 2019/12/11 2019/12/15 1 3
103 2 2019/12/10 2019/12/13 2019/12/15 4 4
;
proc print noobs; run;

This is the want data I want for Have1,.

1 2018-01-01 2018-01-01 2018-01-05 1 2    
1 2018-01-01 2018-01-02 2018-01-05 1 2  added, carried from the previous record
1 2018-01-01 2018-01-03 2018-01-05 1 2  added, carried from the previous record
1 2018-01-01 2018-01-04 2018-01-05 1 2  added, carried from the previous record
1 2019-01-01 2018-01-05 2018-01-05 4 5   

2 2019-12-10 2019-12-10 2019-12-15 1 3  added, carried from the next record
2 2019-12-10 2019-12-11 2019-12-15 1 3   
2 2019-12-10 2019-12-12 2019-12-15 1 3  added, carried from the previous record
2 2019-12-10 2019-12-13 2019-12-15 4 4  
2 2019-12-10 2019-12-14 2019-12-15 4 4  added, carried from the previous record
2 2019-12-10 2019-12-15 2019-12-15 4 4  added, carried from the previous record

Solph · Posted 01-16-2021 12:33 AM

Thanks all who replied. I thought I'd have a few words before I close the case.

1. I realized having filling the missing dates between start date and end date is not efficient, because my data is easily 100K cases and each spans over a year up to 10 years, so even if the code works, it would not be efficient.

2. I need to either back fill missing dates and values (for days between start date and first assessment date) and forward fill (for days between the last assessment and end date). So it's a daunting task, and messy at time for data pulls with different criteria.

3. I then recalled the data provider already produced a data with those events data restructured at the event level, that is, they reconstruct the data when there is a a turn of event, with proper start date and end for each event assessment record. It probably took them months to first come up with the code and it's fine tuned each year ever since.

For those who are curious how it is done:

1. if there is a gap between start date and 1st assessment date, then they insert a record, with EVENT start date=start date and EVENT end date = 1st assessment date. They then calculate event days between the two dates.

2. if there is a gap between the last assessment record and end date, they insert an EVENT record with EVENT start date = last assessment date and EVENT end date = end date.

3. For everything in between, they'd reconstruct an EVENT record with the 1st assessment date as EVENT start date and the next assessment date as EVENT end date, and so on.

4. There are other turns of events we'd consider (e.g. calendar year end), it's the same logic to break into event records.

The source data (as in sample data) has unique assess ID assigned to each record. The reconstructed data keep the assess ID if the reconstructed event is associated with or derived from the assessment record. So one can link the source data to the reconstructed event data by this assess ID and make use of the number event days to multiply whatever in the source data (such as VAR1 in my sample data).

The code works if we know what turn of events we want. For other analysis, filling by daily date might be more desired I guess.

Thanks again for all who helped to answer and provide your tips. Much appreciated.

View solution in original post

mkeintz · Posted 01-15-2021 12:42 AM

This looks like a homework assignment, so help me to help you help yourself.

First, I assume you always have 2 records per id, yes?
And let's restate your rule:
1. For the first record of each pair, you want to output each date from startdt to 1 day prior to the assessdt from the following record.
2. For the second record you want to output each date from the assessdt to enddt, Correct?

Is that correct?

If so, then you would do a loop with DATE going over the range in 2.1 or 2.2. And you would need an OUTPUT statement inside the loop. The problem would be determining the range.

BTW, going from 1 ID variable to 2 is not an issue.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 01-15-2021 10:43 PM

data have1; 
  format ID 8.0 startdt assessdt enddt yymmdd10.;
  input @1 ID @3 startdt yymmdd10. @14 assessdt yymmdd10. @25 enddt yymmdd10. @36 var1 @38 var2;
datalines;
1 2018/01/01 2018/01/01 2018/01/05 1 2      
1 2018/01/01 2018/01/05 2018/01/05 4 5      
2 2019/12/10 2019/12/11 2019/12/15 1 3    
2 2019/12/10 2019/12/13 2019/12/15 4 4
;

/* Works for any number of records per id */
data want1 (drop=_:);
  set have1  ;
  by id;

  if end_of_lookahead=0 then 
     set have1 (firstobs=2 keep=assessdt rename=(assessdt=_nxt_dt)) end=end_of_lookahead;

  _start=ifn(first.id,startdt,assessdt);
  _end=ifn(last.id,enddt,_nxt_dt-1);

  do date=_start to _end;
    output;
  end;

  format date date9.;
run;

This program has two SET statements, producing two synchronized streams of data. The first SET reds in the data as usual. The second SET reads starting at the 2nd observation, keeping only the ASSESSDT variable, which has to be renamed in order to prevent overwriting the current ASSESSDT. So each iteration of the data step has in hand STARTDT, ENDDT, ASSESSDT and _NXT_DT.

These variables, plus the first.id and last.id automatic variables (from the BY ID statement) are all you need to determine the range of DATEs to produce from each incoming record.

Note the 2nd SET has an IF condition. Without this condition, it would attempt to read beyond the end of have1 while the first SET has the last record in hand. The data step would prematurely terminate.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Solph · Posted 01-16-2021 12:33 AM

Thanks all who replied. I thought I'd have a few words before I close the case.

1. I realized having filling the missing dates between start date and end date is not efficient, because my data is easily 100K cases and each spans over a year up to 10 years, so even if the code works, it would not be efficient.

2. I need to either back fill missing dates and values (for days between start date and first assessment date) and forward fill (for days between the last assessment and end date). So it's a daunting task, and messy at time for data pulls with different criteria.

3. I then recalled the data provider already produced a data with those events data restructured at the event level, that is, they reconstruct the data when there is a a turn of event, with proper start date and end for each event assessment record. It probably took them months to first come up with the code and it's fine tuned each year ever since.

For those who are curious how it is done:

1. if there is a gap between start date and 1st assessment date, then they insert a record, with EVENT start date=start date and EVENT end date = 1st assessment date. They then calculate event days between the two dates.

2. if there is a gap between the last assessment record and end date, they insert an EVENT record with EVENT start date = last assessment date and EVENT end date = end date.

3. For everything in between, they'd reconstruct an EVENT record with the 1st assessment date as EVENT start date and the next assessment date as EVENT end date, and so on.

4. There are other turns of events we'd consider (e.g. calendar year end), it's the same logic to break into event records.

The source data (as in sample data) has unique assess ID assigned to each record. The reconstructed data keep the assess ID if the reconstructed event is associated with or derived from the assessment record. So one can link the source data to the reconstructed event data by this assess ID and make use of the number event days to multiply whatever in the source data (such as VAR1 in my sample data).

The code works if we know what turn of events we want. For other analysis, filling by daily date might be more desired I guess.

Thanks again for all who helped to answer and provide your tips. Much appreciated.

andreas_lds · Posted 01-15-2021 01:34 AM

Basically you need to verify that the assessdt found in an observation is the value you expect in that obs. If the value matched the expected one, write the obs to the dataset. If not, you need to insert obs until it matches. And there is one special case: when processing the last observation of an Id, you need to add obs, if assessdt < enddt. The expected value of assessdt is the last value +1 or startdt if he first obs (per Id) is processed.

Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

Re: Insert missing date records and have values carried from before or after

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!