Solved: Re: Selecting only the record with the smallest difference between dat...

twerwath · Posted 05-26-2021 10:42 AM

I am working on a situation in which a specific event has occurred for an individual on a given date, and that person has multiple events and multiple personnel records. I want to keep only one record per event, the one closest to the event date (either before or after). I have an idea of how to do this, but am looking for the most efficient possible way since I am working with very large datasets.

I am also not sure what to do when there are two records that are equally close to the event date. In this case, either one could be used, but only one.

An example of what I have:

data have;
input id event_date record_date

datalines;
1 28JAN2019 24JAN2019.
1 28JAN2019 26JAN2019.

1 28JAN2019 3FEB2019.

1 28JAN2019 4FEB2019.

1 3FEB2019 24JAN2019.
1 3FEB2019 26JAN2019.

1 3FEB2019 3FEB2019.

1 3FEB2019 4FEB2019.

2 31JAN2019 28JAN2019.
2 31JAN2019 1FEB2019.

2 31JAN2019 3FEB2019.

2 31JAN2019 4FEB2019.

2 5FEB2019 28JAN2019.
2 5FEB2019 1FEB2019.

2 5FEB2019 3FEB2019.

2 5FEB2019 4FEB2019.

;

run;

An example of what I want:

data want;
input id event_date record_date

datalines;
1 28JAN2019 26JAN2019.

1 3FEB2019 3FEB2019.

2 31JAN2019 1FEB2019.

2 5FEB2019 4FEB2019.

;

run;

Thanks in advance for any help with this!

novinosrin · Posted 05-26-2021 11:16 AM

Well if your version support CUROBS option in set statement, you could get a little aggressive rather than defensive -

data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019. 
1 28JAN2019 26JAN2019. 
1 28JAN2019 3FEB2019. 
1 28JAN2019 4FEB2019. 
1 3FEB2019 24JAN2019. 
1 3FEB2019 26JAN2019. 
1 3FEB2019 3FEB2019. 
1 3FEB2019 4FEB2019. 
2 31JAN2019 28JAN2019. 
2 31JAN2019 1FEB2019. 
2 31JAN2019 3FEB2019. 
2 31JAN2019 4FEB2019. 
2 5FEB2019 28JAN2019. 
2 5FEB2019 1FEB2019. 
2 5FEB2019 3FEB2019. 
2 5FEB2019 4FEB2019.
2 5FEB2019 6FEB2019.  /*added one more to include same diff*/ 
;
run;

data want;
 do until(last.event_date);
  set have curobs=_n_;
  by id event_date;
  diff=abs(event_date-record_date);
  if min>. and diff>=min then continue;
  min=diff;
  k=_n_;
 end;
 set have point=k;
 drop k min diff;
run;

View solution in original post

novinosrin · Posted 05-26-2021 10:57 AM

data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019. 
1 28JAN2019 26JAN2019. 
1 28JAN2019 3FEB2019. 
1 28JAN2019 4FEB2019. 
1 3FEB2019 24JAN2019. 
1 3FEB2019 26JAN2019. 
1 3FEB2019 3FEB2019. 
1 3FEB2019 4FEB2019. 
2 31JAN2019 28JAN2019. 
2 31JAN2019 1FEB2019. 
2 31JAN2019 3FEB2019. 
2 31JAN2019 4FEB2019. 
2 5FEB2019 28JAN2019. 
2 5FEB2019 1FEB2019. 
2 5FEB2019 6FEB2019.  /*added one more to include same diff*/
2 5FEB2019 3FEB2019. 
2 5FEB2019 4FEB2019. 
;
run;


proc sql;
 create table want(drop=diff) as
 select distinct *
 from
 (select *, abs(event_date-record_date) as diff
 from have
 group by id,event_date
 having min(diff)=diff)
 group by id,event_date,diff
 having min(record_date)=record_date;
quit;

novinosrin · Posted 05-26-2021 11:10 AM

data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019. 
1 28JAN2019 26JAN2019. 
1 28JAN2019 3FEB2019. 
1 28JAN2019 4FEB2019. 
1 3FEB2019 24JAN2019. 
1 3FEB2019 26JAN2019. 
1 3FEB2019 3FEB2019. 
1 3FEB2019 4FEB2019. 
2 31JAN2019 28JAN2019. 
2 31JAN2019 1FEB2019. 
2 31JAN2019 3FEB2019. 
2 31JAN2019 4FEB2019. 
2 5FEB2019 28JAN2019. 
2 5FEB2019 1FEB2019. 
2 5FEB2019 3FEB2019. 
2 5FEB2019 4FEB2019.
2 5FEB2019 6FEB2019.  /*added one more to include same diff*/ 
;
run;

data want;
 do _n_=1 by 1 until(last.event_date);
  set have;
  by id event_date;
  diff=abs(event_date-record_date);
  if min>. and diff>=min then continue;
  min=diff;
  k=_n_;
 end;
 do _n_=1 to _n_;
  set have;
  if _n_=k then output;
 end;
 drop k min diff;
run;

novinosrin · Posted 05-26-2021 11:16 AM

Well if your version support CUROBS option in set statement, you could get a little aggressive rather than defensive -

data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019. 
1 28JAN2019 26JAN2019. 
1 28JAN2019 3FEB2019. 
1 28JAN2019 4FEB2019. 
1 3FEB2019 24JAN2019. 
1 3FEB2019 26JAN2019. 
1 3FEB2019 3FEB2019. 
1 3FEB2019 4FEB2019. 
2 31JAN2019 28JAN2019. 
2 31JAN2019 1FEB2019. 
2 31JAN2019 3FEB2019. 
2 31JAN2019 4FEB2019. 
2 5FEB2019 28JAN2019. 
2 5FEB2019 1FEB2019. 
2 5FEB2019 3FEB2019. 
2 5FEB2019 4FEB2019.
2 5FEB2019 6FEB2019.  /*added one more to include same diff*/ 
;
run;

data want;
 do until(last.event_date);
  set have curobs=_n_;
  by id event_date;
  diff=abs(event_date-record_date);
  if min>. and diff>=min then continue;
  min=diff;
  k=_n_;
 end;
 set have point=k;
 drop k min diff;
run;

novinosrin · Posted 05-26-2021 11:30 AM

You could even play with Hash contents for fun and to familiarize ways to park and pick object notations of hash methods-

data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019. 
1 28JAN2019 26JAN2019. 
1 28JAN2019 3FEB2019. 
1 28JAN2019 4FEB2019. 
1 3FEB2019 24JAN2019. 
1 3FEB2019 26JAN2019. 
1 3FEB2019 3FEB2019. 
1 3FEB2019 4FEB2019. 
2 31JAN2019 28JAN2019. 
2 31JAN2019 1FEB2019. 
2 31JAN2019 3FEB2019. 
2 31JAN2019 4FEB2019. 
2 5FEB2019 28JAN2019. 
2 5FEB2019 1FEB2019. 
2 5FEB2019 3FEB2019. 
2 5FEB2019 4FEB2019.
2 5FEB2019 6FEB2019.  /*added one more to include same diff*/ 
;
run;

data want;
 if _n_=1 then do;
  dcl hash h(dataset:'have(obs=0)');
  h.definekey('id');
  h.definedata(all:'y');
  h.definedone();
 end;
 do _n_=h.clear() by 0 until(last.event_date);
  set have;
  by id event_date;
  diff=abs(event_date-record_date);
  if min>. and diff>=min then continue;
  min=diff;
  h.replace();
 end;
 h.find();
 drop min diff;
run;

Kurt_Bremser · Posted 05-26-2021 11:09 AM

If the dataset is sorted (or at least grouped) by id and event_date, do this:

data want;
dist = 99999;
do until (last.event_date);
  set have;
  by id event_date;
  if abs(event_date - record_date) lt dist
  then do;
    dist = abs(event_date - record_date);
    rd = record_date;
  end;
end;
record_date = rd;
drop rd;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

twerwath · Posted 05-26-2021 12:00 PM

Thank you for your solutions! The issue has been solved, this has been very helpful.

mkeintz · Posted 05-26-2021 12:43 PM

This program assume data are sorted by ID/EVENT_DATE/RECORD_DATE:

Once the data step has reached either then end of an ID/EVENT_DATE group, or has a RECORD_DATE>EVENT_DATE, ;you merely have to decide whether the current record is closer than the prior record. If the prior record is close, re-read it.

This allows a way to read in ALL the variables that are associated with the closest record_date.

data want (drop=_:);
  set have ;
  by id event_date;

  _prior_date=ifn(first.event_date=1,'15oct1582'd,lag(record_date));

  /* Find obs after which no future obs can qualify ... */
  if ((_prior_date<event_date) and 
      (event_date<=record_date or last.event_date))
     or
     (first.event_date=1 and event_date<=record_date);

  /* Decide whether to retrieve prior record */
  if (event_date-_prior_date)<abs(record_date-event_date) then do;
    p=_n_-1;
    set have point=p;
  end;
run;

The _prior_date var is set to 15oct1582 (beginning of Gregorian calendar) every time a new event_date is encountered. This will prevent the program from reaching back to the prior event.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Selecting only the record with the smallest difference between dates

Re: Selecting only the record with the smallest difference between dates

Re: Selecting only the record with the smallest difference between dates

Re: Selecting only the record with the smallest difference between dates

Re: Selecting only the record with the smallest difference between dates

Re: Selecting only the record with the smallest difference between dates

Re: Selecting only the record with the smallest difference between dates

Re: Selecting only the record with the smallest difference between dates

Re: Selecting only the record with the smallest difference between dates

Registration is open

SAS Training: Just a Click Away