I am working on a situation in which a specific event has occurred for an individual on a given date, and that person has multiple events and multiple personnel records. I want to keep only one record per event, the one closest to the event date (either before or after). I have an idea of how to do this, but am looking for the most efficient possible way since I am working with very large datasets.
I am also not sure what to do when there are two records that are equally close to the event date. In this case, either one could be used, but only one.
An example of what I have:
data have;
input id event_date record_date
datalines;
1 28JAN2019 24JAN2019.
1 28JAN2019 26JAN2019.
1 28JAN2019 3FEB2019.
1 28JAN2019 4FEB2019.
1 3FEB2019 24JAN2019.
1 3FEB2019 26JAN2019.
1 3FEB2019 3FEB2019.
1 3FEB2019 4FEB2019.
2 31JAN2019 28JAN2019.
2 31JAN2019 1FEB2019.
2 31JAN2019 3FEB2019.
2 31JAN2019 4FEB2019.
2 5FEB2019 28JAN2019.
2 5FEB2019 1FEB2019.
2 5FEB2019 3FEB2019.
2 5FEB2019 4FEB2019.
;
run;
An example of what I want:
data want;
input id event_date record_date
datalines;
1 28JAN2019 26JAN2019.
1 3FEB2019 3FEB2019.
2 31JAN2019 1FEB2019.
2 5FEB2019 4FEB2019.
;
run;
Thanks in advance for any help with this!
Well if your version support CUROBS option in set statement, you could get a little aggressive rather than defensive -
data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019.
1 28JAN2019 26JAN2019.
1 28JAN2019 3FEB2019.
1 28JAN2019 4FEB2019.
1 3FEB2019 24JAN2019.
1 3FEB2019 26JAN2019.
1 3FEB2019 3FEB2019.
1 3FEB2019 4FEB2019.
2 31JAN2019 28JAN2019.
2 31JAN2019 1FEB2019.
2 31JAN2019 3FEB2019.
2 31JAN2019 4FEB2019.
2 5FEB2019 28JAN2019.
2 5FEB2019 1FEB2019.
2 5FEB2019 3FEB2019.
2 5FEB2019 4FEB2019.
2 5FEB2019 6FEB2019. /*added one more to include same diff*/
;
run;
data want;
do until(last.event_date);
set have curobs=_n_;
by id event_date;
diff=abs(event_date-record_date);
if min>. and diff>=min then continue;
min=diff;
k=_n_;
end;
set have point=k;
drop k min diff;
run;
data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019.
1 28JAN2019 26JAN2019.
1 28JAN2019 3FEB2019.
1 28JAN2019 4FEB2019.
1 3FEB2019 24JAN2019.
1 3FEB2019 26JAN2019.
1 3FEB2019 3FEB2019.
1 3FEB2019 4FEB2019.
2 31JAN2019 28JAN2019.
2 31JAN2019 1FEB2019.
2 31JAN2019 3FEB2019.
2 31JAN2019 4FEB2019.
2 5FEB2019 28JAN2019.
2 5FEB2019 1FEB2019.
2 5FEB2019 6FEB2019. /*added one more to include same diff*/
2 5FEB2019 3FEB2019.
2 5FEB2019 4FEB2019.
;
run;
proc sql;
create table want(drop=diff) as
select distinct *
from
(select *, abs(event_date-record_date) as diff
from have
group by id,event_date
having min(diff)=diff)
group by id,event_date,diff
having min(record_date)=record_date;
quit;
data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019.
1 28JAN2019 26JAN2019.
1 28JAN2019 3FEB2019.
1 28JAN2019 4FEB2019.
1 3FEB2019 24JAN2019.
1 3FEB2019 26JAN2019.
1 3FEB2019 3FEB2019.
1 3FEB2019 4FEB2019.
2 31JAN2019 28JAN2019.
2 31JAN2019 1FEB2019.
2 31JAN2019 3FEB2019.
2 31JAN2019 4FEB2019.
2 5FEB2019 28JAN2019.
2 5FEB2019 1FEB2019.
2 5FEB2019 3FEB2019.
2 5FEB2019 4FEB2019.
2 5FEB2019 6FEB2019. /*added one more to include same diff*/
;
run;
data want;
do _n_=1 by 1 until(last.event_date);
set have;
by id event_date;
diff=abs(event_date-record_date);
if min>. and diff>=min then continue;
min=diff;
k=_n_;
end;
do _n_=1 to _n_;
set have;
if _n_=k then output;
end;
drop k min diff;
run;
Well if your version support CUROBS option in set statement, you could get a little aggressive rather than defensive -
data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019.
1 28JAN2019 26JAN2019.
1 28JAN2019 3FEB2019.
1 28JAN2019 4FEB2019.
1 3FEB2019 24JAN2019.
1 3FEB2019 26JAN2019.
1 3FEB2019 3FEB2019.
1 3FEB2019 4FEB2019.
2 31JAN2019 28JAN2019.
2 31JAN2019 1FEB2019.
2 31JAN2019 3FEB2019.
2 31JAN2019 4FEB2019.
2 5FEB2019 28JAN2019.
2 5FEB2019 1FEB2019.
2 5FEB2019 3FEB2019.
2 5FEB2019 4FEB2019.
2 5FEB2019 6FEB2019. /*added one more to include same diff*/
;
run;
data want;
do until(last.event_date);
set have curobs=_n_;
by id event_date;
diff=abs(event_date-record_date);
if min>. and diff>=min then continue;
min=diff;
k=_n_;
end;
set have point=k;
drop k min diff;
run;
You could even play with Hash contents for fun and to familiarize ways to park and pick object notations of hash methods-
data have;
input id (event_date record_date) (:date9.);
format event_date record_date date9.;
datalines;
1 28JAN2019 24JAN2019.
1 28JAN2019 26JAN2019.
1 28JAN2019 3FEB2019.
1 28JAN2019 4FEB2019.
1 3FEB2019 24JAN2019.
1 3FEB2019 26JAN2019.
1 3FEB2019 3FEB2019.
1 3FEB2019 4FEB2019.
2 31JAN2019 28JAN2019.
2 31JAN2019 1FEB2019.
2 31JAN2019 3FEB2019.
2 31JAN2019 4FEB2019.
2 5FEB2019 28JAN2019.
2 5FEB2019 1FEB2019.
2 5FEB2019 3FEB2019.
2 5FEB2019 4FEB2019.
2 5FEB2019 6FEB2019. /*added one more to include same diff*/
;
run;
data want;
if _n_=1 then do;
dcl hash h(dataset:'have(obs=0)');
h.definekey('id');
h.definedata(all:'y');
h.definedone();
end;
do _n_=h.clear() by 0 until(last.event_date);
set have;
by id event_date;
diff=abs(event_date-record_date);
if min>. and diff>=min then continue;
min=diff;
h.replace();
end;
h.find();
drop min diff;
run;
If the dataset is sorted (or at least grouped) by id and event_date, do this:
data want;
dist = 99999;
do until (last.event_date);
set have;
by id event_date;
if abs(event_date - record_date) lt dist
then do;
dist = abs(event_date - record_date);
rd = record_date;
end;
end;
record_date = rd;
drop rd;
run;
Thank you for your solutions! The issue has been solved, this has been very helpful.
This program assume data are sorted by ID/EVENT_DATE/RECORD_DATE:
Once the data step has reached either then end of an ID/EVENT_DATE group, or has a RECORD_DATE>EVENT_DATE, ;you merely have to decide whether the current record is closer than the prior record. If the prior record is close, re-read it.
This allows a way to read in ALL the variables that are associated with the closest record_date.
data want (drop=_:);
set have ;
by id event_date;
_prior_date=ifn(first.event_date=1,'15oct1582'd,lag(record_date));
/* Find obs after which no future obs can qualify ... */
if ((_prior_date<event_date) and
(event_date<=record_date or last.event_date))
or
(first.event_date=1 and event_date<=record_date);
/* Decide whether to retrieve prior record */
if (event_date-_prior_date)<abs(record_date-event_date) then do;
p=_n_-1;
set have point=p;
end;
run;
The _prior_date var is set to 15oct1582 (beginning of Gregorian calendar) every time a new event_date is encountered. This will prevent the program from reaching back to the prior event.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.