Hello,
I want to join 2 tables by ID and Date such that table 1 Date is matched to the 3 prior consecutive table 2 Dates.
The TIME variable in the below example table is calculated as (TIME=t1.DATE-t2.DATE) and I'm only interested when TIME is 1, 2 and 3. See table below for example data. How can I do this either using proc sql or data steps?
t1.ID | t1.DATE | t2.DATE | TIME | KEEP |
1 | 9/30/2014 | 9/29/2014 | 1 | YES |
1 | 9/30/2014 | 9/28/2014 | 2 | YES |
1 | 9/30/2014 | 9/27/2014 | 3 | YES |
1 | 10/1/2014 | 9/29/2014 | 2 | NO |
1 | 10/1/2014 | 9/28/2014 | 3 | NO |
2 | 10/1/2014 | 9/30/2014 | 1 | YES |
2 | 10/1/2014 | 9/29/2014 | 2 | YES |
2 | 10/1/2014 | 9/28/2014 | 3 | YES |
3 | 10/8/2014 | 10/7/2014 | 1 | NO |
3 | 10/8/2014 | 10/6/2014 | 2 | NO |
3 | 10/10/2014 | 10/9/2014 | 1 | YES |
3 | 10/10/2014 | 10/8/2014 | 2 | YES |
3 | 10/10/2014 | 10/7/2014 | 3 | YES |
4 | 11/3/2014 | 11/2/2014 | 1 | NO |
4 | 11/3/2014 | 11/1/2014 | 2 | NO |
4 | 11/5/2014 | 11/4/2014 | 1 | YES |
4 | 11/5/2014 | 11/3/2014 | 2 | YES |
4 | 11/5/2014 | 11/2/2014 | 3 | YES |
4 | 11/7/2014 | 11/6/2014 | 1 | YES |
4 | 11/7/2014 | 11/5/2014 | 2 | YES |
4 | 11/7/2014 | 11/4/2014 | 3 | YES |
This SAS program uses a hash table to accumulate table 2 records. Then when a table 1 record is in hand and the 3 immediately preceding dates are in the hash, then retrieve from the hash and output:
Notes:
dm 'clear log';
data t1;
input id date :mmddyy10.;
format date yymmddn8.;
datalines;
1 9/30/2014
1 10/1/2014
2 10/1/2014
3 10/8/2014
3 10/10/2014
4 11/3/2014
4 11/5/2014
4 11/7/2014
run;
data t2;
input id date :mmddyy10.;
format date yymmddn8.;
datalines;
1 9/27/2014
1 9/28/2014
1 9/29/2014
2 9/28/2014
2 9/29/2014
2 9/30/2014
3 10/06/2014
3 10/07/2014
3 10/08/2014
3 10/09/2014
4 11/01/2014
4 11/02/2014
4 11/03/2014
4 11/04/2014
4 11/05/2014
4 11/06/2014
run;
data want (drop=rc rename=(date=date2));
if 0 then set t1 (rename=(date=date1)) t2 ; /* set the PDV*/
if _n_=1 then do;
declare hash h (dataset:'t2 (obs=0)');
h.definekey('date');
h.definedata(all:'Y');
h.definedone();
end;
do until (last.id);
set t1 (in=in1) t2 (in=in2) ;
by id date;
if in2 then h.add();
else if in1 then do;
date1=date;
rc=h.check(key:date1-1)
+h.check(key:date1-2)
+h.check(key:date1-3);
if rc=0 then do time=1 to 3;
h.find(key:date1-time);
output;
end;
end;
end;
rc=h.clear();
run;
Try next code:
proc sql;
create table want as select
t.id , t1.date, t2.date
from table1 as t1
join table2 as t2
on t1.id = t2.id and
(t1.date - t2.date) in (1,2,3)
order by id;
quit;
Hi,
For performances reasons, I would adapt your PROC SQL as follow:
proc sql;
create table want as select
t.id , t1.date, t2.date
from table1 as t1
inner join table2 as t2
on t1.id = t2.id
where (t1.date - t2.date) in (1, 2, 3)
order by id;
quit;
Thanks for the where clause clarification. I will need that when working on larger datasets which I often do.
Hi,
This gets very close, thank you. The below code restricts to 1st, 2nd and 3rd prior record but I need to combine records only if there are 3 consecutive dates from table 2. See below table for example of records to keep (yes) and not keep (no). I apologize if I wasn't clear on my original post.
proc sql;
create table WANT as
select distinct t1.ID, t1.DATE, t2.STARTDATE, (t1.DATE-t2.STARTDATE) as TIME, t2.*
from TABLE1 as t1
join TABLE2 as t2 on (t1.ID=t2.ID) and (t1.DATE - t2.STARTDATE) in (1,2,3)
order by ID, Date, STARTDATE desc;
quit;
ID | DATE | STARTDATE | TIME | KEEP | NOTE |
2022 | 9/30/2014 | 9/29/2014 | 1 | YES | This consecutive group contains (1, 2 & 3) |
2022 | 9/30/2014 | 9/28/2014 | 2 | YES | |
2022 | 9/30/2014 | 9/27/2014 | 3 | YES | |
2022 | 10/1/2014 | 9/29/2014 | 2 | NO | This consecutive group does not include (1) |
2022 | 10/1/2014 | 9/28/2014 | 3 | NO | |
2091 | 10/1/2014 | 9/30/2014 | 1 | NO | This non-consecutive group does not include (2) |
2091 | 10/1/2014 | 9/28/2014 | 3 | NO | |
2092 | 10/8/2014 | 10/7/2014 | 1 | NO | This consecutive group does not include (3) |
2092 | 10/8/2014 | 10/6/2014 | 2 | NO | |
2092 | 10/10/2014 | 10/9/2014 | 1 | YES | This consecutive group contains (1, 2 & 3) |
2092 | 10/10/2014 | 10/8/2014 | 2 | YES | |
2092 | 10/10/2014 | 10/7/2014 | 3 | YES | |
2094 | 11/3/2014 | 11/2/2014 | 1 | NO | This consecutive group does not include (3) |
2094 | 11/3/2014 | 11/1/2014 | 2 | NO | |
2094 | 11/5/2014 | 11/2/2014 | 3 | NO | This only contains (3) |
2094 | 11/7/2014 | 11/6/2014 | 1 | YES | This consecutive group contains (1, 2 & 3) |
2094 | 11/7/2014 | 11/5/2014 | 2 | YES | |
2094 | 11/7/2014 | 11/4/2014 | 3 | YES | |
2099 | 12/3/2014 | 12/2/2014 | 1 | NO | This consecutive group does not include (3) |
2099 | 12/3/2014 | 12/1/2014 | 2 | NO |
You can filter non consequtive IDs by counting observations per ID and saving those with count=3.
It will be easyer to be done as second step, like:
proc freq data=want;
table ID / out=temp1(keep=ID _freq_ where=(_FREQ_ = 3));
run;
data want;
merge want
temp1(in=in1);
by ID;
if in1;
run;
This SAS program uses a hash table to accumulate table 2 records. Then when a table 1 record is in hand and the 3 immediately preceding dates are in the hash, then retrieve from the hash and output:
Notes:
dm 'clear log';
data t1;
input id date :mmddyy10.;
format date yymmddn8.;
datalines;
1 9/30/2014
1 10/1/2014
2 10/1/2014
3 10/8/2014
3 10/10/2014
4 11/3/2014
4 11/5/2014
4 11/7/2014
run;
data t2;
input id date :mmddyy10.;
format date yymmddn8.;
datalines;
1 9/27/2014
1 9/28/2014
1 9/29/2014
2 9/28/2014
2 9/29/2014
2 9/30/2014
3 10/06/2014
3 10/07/2014
3 10/08/2014
3 10/09/2014
4 11/01/2014
4 11/02/2014
4 11/03/2014
4 11/04/2014
4 11/05/2014
4 11/06/2014
run;
data want (drop=rc rename=(date=date2));
if 0 then set t1 (rename=(date=date1)) t2 ; /* set the PDV*/
if _n_=1 then do;
declare hash h (dataset:'t2 (obs=0)');
h.definekey('date');
h.definedata(all:'Y');
h.definedone();
end;
do until (last.id);
set t1 (in=in1) t2 (in=in2) ;
by id date;
if in2 then h.add();
else if in1 then do;
date1=date;
rc=h.check(key:date1-1)
+h.check(key:date1-2)
+h.check(key:date1-3);
if rc=0 then do time=1 to 3;
h.find(key:date1-time);
output;
end;
end;
end;
rc=h.clear();
run;
Thank you for all the responses! Again, sorry for my delayed response.
After some testing I finally got your hash code to work with one small modification. The test code worked perfectly, but my actual T2 dataset had another 40+ variables so I added the below to your code. Hash tables are new to me so I'm not totally clear why I needed to add the multidata part. Thank you.
declare hash h (dataset:'t22 (obs=0)', multidata: "Y");
For what it's still worth, here below my small contribution (using simple PROC SQL/Datasteps):
proc sql;
create table have as
select distinct t.id
, t1.date
, t2.date as StartDate
from table1 as t1
inner join table2 as t2
on t1.id = t2.id
where (t1.date - t2.date) in (1, 2, 3)
order by id, date, StartDate desc;
quit;
data WANT_Selection (drop= Previous_: count StartDate Time);
set have;
by ID Date descending StartDate;
retain count;
Previous_Date = LAG1(Date);
Previous_Time = LAG1(Time);
if first.Date then do;
count = 0;
if Time = 1 then count + 1;
end; else
if not first.Date then do;
if Previous_Date = Date and count < 3 then do;
if (Previous_Time = 1 and Time = 2) or (Previous_Time = 2 and Time = 3) then count + 1;
end;
end;
if last.Date and count >= 3 then output;
run;
data want;
merge have (in=inHave)
want_selection (in=inSelection);
by ID Date;
if inSelection;
run;
Once you got that table. That would be easy. data have; infile cards expandtabs truncover; input ID DATE1 : mmddyy10. DATE2 : mmddyy10. TIME; format date1 date2 mmddyy10.; n+1; cards; 1 9/30/2014 9/29/2014 1 YES 1 9/30/2014 9/28/2014 2 YES 1 9/30/2014 9/27/2014 3 YES 1 10/1/2014 9/29/2014 2 NO 1 10/1/2014 9/28/2014 3 NO 2 10/1/2014 9/30/2014 1 YES 2 10/1/2014 9/29/2014 2 YES 2 10/1/2014 9/28/2014 3 YES 3 10/8/2014 10/7/2014 1 NO 3 10/8/2014 10/6/2014 2 NO 3 10/10/2014 10/9/2014 1 YES 3 10/10/2014 10/8/2014 2 YES 3 10/10/2014 10/7/2014 3 YES 4 11/3/2014 11/2/2014 1 NO 4 11/3/2014 11/1/2014 2 NO 4 11/5/2014 11/4/2014 1 YES 4 11/5/2014 11/3/2014 2 YES 4 11/5/2014 11/2/2014 3 YES 4 11/7/2014 11/6/2014 1 YES 4 11/7/2014 11/5/2014 2 YES 4 11/7/2014 11/4/2014 3 ; run; data key; set have; if time=3 and lag(time)=2 and id=lag(id) and lag2(time)=1 and id=lag2(id) then do; output; n=n-1;output; n=n-1;output; end; keep n; run; data want; if _n_=1 then do; if 0 then set key; declare hash h(dataset:'key'); h.definekey('n'); h.definedone(); end; set have; if h.check()=0; drop n; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.