Hello,
I want to join 2 tables by ID and Date such that table 1 Date is matched to the 3 prior consecutive table 2 Dates.
The TIME variable in the below example table is calculated as (TIME=t1.DATE-t2.DATE) and I'm only interested when TIME is 1, 2 and 3. See table below for example data. How can I do this either using proc sql or data steps?
t1.ID | t1.DATE | t2.DATE | TIME | KEEP |
1 | 9/30/2014 | 9/29/2014 | 1 | YES |
1 | 9/30/2014 | 9/28/2014 | 2 | YES |
1 | 9/30/2014 | 9/27/2014 | 3 | YES |
1 | 10/1/2014 | 9/29/2014 | 2 | NO |
1 | 10/1/2014 | 9/28/2014 | 3 | NO |
2 | 10/1/2014 | 9/30/2014 | 1 | YES |
2 | 10/1/2014 | 9/29/2014 | 2 | YES |
2 | 10/1/2014 | 9/28/2014 | 3 | YES |
3 | 10/8/2014 | 10/7/2014 | 1 | NO |
3 | 10/8/2014 | 10/6/2014 | 2 | NO |
3 | 10/10/2014 | 10/9/2014 | 1 | YES |
3 | 10/10/2014 | 10/8/2014 | 2 | YES |
3 | 10/10/2014 | 10/7/2014 | 3 | YES |
4 | 11/3/2014 | 11/2/2014 | 1 | NO |
4 | 11/3/2014 | 11/1/2014 | 2 | NO |
4 | 11/5/2014 | 11/4/2014 | 1 | YES |
4 | 11/5/2014 | 11/3/2014 | 2 | YES |
4 | 11/5/2014 | 11/2/2014 | 3 | YES |
4 | 11/7/2014 | 11/6/2014 | 1 | YES |
4 | 11/7/2014 | 11/5/2014 | 2 | YES |
4 | 11/7/2014 | 11/4/2014 | 3 | YES |
This SAS program uses a hash table to accumulate table 2 records. Then when a table 1 record is in hand and the 3 immediately preceding dates are in the hash, then retrieve from the hash and output:
Notes:
dm 'clear log';
data t1;
input id date :mmddyy10.;
format date yymmddn8.;
datalines;
1 9/30/2014
1 10/1/2014
2 10/1/2014
3 10/8/2014
3 10/10/2014
4 11/3/2014
4 11/5/2014
4 11/7/2014
run;
data t2;
input id date :mmddyy10.;
format date yymmddn8.;
datalines;
1 9/27/2014
1 9/28/2014
1 9/29/2014
2 9/28/2014
2 9/29/2014
2 9/30/2014
3 10/06/2014
3 10/07/2014
3 10/08/2014
3 10/09/2014
4 11/01/2014
4 11/02/2014
4 11/03/2014
4 11/04/2014
4 11/05/2014
4 11/06/2014
run;
data want (drop=rc rename=(date=date2));
if 0 then set t1 (rename=(date=date1)) t2 ; /* set the PDV*/
if _n_=1 then do;
declare hash h (dataset:'t2 (obs=0)');
h.definekey('date');
h.definedata(all:'Y');
h.definedone();
end;
do until (last.id);
set t1 (in=in1) t2 (in=in2) ;
by id date;
if in2 then h.add();
else if in1 then do;
date1=date;
rc=h.check(key:date1-1)
+h.check(key:date1-2)
+h.check(key:date1-3);
if rc=0 then do time=1 to 3;
h.find(key:date1-time);
output;
end;
end;
end;
rc=h.clear();
run;
Try next code:
proc sql;
create table want as select
t.id , t1.date, t2.date
from table1 as t1
join table2 as t2
on t1.id = t2.id and
(t1.date - t2.date) in (1,2,3)
order by id;
quit;
Hi,
For performances reasons, I would adapt your PROC SQL as follow:
proc sql;
create table want as select
t.id , t1.date, t2.date
from table1 as t1
inner join table2 as t2
on t1.id = t2.id
where (t1.date - t2.date) in (1, 2, 3)
order by id;
quit;
Thanks for the where clause clarification. I will need that when working on larger datasets which I often do.
Hi,
This gets very close, thank you. The below code restricts to 1st, 2nd and 3rd prior record but I need to combine records only if there are 3 consecutive dates from table 2. See below table for example of records to keep (yes) and not keep (no). I apologize if I wasn't clear on my original post.
proc sql;
create table WANT as
select distinct t1.ID, t1.DATE, t2.STARTDATE, (t1.DATE-t2.STARTDATE) as TIME, t2.*
from TABLE1 as t1
join TABLE2 as t2 on (t1.ID=t2.ID) and (t1.DATE - t2.STARTDATE) in (1,2,3)
order by ID, Date, STARTDATE desc;
quit;
ID | DATE | STARTDATE | TIME | KEEP | NOTE |
2022 | 9/30/2014 | 9/29/2014 | 1 | YES | This consecutive group contains (1, 2 & 3) |
2022 | 9/30/2014 | 9/28/2014 | 2 | YES | |
2022 | 9/30/2014 | 9/27/2014 | 3 | YES | |
2022 | 10/1/2014 | 9/29/2014 | 2 | NO | This consecutive group does not include (1) |
2022 | 10/1/2014 | 9/28/2014 | 3 | NO | |
2091 | 10/1/2014 | 9/30/2014 | 1 | NO | This non-consecutive group does not include (2) |
2091 | 10/1/2014 | 9/28/2014 | 3 | NO | |
2092 | 10/8/2014 | 10/7/2014 | 1 | NO | This consecutive group does not include (3) |
2092 | 10/8/2014 | 10/6/2014 | 2 | NO | |
2092 | 10/10/2014 | 10/9/2014 | 1 | YES | This consecutive group contains (1, 2 & 3) |
2092 | 10/10/2014 | 10/8/2014 | 2 | YES | |
2092 | 10/10/2014 | 10/7/2014 | 3 | YES | |
2094 | 11/3/2014 | 11/2/2014 | 1 | NO | This consecutive group does not include (3) |
2094 | 11/3/2014 | 11/1/2014 | 2 | NO | |
2094 | 11/5/2014 | 11/2/2014 | 3 | NO | This only contains (3) |
2094 | 11/7/2014 | 11/6/2014 | 1 | YES | This consecutive group contains (1, 2 & 3) |
2094 | 11/7/2014 | 11/5/2014 | 2 | YES | |
2094 | 11/7/2014 | 11/4/2014 | 3 | YES | |
2099 | 12/3/2014 | 12/2/2014 | 1 | NO | This consecutive group does not include (3) |
2099 | 12/3/2014 | 12/1/2014 | 2 | NO |
You can filter non consequtive IDs by counting observations per ID and saving those with count=3.
It will be easyer to be done as second step, like:
proc freq data=want;
table ID / out=temp1(keep=ID _freq_ where=(_FREQ_ = 3));
run;
data want;
merge want
temp1(in=in1);
by ID;
if in1;
run;
This SAS program uses a hash table to accumulate table 2 records. Then when a table 1 record is in hand and the 3 immediately preceding dates are in the hash, then retrieve from the hash and output:
Notes:
dm 'clear log';
data t1;
input id date :mmddyy10.;
format date yymmddn8.;
datalines;
1 9/30/2014
1 10/1/2014
2 10/1/2014
3 10/8/2014
3 10/10/2014
4 11/3/2014
4 11/5/2014
4 11/7/2014
run;
data t2;
input id date :mmddyy10.;
format date yymmddn8.;
datalines;
1 9/27/2014
1 9/28/2014
1 9/29/2014
2 9/28/2014
2 9/29/2014
2 9/30/2014
3 10/06/2014
3 10/07/2014
3 10/08/2014
3 10/09/2014
4 11/01/2014
4 11/02/2014
4 11/03/2014
4 11/04/2014
4 11/05/2014
4 11/06/2014
run;
data want (drop=rc rename=(date=date2));
if 0 then set t1 (rename=(date=date1)) t2 ; /* set the PDV*/
if _n_=1 then do;
declare hash h (dataset:'t2 (obs=0)');
h.definekey('date');
h.definedata(all:'Y');
h.definedone();
end;
do until (last.id);
set t1 (in=in1) t2 (in=in2) ;
by id date;
if in2 then h.add();
else if in1 then do;
date1=date;
rc=h.check(key:date1-1)
+h.check(key:date1-2)
+h.check(key:date1-3);
if rc=0 then do time=1 to 3;
h.find(key:date1-time);
output;
end;
end;
end;
rc=h.clear();
run;
Thank you for all the responses! Again, sorry for my delayed response.
After some testing I finally got your hash code to work with one small modification. The test code worked perfectly, but my actual T2 dataset had another 40+ variables so I added the below to your code. Hash tables are new to me so I'm not totally clear why I needed to add the multidata part. Thank you.
declare hash h (dataset:'t22 (obs=0)', multidata: "Y");
For what it's still worth, here below my small contribution (using simple PROC SQL/Datasteps):
proc sql;
create table have as
select distinct t.id
, t1.date
, t2.date as StartDate
from table1 as t1
inner join table2 as t2
on t1.id = t2.id
where (t1.date - t2.date) in (1, 2, 3)
order by id, date, StartDate desc;
quit;
data WANT_Selection (drop= Previous_: count StartDate Time);
set have;
by ID Date descending StartDate;
retain count;
Previous_Date = LAG1(Date);
Previous_Time = LAG1(Time);
if first.Date then do;
count = 0;
if Time = 1 then count + 1;
end; else
if not first.Date then do;
if Previous_Date = Date and count < 3 then do;
if (Previous_Time = 1 and Time = 2) or (Previous_Time = 2 and Time = 3) then count + 1;
end;
end;
if last.Date and count >= 3 then output;
run;
data want;
merge have (in=inHave)
want_selection (in=inSelection);
by ID Date;
if inSelection;
run;
Once you got that table. That would be easy. data have; infile cards expandtabs truncover; input ID DATE1 : mmddyy10. DATE2 : mmddyy10. TIME; format date1 date2 mmddyy10.; n+1; cards; 1 9/30/2014 9/29/2014 1 YES 1 9/30/2014 9/28/2014 2 YES 1 9/30/2014 9/27/2014 3 YES 1 10/1/2014 9/29/2014 2 NO 1 10/1/2014 9/28/2014 3 NO 2 10/1/2014 9/30/2014 1 YES 2 10/1/2014 9/29/2014 2 YES 2 10/1/2014 9/28/2014 3 YES 3 10/8/2014 10/7/2014 1 NO 3 10/8/2014 10/6/2014 2 NO 3 10/10/2014 10/9/2014 1 YES 3 10/10/2014 10/8/2014 2 YES 3 10/10/2014 10/7/2014 3 YES 4 11/3/2014 11/2/2014 1 NO 4 11/3/2014 11/1/2014 2 NO 4 11/5/2014 11/4/2014 1 YES 4 11/5/2014 11/3/2014 2 YES 4 11/5/2014 11/2/2014 3 YES 4 11/7/2014 11/6/2014 1 YES 4 11/7/2014 11/5/2014 2 YES 4 11/7/2014 11/4/2014 3 ; run; data key; set have; if time=3 and lag(time)=2 and id=lag(id) and lag2(time)=1 and id=lag2(id) then do; output; n=n-1;output; n=n-1;output; end; keep n; run; data want; if _n_=1 then do; if 0 then set key; declare hash h(dataset:'key'); h.definekey('n'); h.definedone(); end; set have; if h.check()=0; drop n; run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.