Solved: Re: Join 2 tables based on unique ID and specific date criteria

WAL83 · Posted 11-30-2016 12:51 PM

Hello,

I want to join 2 tables by ID and Date such that table 1 Date is matched to the 3 prior consecutive table 2 Dates.

The TIME variable in the below example table is calculated as (TIME=t1.DATE-t2.DATE) and I'm only interested when TIME is 1, 2 and 3. See table below for example data. How can I do this either using proc sql or data steps?

t1.ID	t1.DATE	t2.DATE	TIME	KEEP
1	9/30/2014	9/29/2014	1	YES
1	9/30/2014	9/28/2014	2	YES
1	9/30/2014	9/27/2014	3	YES
1	10/1/2014	9/29/2014	2	NO
1	10/1/2014	9/28/2014	3	NO
2	10/1/2014	9/30/2014	1	YES
2	10/1/2014	9/29/2014	2	YES
2	10/1/2014	9/28/2014	3	YES
3	10/8/2014	10/7/2014	1	NO
3	10/8/2014	10/6/2014	2	NO
3	10/10/2014	10/9/2014	1	YES
3	10/10/2014	10/8/2014	2	YES
3	10/10/2014	10/7/2014	3	YES
4	11/3/2014	11/2/2014	1	NO
4	11/3/2014	11/1/2014	2	NO
4	11/5/2014	11/4/2014	1	YES
4	11/5/2014	11/3/2014	2	YES
4	11/5/2014	11/2/2014	3	YES
4	11/7/2014	11/6/2014	1	YES
4	11/7/2014	11/5/2014	2	YES
4	11/7/2014	11/4/2014	3	YES

mkeintz · Posted 12-01-2016 01:37 AM

This SAS program uses a hash table to accumulate table 2 records. Then when a table 1 record is in hand and the 3 immediately preceding dates are in the hash, then retrieve from the hash and output:

Notes:

Both table 1 and table 2 (datasets T1 and T2) are assumed to be sorted by id/date.
The "set t1 t2; by id date;" statements mean that the T1 and T2 records will be interleaved by id/date. And whenever a T1 and T2 record have the same id/date, then the T1 record precedes the tied T2 record.
The hash is cleared when an ID is completely processed.
This will be faster than SQL for any sizable data set.

dm 'clear log';
data t1;
  input id date :mmddyy10.;
  format date yymmddn8.;
datalines;
1 9/30/2014 
1 10/1/2014 
2 10/1/2014 
3 10/8/2014 
3 10/10/2014
4 11/3/2014 
4 11/5/2014 
4 11/7/2014 
run;

data t2;
  input id date :mmddyy10.;
  format date yymmddn8.;
datalines;
1    9/27/2014
1    9/28/2014
1    9/29/2014
2    9/28/2014
2    9/29/2014
2    9/30/2014
3   10/06/2014
3   10/07/2014
3   10/08/2014
3   10/09/2014
4   11/01/2014
4   11/02/2014
4   11/03/2014
4   11/04/2014
4   11/05/2014
4   11/06/2014
run;

data want (drop=rc rename=(date=date2));

  if 0 then set t1 (rename=(date=date1)) t2 ; /* set the PDV*/

  if _n_=1 then do;
    declare hash h (dataset:'t2 (obs=0)');
      h.definekey('date');
      h.definedata(all:'Y');
      h.definedone();
  end;
  
  do until (last.id);
    set t1 (in=in1)  t2 (in=in2) ;
    by id date;

    if in2 then h.add();
    else if in1 then do;
      date1=date; 
      rc=h.check(key:date1-1)
        +h.check(key:date1-2)
        +h.check(key:date1-3);
      if rc=0 then do time=1 to 3;
        h.find(key:date1-time);
        output;
      end;
    end;
  end;
  rc=h.clear();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Shmuel · Posted 11-30-2016 02:34 PM

Try next code:

proc sql;

create table want as select

t.id , t1.date, t2.date

from table1 as t1

join table2 as t2

on t1.id = t2.id and

(t1.date - t2.date) in (1,2,3)

order by id;

quit;

Florent · Posted 11-30-2016 05:37 PM

Hi,

For performances reasons, I would adapt your PROC SQL as follow:

proc sql;

create table want as select

t.id , t1.date, t2.date

from table1 as t1

inner join table2 as t2

on t1.id = t2.id

where (t1.date - t2.date) in (1, 2, 3)

order by id;

quit;

WAL83 · Posted 11-30-2016 06:01 PM

Thanks for the where clause clarification. I will need that when working on larger datasets which I often do.

WAL83 · Posted 11-30-2016 05:59 PM

Hi,

This gets very close, thank you. The below code restricts to 1st, 2nd and 3rd prior record but I need to combine records only if there are 3 consecutive dates from table 2. See below table for example of records to keep (yes) and not keep (no). I apologize if I wasn't clear on my original post.

proc sql;
   create table WANT as
   select distinct t1.ID, t1.DATE, t2.STARTDATE, (t1.DATE-t2.STARTDATE) as TIME, t2.*
   from TABLE1 as t1
       join TABLE2 as t2 on (t1.ID=t2.ID) and (t1.DATE - t2.STARTDATE) in (1,2,3)
       order by ID, Date, STARTDATE desc;
quit;

ID	DATE	STARTDATE	TIME	KEEP	NOTE
2022	9/30/2014	9/29/2014	1	YES	This consecutive group contains (1, 2 & 3)
2022	9/30/2014	9/28/2014	2	YES
2022	9/30/2014	9/27/2014	3	YES
2022	10/1/2014	9/29/2014	2	NO	This consecutive group does not include (1)
2022	10/1/2014	9/28/2014	3	NO	This consecutive group does not include (1)
2091	10/1/2014	9/30/2014	1	NO	This non-consecutive group does not include (2)
2091	10/1/2014	9/28/2014	3	NO	This non-consecutive group does not include (2)
2092	10/8/2014	10/7/2014	1	NO	This consecutive group does not include (3)
2092	10/8/2014	10/6/2014	2	NO	This consecutive group does not include (3)
2092	10/10/2014	10/9/2014	1	YES	This consecutive group contains (1, 2 & 3)
2092	10/10/2014	10/8/2014	2	YES
2092	10/10/2014	10/7/2014	3	YES
2094	11/3/2014	11/2/2014	1	NO	This consecutive group does not include (3)
2094	11/3/2014	11/1/2014	2	NO	This consecutive group does not include (3)
2094	11/5/2014	11/2/2014	3	NO	This only contains (3)
2094	11/7/2014	11/6/2014	1	YES	This consecutive group contains (1, 2 & 3)
2094	11/7/2014	11/5/2014	2	YES
2094	11/7/2014	11/4/2014	3	YES
2099	12/3/2014	12/2/2014	1	NO	This consecutive group does not include (3)
2099	12/3/2014	12/1/2014	2	NO	This consecutive group does not include (3)

Shmuel · Posted 11-30-2016 11:06 PM

You can filter non consequtive IDs by counting observations per ID and saving those with count=3.

It will be easyer to be done as second step, like:

proc freq data=want;

table ID / out=temp1(keep=ID _freq_ where=(_FREQ_ = 3));

run;

data want;

merge want

temp1(in=in1);

by ID;

if in1;

run;

mkeintz · Posted 12-01-2016 01:37 AM

This SAS program uses a hash table to accumulate table 2 records. Then when a table 1 record is in hand and the 3 immediately preceding dates are in the hash, then retrieve from the hash and output:

Notes:

Both table 1 and table 2 (datasets T1 and T2) are assumed to be sorted by id/date.
The "set t1 t2; by id date;" statements mean that the T1 and T2 records will be interleaved by id/date. And whenever a T1 and T2 record have the same id/date, then the T1 record precedes the tied T2 record.
The hash is cleared when an ID is completely processed.
This will be faster than SQL for any sizable data set.

dm 'clear log';
data t1;
  input id date :mmddyy10.;
  format date yymmddn8.;
datalines;
1 9/30/2014 
1 10/1/2014 
2 10/1/2014 
3 10/8/2014 
3 10/10/2014
4 11/3/2014 
4 11/5/2014 
4 11/7/2014 
run;

data t2;
  input id date :mmddyy10.;
  format date yymmddn8.;
datalines;
1    9/27/2014
1    9/28/2014
1    9/29/2014
2    9/28/2014
2    9/29/2014
2    9/30/2014
3   10/06/2014
3   10/07/2014
3   10/08/2014
3   10/09/2014
4   11/01/2014
4   11/02/2014
4   11/03/2014
4   11/04/2014
4   11/05/2014
4   11/06/2014
run;

data want (drop=rc rename=(date=date2));

  if 0 then set t1 (rename=(date=date1)) t2 ; /* set the PDV*/

  if _n_=1 then do;
    declare hash h (dataset:'t2 (obs=0)');
      h.definekey('date');
      h.definedata(all:'Y');
      h.definedone();
  end;
  
  do until (last.id);
    set t1 (in=in1)  t2 (in=in2) ;
    by id date;

    if in2 then h.add();
    else if in1 then do;
      date1=date; 
      rc=h.check(key:date1-1)
        +h.check(key:date1-2)
        +h.check(key:date1-3);
      if rc=0 then do time=1 to 3;
        h.find(key:date1-time);
        output;
      end;
    end;
  end;
  rc=h.clear();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

WAL83 · Posted 12-01-2016 05:42 PM

Thank you for all the responses! Again, sorry for my delayed response.

After some testing I finally got your hash code to work with one small modification. The test code worked perfectly, but my actual T2 dataset had another 40+ variables so I added the below to your code. Hash tables are new to me so I'm not totally clear why I needed to add the multidata part. Thank you.

declare hash h (dataset:'t22 (obs=0)', multidata: "Y");

Florent · Posted 12-01-2016 02:15 PM

For what it's still worth, here below my small contribution (using simple PROC SQL/Datasteps):

proc sql;
     create table have as
         select  distinct t.id
         ,          t1.date
         ,          t2.date as StartDate
    from table1 as t1
    inner join table2 as t2
         on t1.id = t2.id
    where (t1.date - t2.date) in (1, 2, 3)
    order by id, date, StartDate desc;
quit;

data WANT_Selection (drop= Previous_: count StartDate Time);
	set have;
	by ID Date descending StartDate;

	retain count;

	Previous_Date = LAG1(Date);
	Previous_Time = LAG1(Time);

	if first.Date then do;
		count = 0;
		if Time = 1 then count + 1;
	end; else
	if not first.Date then do;
		if Previous_Date = Date and count < 3 then do;
			if (Previous_Time = 1 and Time = 2) or (Previous_Time = 2 and Time = 3) then count + 1;
		end;
	end;

	if last.Date and count >= 3 then output;
run;

data want;
	merge have 		   (in=inHave)
		   want_selection (in=inSelection);
	by ID Date;

	if inSelection;
run;

Ksharp · Posted 12-01-2016 05:42 AM

Once you got that table. That would be easy.


data have;
infile cards expandtabs truncover;
input ID	DATE1 : mmddyy10.	DATE2 : mmddyy10.	TIME;
format date1 date2 mmddyy10.;
n+1;
cards;
1	9/30/2014	9/29/2014	1	YES
1	9/30/2014	9/28/2014	2	YES
1	9/30/2014	9/27/2014	3	YES
1	10/1/2014	9/29/2014	2	NO
1	10/1/2014	9/28/2014	3	NO
2	10/1/2014	9/30/2014	1	YES
2	10/1/2014	9/29/2014	2	YES
2	10/1/2014	9/28/2014	3	YES
3	10/8/2014	10/7/2014	1	NO
3	10/8/2014	10/6/2014	2	NO
3	10/10/2014	10/9/2014	1	YES
3	10/10/2014	10/8/2014	2	YES
3	10/10/2014	10/7/2014	3	YES
4	11/3/2014	11/2/2014	1	NO
4	11/3/2014	11/1/2014	2	NO
4	11/5/2014	11/4/2014	1	YES
4	11/5/2014	11/3/2014	2	YES
4	11/5/2014	11/2/2014	3	YES
4	11/7/2014	11/6/2014	1	YES
4	11/7/2014	11/5/2014	2	YES
4	11/7/2014	11/4/2014	3
;
run;
data key;
 set have;
 if time=3 and
    lag(time)=2 and id=lag(id) and
    lag2(time)=1 and id=lag2(id) then do;
 output;
 n=n-1;output;
 n=n-1;output;
 end;
keep n;
run;
data want;
 if _n_=1 then do;
  if 0 then set key;
  declare hash h(dataset:'key');
  h.definekey('n');
  h.definedone();
 end;
set have;
if h.check()=0;
drop n;
run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away