Solved: Re: Joining on Nearest Timestamp

lujo1017 · Posted 02-20-2022 11:31 AM

I have 2 datasets that I have to join based on same account key and nearest timestamp from same date to obtain the transaction ID from the larger dataset. Table A only has about a million records, but Table B has 1.5 billion. I've attempted the below code format and error out due to inadequate session memory. This occurs even if I invoke max MEMSIZE at session launch. There are multiple instances of timestamps per account and the timestamps are off by a few seconds between each dataset, so I can't do an exact join on account and timestamps. Is there a more efficient way to accomplish this?

proc sql;

create table timematch

as select a*, b.transactionid

, a.datetime - b.datetime as seconds

, abs(calculated seconds) as distance

from tablea a

left join tableb b

on a.account=b.account

and a.date=datepart(b.datetime)

group by a.account, a.date, a.datetime

having min(calculated distance) = calculated distance;

quit;

Sajid01 · Posted 02-20-2022 02:10 PM

I would suggest that you select a max value of time stamp difference and modify your query as follows.

%let tolerance = ;/* select the value you can tolerate*/
proc sql; 
create table timematch as
 select a*, b.transactionid,
abs( timepart(a.datetime) - timepart(b.datetime) ) as distance
from table a  left join table b 
on a.account=b.account
and datepart(a.datetime)=datepart(b.datetime)
group by a.account, a.date, a.datetime
having calculated distance) le &tolerance;
quit;

View solution in original post

PeterClemmensen · Posted 02-20-2022 01:18 PM

There is probably a better way that will increase performance and save time. Are you able to post some example data that resembles your actual data?

The DATA to DATA Step Macro
Blog: SASnrd

lujo1017 · Posted 02-20-2022 02:07 PM

Below are a few sample records from each table. Transaction_Amts between the 2 datasets do not line up due to conversion rates and other factors in creation of the data in table A. Timestamps may be off

Table A Sample:

ACCOUNT_ID	TRANSACTION_DATE	DATE_TIME	TRANSACTION_AMT
0000165033E8A3EFC15	06/12/2021	06Dec2021 22:57:42	1,599.00
00005D318BCCCE8AA93	23/12/2021	23Dec2021 18:54:00	1.00
0000653BAA33A87F242	30/10/2021	30Oct2021 14:27:00	14.90
00009F79D3AC781E7EE	19/09/2021	19Sep2021 18:04:58	30.00
00009F79D3AC781E7EE	19/09/2021	19Sep2021 18:05:58	30.00

Table B Sample

TRANSACTION_DTTM	TRANSACTION_ID	ACCOUNT_ID	TRANSACTION_AMT
30Jul2021 21:00:16	6c6103ef9b044e1c	465EE79A6412B62C825	0
30Jul2021 21:00:16	046103ef98044df9	77987B571FCF27B5633	0
30Jul2021 21:00:16	686103ee4404491c	A16EF924DB414AB3D3F	17.9
30Jul2021 21:00:16	006103ef98044b13	1B4EE07C3778C8CB03A	0.78
30Jul2021 21:00:16	64610435d3027617	22CFD191C879FE130EF	0.38
30Jul2021 21:00:16	6c6103ef9b044e1d	1747E9B180376F04F72	0
30Jul2021 21:00:10	386103eea4044cbe	1F1EF0192FB7F1C8317	1.26
30Jul2021 21:00:10	6c6103ef9b044dfd	F494EED4074847E41A8	180.19

Sajid01 · Posted 02-20-2022 02:32 PM

Can you put these in a datastep code?

PeterClemmensen · Posted 02-20-2022 03:13 PM

So what does your desired result look like given this data?

The DATA to DATA Step Macro
Blog: SASnrd

lujo1017 · Posted 02-20-2022 03:33 PM

I'd like a table of everything from Table A, plus the transaction ID from table B that matches on account ID and date exactly, with closest match on timestamp. Amounts are unreliable, so they're not very useful to me in joining.

PeterClemmensen · Posted 02-20-2022 01:47 PM

See if you can use this as an example. Should scale well.

data one;
input id n1;
datalines;
1 3.4 
1 7.9 
1 6.1 
2 4.7 
2 9.2 
3 6.3 
3 1.1 
3 8.3 
;
 
proc plan;
   factors id = 3 n2 = 5 / noprint;
   output out=two;
run;quit;

data want(keep = id n2 closest);
   if _N_ = 1 then do;
      if 0 then set one;
      dcl hash h(dataset : "one(rename = (id = _id)", ordered : "Y");
      h.definekey("_id", "n1");
      h.definedone();
      dcl hiter i("h");
   end;
 
   set two;
   _id = .;
 
   if h.check(key : id, key : n2) = 0 then closest = n2;
 
   else do;
      h.add(key : id, key : n2, data : id, data : n2);
 
      if i.setcur(key : id, key : n2) = 0 then
         if i.prev() = 0 & id = _id then pn1 = n1;
 
      if i.setcur(key : id, key : n2) = 0 then
         if i.next() = 0 & id = _id then nn1 = n1;
 
      if      nn1 = . then idx = 1;
      else if pn1 = . then idx = 2;
      else idx = 1 + (n2 - pn1 > nn1 - n2);
 
      closest = choosen(idx, pn1, nn1);
 
      h.remove(key : id, key : n2);
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Sajid01 · Posted 02-20-2022 02:10 PM

I would suggest that you select a max value of time stamp difference and modify your query as follows.

%let tolerance = ;/* select the value you can tolerate*/
proc sql; 
create table timematch as
 select a*, b.transactionid,
abs( timepart(a.datetime) - timepart(b.datetime) ) as distance
from table a  left join table b 
on a.account=b.account
and datepart(a.datetime)=datepart(b.datetime)
group by a.account, a.date, a.datetime
having calculated distance) le &tolerance;
quit;

Sajid01 · Posted 02-20-2022 10:14 PM

Hello @lujo1017
The account_id's in the two tables you have posted are different.
Therefore the joined data set will only have data from the left table.
Please put a representative data as a sas datastep and make sure the datasets have some rows which fulfill the join condition.

Registration is open

SAS Training: Just a Click Away