BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lujo1017
Fluorite | Level 6

I have 2 datasets that I have to join based on same account key and nearest timestamp from same date to obtain the transaction ID from the larger dataset. Table A only has about a million records, but Table B has 1.5 billion. I've attempted the below code format and error out due to inadequate session memory. This occurs even if I invoke max MEMSIZE at session launch. There are multiple instances of timestamps per account and the timestamps are off by a few seconds between each dataset, so I can't do an exact join on account and timestamps. Is there a more efficient way to accomplish this? 

 

proc sql; 

 

create table timematch 

as select a*, b.transactionid

, a.datetime - b.datetime as seconds

, abs(calculated seconds) as distance 

from tablea a

   left join tableb b 

   on a.account=b.account

   and a.date=datepart(b.datetime)

group by a.account, a.date, a.datetime

having min(calculated distance) = calculated distance;

 

quit;

  

1 ACCEPTED SOLUTION

Accepted Solutions
Sajid01
Meteorite | Level 14

I would suggest that you select a max value of time stamp difference and modify your query as follows.

%let tolerance = ;/* select the value you can tolerate*/
proc sql; 
create table timematch as
 select a*, b.transactionid,
abs( timepart(a.datetime) - timepart(b.datetime) ) as distance
from table a  left join table b 
on a.account=b.account
and datepart(a.datetime)=datepart(b.datetime)
group by a.account, a.date, a.datetime
having calculated distance) le &tolerance;
quit;

View solution in original post

8 REPLIES 8
PeterClemmensen
Tourmaline | Level 20

There is probably a better way that will increase performance and save time. Are you able to post some example data that resembles your actual data?

lujo1017
Fluorite | Level 6

Below are a few sample records from each table. Transaction_Amts between the 2 datasets do not line up due to conversion rates and other factors in creation of the data in table A. Timestamps may be off

 

Table A Sample: 

 

ACCOUNT_ID

TRANSACTION_DATE

DATE_TIME

TRANSACTION_AMT

0000165033E8A3EFC15

06/12/2021

06Dec2021 22:57:42

1,599.00

00005D318BCCCE8AA93

23/12/2021

23Dec2021 18:54:00

1.00

0000653BAA33A87F242

30/10/2021

30Oct2021 14:27:00

14.90

00009F79D3AC781E7EE

19/09/2021

19Sep2021 18:04:58

30.00

00009F79D3AC781E7EE

19/09/2021

19Sep2021 18:05:58

30.00

 

Table B Sample

 

TRANSACTION_DTTM

TRANSACTION_ID

ACCOUNT_ID

TRANSACTION_AMT

30Jul2021 21:00:16

6c6103ef9b044e1c

465EE79A6412B62C825

0

30Jul2021 21:00:16

046103ef98044df9

77987B571FCF27B5633

0

30Jul2021 21:00:16

686103ee4404491c

A16EF924DB414AB3D3F

17.9

30Jul2021 21:00:16

006103ef98044b13

1B4EE07C3778C8CB03A

0.78

30Jul2021 21:00:16

64610435d3027617

22CFD191C879FE130EF

0.38

30Jul2021 21:00:16

6c6103ef9b044e1d

1747E9B180376F04F72

0

30Jul2021 21:00:10

386103eea4044cbe

1F1EF0192FB7F1C8317

1.26

30Jul2021 21:00:10

6c6103ef9b044dfd

F494EED4074847E41A8

180.19

Sajid01
Meteorite | Level 14

Can you put these in a datastep code?

PeterClemmensen
Tourmaline | Level 20

So what does your desired result look like given this data?

lujo1017
Fluorite | Level 6

I'd like a table of everything from Table A, plus the transaction ID from table B that matches on account ID and date exactly, with closest match on timestamp. Amounts are unreliable, so they're not very useful to me in joining. 

PeterClemmensen
Tourmaline | Level 20

See if you can use this as an example. Should scale well.

 

data one;
input id n1;
datalines;
1 3.4 
1 7.9 
1 6.1 
2 4.7 
2 9.2 
3 6.3 
3 1.1 
3 8.3 
;
 
proc plan;
   factors id = 3 n2 = 5 / noprint;
   output out=two;
run;quit;

data want(keep = id n2 closest);
   if _N_ = 1 then do;
      if 0 then set one;
      dcl hash h(dataset : "one(rename = (id = _id)", ordered : "Y");
      h.definekey("_id", "n1");
      h.definedone();
      dcl hiter i("h");
   end;
 
   set two;
   _id = .;
 
   if h.check(key : id, key : n2) = 0 then closest = n2;
 
   else do;
      h.add(key : id, key : n2, data : id, data : n2);
 
      if i.setcur(key : id, key : n2) = 0 then
         if i.prev() = 0 & id = _id then pn1 = n1;
 
      if i.setcur(key : id, key : n2) = 0 then
         if i.next() = 0 & id = _id then nn1 = n1;
 
      if      nn1 = . then idx = 1;
      else if pn1 = . then idx = 2;
      else idx = 1 + (n2 - pn1 > nn1 - n2);
 
      closest = choosen(idx, pn1, nn1);
 
      h.remove(key : id, key : n2);
   end;
run;
Sajid01
Meteorite | Level 14

I would suggest that you select a max value of time stamp difference and modify your query as follows.

%let tolerance = ;/* select the value you can tolerate*/
proc sql; 
create table timematch as
 select a*, b.transactionid,
abs( timepart(a.datetime) - timepart(b.datetime) ) as distance
from table a  left join table b 
on a.account=b.account
and datepart(a.datetime)=datepart(b.datetime)
group by a.account, a.date, a.datetime
having calculated distance) le &tolerance;
quit;
Sajid01
Meteorite | Level 14

Hello @lujo1017 
The account_id's in the two tables you have posted are different.
Therefore the joined data set will only have data from the left table.
Please put a representative data as a sas datastep and make sure the datasets have some rows which fulfill the join condition.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 5188 views
  • 1 like
  • 3 in conversation