BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lujo1017
Fluorite | Level 6

I have 2 datasets that I have to join based on same account key and nearest timestamp from same date to obtain the transaction ID from the larger dataset. Table A only has about a million records, but Table B has 1.5 billion. I've attempted the below code format and error out due to inadequate session memory. This occurs even if I invoke max MEMSIZE at session launch. There are multiple instances of timestamps per account and the timestamps are off by a few seconds between each dataset, so I can't do an exact join on account and timestamps. Is there a more efficient way to accomplish this? 

 

proc sql; 

 

create table timematch 

as select a*, b.transactionid

, a.datetime - b.datetime as seconds

, abs(calculated seconds) as distance 

from tablea a

   left join tableb b 

   on a.account=b.account

   and a.date=datepart(b.datetime)

group by a.account, a.date, a.datetime

having min(calculated distance) = calculated distance;

 

quit;

  

1 ACCEPTED SOLUTION

Accepted Solutions
Sajid01
Meteorite | Level 14

I would suggest that you select a max value of time stamp difference and modify your query as follows.

%let tolerance = ;/* select the value you can tolerate*/
proc sql; 
create table timematch as
 select a*, b.transactionid,
abs( timepart(a.datetime) - timepart(b.datetime) ) as distance
from table a  left join table b 
on a.account=b.account
and datepart(a.datetime)=datepart(b.datetime)
group by a.account, a.date, a.datetime
having calculated distance) le &tolerance;
quit;

View solution in original post

8 REPLIES 8
PeterClemmensen
Tourmaline | Level 20

There is probably a better way that will increase performance and save time. Are you able to post some example data that resembles your actual data?

lujo1017
Fluorite | Level 6

Below are a few sample records from each table. Transaction_Amts between the 2 datasets do not line up due to conversion rates and other factors in creation of the data in table A. Timestamps may be off

 

Table A Sample: 

 

ACCOUNT_ID

TRANSACTION_DATE

DATE_TIME

TRANSACTION_AMT

0000165033E8A3EFC15

06/12/2021

06Dec2021 22:57:42

1,599.00

00005D318BCCCE8AA93

23/12/2021

23Dec2021 18:54:00

1.00

0000653BAA33A87F242

30/10/2021

30Oct2021 14:27:00

14.90

00009F79D3AC781E7EE

19/09/2021

19Sep2021 18:04:58

30.00

00009F79D3AC781E7EE

19/09/2021

19Sep2021 18:05:58

30.00

 

Table B Sample

 

TRANSACTION_DTTM

TRANSACTION_ID

ACCOUNT_ID

TRANSACTION_AMT

30Jul2021 21:00:16

6c6103ef9b044e1c

465EE79A6412B62C825

0

30Jul2021 21:00:16

046103ef98044df9

77987B571FCF27B5633

0

30Jul2021 21:00:16

686103ee4404491c

A16EF924DB414AB3D3F

17.9

30Jul2021 21:00:16

006103ef98044b13

1B4EE07C3778C8CB03A

0.78

30Jul2021 21:00:16

64610435d3027617

22CFD191C879FE130EF

0.38

30Jul2021 21:00:16

6c6103ef9b044e1d

1747E9B180376F04F72

0

30Jul2021 21:00:10

386103eea4044cbe

1F1EF0192FB7F1C8317

1.26

30Jul2021 21:00:10

6c6103ef9b044dfd

F494EED4074847E41A8

180.19

Sajid01
Meteorite | Level 14

Can you put these in a datastep code?

PeterClemmensen
Tourmaline | Level 20

So what does your desired result look like given this data?

lujo1017
Fluorite | Level 6

I'd like a table of everything from Table A, plus the transaction ID from table B that matches on account ID and date exactly, with closest match on timestamp. Amounts are unreliable, so they're not very useful to me in joining. 

PeterClemmensen
Tourmaline | Level 20

See if you can use this as an example. Should scale well.

 

data one;
input id n1;
datalines;
1 3.4 
1 7.9 
1 6.1 
2 4.7 
2 9.2 
3 6.3 
3 1.1 
3 8.3 
;
 
proc plan;
   factors id = 3 n2 = 5 / noprint;
   output out=two;
run;quit;

data want(keep = id n2 closest);
   if _N_ = 1 then do;
      if 0 then set one;
      dcl hash h(dataset : "one(rename = (id = _id)", ordered : "Y");
      h.definekey("_id", "n1");
      h.definedone();
      dcl hiter i("h");
   end;
 
   set two;
   _id = .;
 
   if h.check(key : id, key : n2) = 0 then closest = n2;
 
   else do;
      h.add(key : id, key : n2, data : id, data : n2);
 
      if i.setcur(key : id, key : n2) = 0 then
         if i.prev() = 0 & id = _id then pn1 = n1;
 
      if i.setcur(key : id, key : n2) = 0 then
         if i.next() = 0 & id = _id then nn1 = n1;
 
      if      nn1 = . then idx = 1;
      else if pn1 = . then idx = 2;
      else idx = 1 + (n2 - pn1 > nn1 - n2);
 
      closest = choosen(idx, pn1, nn1);
 
      h.remove(key : id, key : n2);
   end;
run;
Sajid01
Meteorite | Level 14

I would suggest that you select a max value of time stamp difference and modify your query as follows.

%let tolerance = ;/* select the value you can tolerate*/
proc sql; 
create table timematch as
 select a*, b.transactionid,
abs( timepart(a.datetime) - timepart(b.datetime) ) as distance
from table a  left join table b 
on a.account=b.account
and datepart(a.datetime)=datepart(b.datetime)
group by a.account, a.date, a.datetime
having calculated distance) le &tolerance;
quit;
Sajid01
Meteorite | Level 14

Hello @lujo1017 
The account_id's in the two tables you have posted are different.
Therefore the joined data set will only have data from the left table.
Please put a representative data as a sas datastep and make sure the datasets have some rows which fulfill the join condition.

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 5068 views
  • 1 like
  • 3 in conversation