I have 2 datasets that I have to join based on same account key and nearest timestamp from same date to obtain the transaction ID from the larger dataset. Table A only has about a million records, but Table B has 1.5 billion. I've attempted the below code format and error out due to inadequate session memory. This occurs even if I invoke max MEMSIZE at session launch. There are multiple instances of timestamps per account and the timestamps are off by a few seconds between each dataset, so I can't do an exact join on account and timestamps. Is there a more efficient way to accomplish this?
proc sql;
create table timematch
as select a*, b.transactionid
, a.datetime - b.datetime as seconds
, abs(calculated seconds) as distance
from tablea a
left join tableb b
on a.account=b.account
and a.date=datepart(b.datetime)
group by a.account, a.date, a.datetime
having min(calculated distance) = calculated distance;
quit;
I would suggest that you select a max value of time stamp difference and modify your query as follows.
%let tolerance = ;/* select the value you can tolerate*/
proc sql;
create table timematch as
select a*, b.transactionid,
abs( timepart(a.datetime) - timepart(b.datetime) ) as distance
from table a left join table b
on a.account=b.account
and datepart(a.datetime)=datepart(b.datetime)
group by a.account, a.date, a.datetime
having calculated distance) le &tolerance;
quit;
There is probably a better way that will increase performance and save time. Are you able to post some example data that resembles your actual data?
Below are a few sample records from each table. Transaction_Amts between the 2 datasets do not line up due to conversion rates and other factors in creation of the data in table A. Timestamps may be off
Table A Sample:
ACCOUNT_ID | TRANSACTION_DATE | DATE_TIME | TRANSACTION_AMT |
0000165033E8A3EFC15 | 06/12/2021 | 06Dec2021 22:57:42 | 1,599.00 |
00005D318BCCCE8AA93 | 23/12/2021 | 23Dec2021 18:54:00 | 1.00 |
0000653BAA33A87F242 | 30/10/2021 | 30Oct2021 14:27:00 | 14.90 |
00009F79D3AC781E7EE | 19/09/2021 | 19Sep2021 18:04:58 | 30.00 |
00009F79D3AC781E7EE | 19/09/2021 | 19Sep2021 18:05:58 | 30.00 |
Table B Sample
TRANSACTION_DTTM | TRANSACTION_ID | ACCOUNT_ID | TRANSACTION_AMT |
30Jul2021 21:00:16 | 6c6103ef9b044e1c | 465EE79A6412B62C825 | 0 |
30Jul2021 21:00:16 | 046103ef98044df9 | 77987B571FCF27B5633 | 0 |
30Jul2021 21:00:16 | 686103ee4404491c | A16EF924DB414AB3D3F | 17.9 |
30Jul2021 21:00:16 | 006103ef98044b13 | 1B4EE07C3778C8CB03A | 0.78 |
30Jul2021 21:00:16 | 64610435d3027617 | 22CFD191C879FE130EF | 0.38 |
30Jul2021 21:00:16 | 6c6103ef9b044e1d | 1747E9B180376F04F72 | 0 |
30Jul2021 21:00:10 | 386103eea4044cbe | 1F1EF0192FB7F1C8317 | 1.26 |
30Jul2021 21:00:10 | 6c6103ef9b044dfd | F494EED4074847E41A8 | 180.19 |
Can you put these in a datastep code?
So what does your desired result look like given this data?
I'd like a table of everything from Table A, plus the transaction ID from table B that matches on account ID and date exactly, with closest match on timestamp. Amounts are unreliable, so they're not very useful to me in joining.
See if you can use this as an example. Should scale well.
data one;
input id n1;
datalines;
1 3.4
1 7.9
1 6.1
2 4.7
2 9.2
3 6.3
3 1.1
3 8.3
;
proc plan;
factors id = 3 n2 = 5 / noprint;
output out=two;
run;quit;
data want(keep = id n2 closest);
if _N_ = 1 then do;
if 0 then set one;
dcl hash h(dataset : "one(rename = (id = _id)", ordered : "Y");
h.definekey("_id", "n1");
h.definedone();
dcl hiter i("h");
end;
set two;
_id = .;
if h.check(key : id, key : n2) = 0 then closest = n2;
else do;
h.add(key : id, key : n2, data : id, data : n2);
if i.setcur(key : id, key : n2) = 0 then
if i.prev() = 0 & id = _id then pn1 = n1;
if i.setcur(key : id, key : n2) = 0 then
if i.next() = 0 & id = _id then nn1 = n1;
if nn1 = . then idx = 1;
else if pn1 = . then idx = 2;
else idx = 1 + (n2 - pn1 > nn1 - n2);
closest = choosen(idx, pn1, nn1);
h.remove(key : id, key : n2);
end;
run;
I would suggest that you select a max value of time stamp difference and modify your query as follows.
%let tolerance = ;/* select the value you can tolerate*/
proc sql;
create table timematch as
select a*, b.transactionid,
abs( timepart(a.datetime) - timepart(b.datetime) ) as distance
from table a left join table b
on a.account=b.account
and datepart(a.datetime)=datepart(b.datetime)
group by a.account, a.date, a.datetime
having calculated distance) le &tolerance;
quit;
Hello @lujo1017
The account_id's in the two tables you have posted are different.
Therefore the joined data set will only have data from the left table.
Please put a representative data as a sas datastep and make sure the datasets have some rows which fulfill the join condition.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.