BookmarkSubscribeRSS Feed
DiegoDiaz
Calcite | Level 5

Hi All-

 

I have been trying to join two large tables by a simple join but it has not been possible since the tables are huge and the joining process is taking forever. I need a table with the number of transactions and the amount by account and month level. I have transaction data since 1998.

 

I tried to optimize a PROC SQL and also tried hash table but i am expert using hash table so not sure what is the best way to this. 

 

Any help is much appreciated

 

Thanks in advance. 

 

Table 1: Account level data ( It has around 7 Millions of records)

AccountDate_OpenProduct_Type
1231/1/2017CC
2342/8/2016DC
3453/9/2018CC

 

Table 1:Transaction Level Data (It has millions of records. may have 10 times the account table) .

AccountTransaction_IDTransaction_AmtPosted_DateTran_Code
1231 $                            303/4/2000N
1232 $                            355/3/2001A
1233 $                       1,2444/7/2016C
2344 $                          3449/1/2011N
2345 $                          6784/6/2015A
2346 $                          8564/7/2015A
3457 $                          3329/30/2000C
3458 $                          1238/4/2005A
3459 $                          7654/9/2017C
34510 $                          3249/4/2018N
34511 $                       2,34412/12/2018A

 

DATA test(drop=rc);
if 0 then set Transaction_Table(where=(PST50_TRAN_CD_EX IN(&TRA_CD.)));
declare Hash Trans (dataset:"Transaction_Table");
rc = Trans.DefineKey ('Account');
rc = Trans.DefineData ('Posted_Date','Transaction_Amt');
rc = Trans.DefineDone ();
do until (eof) ;
set acct end = eof;
call missing(Transaction_Table);
rc = Trans.find ();
if rc=0 then output;;
end;
stop;
run;


PROC SQL;
CREATE TABLE DD_TABLE_M AS SELECT
A.account,
N(B.Posted_Date) AS NUM_TX,
SUM(B.Transaction_Amt) AS AMT_TX,
D.YR_MTH_NBR
FROM acct A, Transaction_Table B, month D
WHERE A.account = B.account and D.TM_ID = B.TM_ID and
B.PST50_TRAN_CD_EX IN (SELECT DISTINCT Tran_Code FROM Tran_Code WHERE CUS IN('A','C'));
QUIT;

3 REPLIES 3
DiegoDiaz
Calcite | Level 5

Yes, Account is unique id in both table,

 

Data step merge option was not optimal for this since i had to sort the tables before and i took even more time,

Thanks

Kurt_Bremser
Super User

There are at least three occurences of "123" in your transaction dataset, so it surely is not unique there.

By "unique" I mean that a proc sort nodupkey of the account table by account will not delete any observations.

 

The fastest method of joining that is not done with in-memory techniques is sort and data step merge, in my experience.

If you want to go down the hash path, I suggest to load the smaller table into hash:

data account;
input Account :$3. Date_Open :mmddyy10. Product_Type :$2.;
format Date_Open yymmddd10.;
datalines;
123 1/1/2017 CC
234 2/8/2016 DC
345 3/9/2018 CC
;
run;

data transaction;
input Account :$3. Transaction_ID Transaction_Amt :comma. Posted_Date :mmddyy10. Tran_Code :$1.;
format Posted_Date yymmddd10.;
datalines;
123 1 30 3/4/2000 N
123 2 35 5/3/2001 A
123 3 1,244 4/7/2016 C
234 4 344 9/1/2011 N
234 5 678 4/6/2015 A
234 6 856 4/7/2015 A
345 7 332 9/30/2000 C
345 8 123 8/4/2005 A
345 9 765 4/9/2017 C
345 10 324 9/4/2018 N
345 11 2,344 12/12/2018 A
;
run;

data want;
set transaction;
if _n_ = 0 then set account; /* initializes the structure */
if _n_ = 1
then do;
  declare hash acc (dataset:'account');
  acc.definekey("account");
  acc.definedata("date_open","product_type");
  acc.definedone();
end;
if not acc.find() then output;
run;