Hi All-
I have been trying to join two large tables by a simple join but it has not been possible since the tables are huge and the joining process is taking forever. I need a table with the number of transactions and the amount by account and month level. I have transaction data since 1998.
I tried to optimize a PROC SQL and also tried hash table but i am expert using hash table so not sure what is the best way to this.
Any help is much appreciated
Thanks in advance.
Table 1: Account level data ( It has around 7 Millions of records)
Account | Date_Open | Product_Type |
123 | 1/1/2017 | CC |
234 | 2/8/2016 | DC |
345 | 3/9/2018 | CC |
Table 1:Transaction Level Data (It has millions of records. may have 10 times the account table) .
Account | Transaction_ID | Transaction_Amt | Posted_Date | Tran_Code |
123 | 1 | $ 30 | 3/4/2000 | N |
123 | 2 | $ 35 | 5/3/2001 | A |
123 | 3 | $ 1,244 | 4/7/2016 | C |
234 | 4 | $ 344 | 9/1/2011 | N |
234 | 5 | $ 678 | 4/6/2015 | A |
234 | 6 | $ 856 | 4/7/2015 | A |
345 | 7 | $ 332 | 9/30/2000 | C |
345 | 8 | $ 123 | 8/4/2005 | A |
345 | 9 | $ 765 | 4/9/2017 | C |
345 | 10 | $ 324 | 9/4/2018 | N |
345 | 11 | $ 2,344 | 12/12/2018 | A |
DATA test(drop=rc);
if 0 then set Transaction_Table(where=(PST50_TRAN_CD_EX IN(&TRA_CD.)));
declare Hash Trans (dataset:"Transaction_Table");
rc = Trans.DefineKey ('Account');
rc = Trans.DefineData ('Posted_Date','Transaction_Amt');
rc = Trans.DefineDone ();
do until (eof) ;
set acct end = eof;
call missing(Transaction_Table);
rc = Trans.find ();
if rc=0 then output;;
end;
stop;
run;
PROC SQL;
CREATE TABLE DD_TABLE_M AS SELECT
A.account,
N(B.Posted_Date) AS NUM_TX,
SUM(B.Transaction_Amt) AS AMT_TX,
D.YR_MTH_NBR
FROM acct A, Transaction_Table B, month D
WHERE A.account = B.account and D.TM_ID = B.TM_ID and
B.PST50_TRAN_CD_EX IN (SELECT DISTINCT Tran_Code FROM Tran_Code WHERE CUS IN('A','C'));
QUIT;
Is the account table unique with regard to Account? If yes, sort both tables and do a data step merge.
Yes, Account is unique id in both table,
Data step merge option was not optimal for this since i had to sort the tables before and i took even more time,
Thanks
There are at least three occurences of "123" in your transaction dataset, so it surely is not unique there.
By "unique" I mean that a proc sort nodupkey of the account table by account will not delete any observations.
The fastest method of joining that is not done with in-memory techniques is sort and data step merge, in my experience.
If you want to go down the hash path, I suggest to load the smaller table into hash:
data account;
input Account :$3. Date_Open :mmddyy10. Product_Type :$2.;
format Date_Open yymmddd10.;
datalines;
123 1/1/2017 CC
234 2/8/2016 DC
345 3/9/2018 CC
;
run;
data transaction;
input Account :$3. Transaction_ID Transaction_Amt :comma. Posted_Date :mmddyy10. Tran_Code :$1.;
format Posted_Date yymmddd10.;
datalines;
123 1 30 3/4/2000 N
123 2 35 5/3/2001 A
123 3 1,244 4/7/2016 C
234 4 344 9/1/2011 N
234 5 678 4/6/2015 A
234 6 856 4/7/2015 A
345 7 332 9/30/2000 C
345 8 123 8/4/2005 A
345 9 765 4/9/2017 C
345 10 324 9/4/2018 N
345 11 2,344 12/12/2018 A
;
run;
data want;
set transaction;
if _n_ = 0 then set account; /* initializes the structure */
if _n_ = 1
then do;
declare hash acc (dataset:'account');
acc.definekey("account");
acc.definedata("date_open","product_type");
acc.definedone();
end;
if not acc.find() then output;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.