Re: Joining Large Tables

DiegoDiaz · Posted 12-20-2018 08:02 AM

Hi All-

I have been trying to join two large tables by a simple join but it has not been possible since the tables are huge and the joining process is taking forever. I need a table with the number of transactions and the amount by account and month level. I have transaction data since 1998.

I tried to optimize a PROC SQL and also tried hash table but i am expert using hash table so not sure what is the best way to this.

Any help is much appreciated

Thanks in advance.

Table 1: Account level data ( It has around 7 Millions of records)

Account	Date_Open	Product_Type
123	1/1/2017	CC
234	2/8/2016	DC
345	3/9/2018	CC

Table 1:Transaction Level Data (It has millions of records. may have 10 times the account table) .

Account	Transaction_ID	Transaction_Amt	Posted_Date	Tran_Code
123	1	$ 30	3/4/2000	N
123	2	$ 35	5/3/2001	A
123	3	$ 1,244	4/7/2016	C
234	4	$ 344	9/1/2011	N
234	5	$ 678	4/6/2015	A
234	6	$ 856	4/7/2015	A
345	7	$ 332	9/30/2000	C
345	8	$ 123	8/4/2005	A
345	9	$ 765	4/9/2017	C
345	10	$ 324	9/4/2018	N
345	11	$ 2,344	12/12/2018	A

DATA test(drop=rc);
if 0 then set Transaction_Table(where=(PST50_TRAN_CD_EX IN(&TRA_CD.)));
declare Hash Trans (dataset:"Transaction_Table");
rc = Trans.DefineKey ('Account');
rc = Trans.DefineData ('Posted_Date','Transaction_Amt');
rc = Trans.DefineDone ();
do until (eof) ;
set acct end = eof;
call missing(Transaction_Table);
rc = Trans.find ();
if rc=0 then output;;
end;
stop;
run;

PROC SQL;
CREATE TABLE DD_TABLE_M AS SELECT
A.account,
N(B.Posted_Date) AS NUM_TX,
SUM(B.Transaction_Amt) AS AMT_TX,
D.YR_MTH_NBR
FROM acct A, Transaction_Table B, month D
WHERE A.account = B.account and D.TM_ID = B.TM_ID and
B.PST50_TRAN_CD_EX IN (SELECT DISTINCT Tran_Code FROM Tran_Code WHERE CUS IN('A','C'));
QUIT;

Kurt_Bremser · Posted 12-20-2018 08:28 AM

Is the account table unique with regard to Account? If yes, sort both tables and do a data step merge.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

DiegoDiaz · Posted 12-20-2018 08:33 AM

Yes, Account is unique id in both table,

Data step merge option was not optimal for this since i had to sort the tables before and i took even more time,

Thanks

Kurt_Bremser · Posted 12-20-2018 09:21 AM

There are at least three occurences of "123" in your transaction dataset, so it surely is not unique there.

By "unique" I mean that a proc sort nodupkey of the account table by account will not delete any observations.

The fastest method of joining that is not done with in-memory techniques is sort and data step merge, in my experience.

If you want to go down the hash path, I suggest to load the smaller table into hash:

data account;
input Account :$3. Date_Open :mmddyy10. Product_Type :$2.;
format Date_Open yymmddd10.;
datalines;
123 1/1/2017 CC
234 2/8/2016 DC
345 3/9/2018 CC
;
run;

data transaction;
input Account :$3. Transaction_ID Transaction_Amt :comma. Posted_Date :mmddyy10. Tran_Code :$1.;
format Posted_Date yymmddd10.;
datalines;
123 1 30 3/4/2000 N
123 2 35 5/3/2001 A
123 3 1,244 4/7/2016 C
234 4 344 9/1/2011 N
234 5 678 4/6/2015 A
234 6 856 4/7/2015 A
345 7 332 9/30/2000 C
345 8 123 8/4/2005 A
345 9 765 4/9/2017 C
345 10 324 9/4/2018 N
345 11 2,344 12/12/2018 A
;
run;

data want;
set transaction;
if _n_ = 0 then set account; /* initializes the structure */
if _n_ = 1
then do;
  declare hash acc (dataset:'account');
  acc.definekey("account");
  acc.definedata("date_open","product_type");
  acc.definedone();
end;
if not acc.find() then output;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Joining Large Tables