Re: Merge in more efficient way

Ronein · Posted 10-08-2021 04:59 AM

Hello,

In real life I am trying to join the following data sets:

data set "Have" has 2 million rows

data set "Extrainfo" has 50 million rows

This query of merging taking long long time and I want to ask for offers how to make it more efficient with lower time.

May you please also show the recommended code , thanks

Thanks

Data Have;
format date date9.;
Input CustID month date :date9. W1 W2 W3 month_H;
cards;
1 2101 15JAN2021 10 20 30 2011
1 2101 16JAN2021 11 12 13 2011
1 2102 13FEB2021 15 17 19 2012
2 2102 15FEB2021 18 16 14 2011
2 2103 03MAR2021 13 15 12 2101
3 2101 19JAN2021 11 18 15 2102
4 2103 17MAR2021 13 12 11 2101
4 2104 18APR2021 19 18 13 2102
Run;

Data ExtraInfo;
Input CustID month X1 X2 X3;
cards;
1 2011 30 40 50
1 2012 35 45 55
1 2101 25 20 30
1 2102 15 20 40
1 2103 30 23 43
1 2104 19 21 82
2 2011 12 15 17 
2 2012 32 54 28
2 2101 23 21 98
2 2102 43 54 12
2 2103 12 43 12 
2 2104 43 23 25
3 2011 23 42 98
3 2012 87 67 34
3 2101 34 42 21
3 2102 98 23 65
3 2103 23 24 26
3 2104 43 64 48
4 2011 43 54 38
4 2012 32 54 27
4 2101 23 65 28
4 2102 43 26 87
4 2103 54 87 29
4 2104 67 45 53
;
Run;


PROC SQL;
	create table  wanted   as
	select a.*,
     b.X1 as  X1_H ,
	 c.X2  as X2_C,
	 d.X3 as X3_2012 
	from  Have  as a
	left join 	ExtraInfo as b
	on a.CustID=b.CustID  and a.month_H=b.month

	left join 	ExtraInfo as c
	on a.CustID=c.CustID  and a.month=c.month

	left join 	(select * FROM ExtraInfo where month=2012) as d
	on a.CustID=d.CustID  
;
QUIT;

ChrisNZ · Posted 10-08-2021 05:49 AM

1. Are the tables sorted by CUSTID and MONTH ?

2. Have you tried hash tables?

3. The last join is more likely to use sort order written as:

left join ExtraInfo
	on a.CustID=d.CustID  and d.month=2012

High-Performance SAS Coding - Third Edition

Ronein · Posted 10-08-2021 06:10 AM

Thanks for your reply,
1. Are the tables sorted by CUSTID and MONTH ?No(tables are coming to me not sorted)
2. Have you tried hash tables? (No, may you show a code and I will check the efficient)
3-left join ExtraInfo
on a.CustID=d.CustID and d.month=2012
will it be better than the code I provided?

ChrisNZ · Posted 10-08-2021 06:20 AM

Sort the tables before merging and use the join I provided for 2012.

High-Performance SAS Coding - Third Edition

Ronein · Posted 10-09-2021 02:32 AM

As I understand from you the folowing changes will help:

1-create an index on CUSID MONTH_H

Should I create Index in both tables?

2-Sort the tables before merging.

May I ask why?

Using proc sql join no need to sort before

3-use the different code related to 2012

May you please tell If I understand correctly and why should sort before using proc sql join

ChrisNZ · Posted 10-09-2021 06:17 AM

1. MONTH_H is only in one table isn't it? How could you create the index on both?

2. Proc SQL does sort for you, but EXTRAINFO is used 3 times and proc SQL might be tempted to sort 3 times.
Also HAVE needs that sort order for the 3 joins, so you might as well do it once for all.
Same thing for the index: Since you need that sort order for the join, and since the current performance is not satisfying, add the index and be done with it.

3. I don't understand what the issue is. Just do it.
Another syntax is better (as in faster) than yours is

left join ExtraInfo(where=(month=2012))
	on a.CustID=d.CustID

It's better because you point to the EXTRAINFO table for the join (and not to a select clause) so can use index or sort order.

Last but not least: the best way to learn is to do.
Get guidance by all means, but try and try again. You have 9 replies; what have what you tried?

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 10-08-2021 06:42 AM

And create an index on CUSID MONTH_H

High-Performance SAS Coding - Third Edition

acordes · Posted 10-08-2021 05:52 AM

If you're on sas viya then proc fedsql is really fast.

here comes an example code:

proc fedsql sessref=mysession  _method exec;
create table "mkt"."testa" {options  replication=0 REPLACE=true } as select 
"T1".*, case when ("T2"."numbasti")='' then 0 else 1 end as renewed
from "dna"."from_ended" {options tableID=1} T1 left join "dna"."from_renewed" t2
on t1."numbasti"=t2."numbasti";
quit ;

Ronein · Posted 10-08-2021 06:11 AM

May you please show it (proc fedsql ) on my tables?

mkeintz · Posted 10-09-2021 12:20 AM

I suspect you can speed things up reading the small dataset (HAVE) twice, and the large dataset (EXTRAINFO) once. The strategy is to store only the needed subset of extrainfo data (x1,x2, and x3 just for the HAVE-matched) in memory-resident hash objects. Then the second reading of have retrieves data from these hash objects.

I.e.:

read have and create a hash object of its CUSTID values. Call it HAVE_CUSTID
read EXTRAINFO, and if the custid is found in HAVE_CUSTID, then
1. Add its X1 value to hash object HX!, keyed on custid and month_h (using MONTH from EXTRAINFO as MONTH_H in the object).
2. Add its X2 value to hash object HX2, keyd on custid and month.
3. If month=2012 then add X3 to hash object HX3, keyed on custid.
reread HAVE and retrieve X1 X2 and X3 from the relevant hash objects.

I assume that the CUSTIDs in HAVE are a relatively small subset of those in EXTRAINFO. If not, there may be performance benefits in using the HASHEXP option in the hash declaration statements. You could experiment with hashexp values greater than the default 8. See DECLARE Statement: Hash and Hash Iterator Objects and Run Time Effect Of Hash Object HASHEXP Argument Size

More importantly, this also assumes that the observations in EXTRAINFO are unique for each CUSTID/MONTH value pair.

Data Have;
format date date9.;
Input CustID month date :date9. W1 W2 W3 month_H;
cards;
1 2101 15JAN2021 10 20 30 2011
1 2101 16JAN2021 11 12 13 2011
1 2102 13FEB2021 15 17 19 2012
2 2102 15FEB2021 18 16 14 2011
2 2103 03MAR2021 13 15 12 2101
3 2101 19JAN2021 11 18 15 2102
4 2103 17MAR2021 13 12 11 2101
4 2104 18APR2021 19 18 13 2102
Run;



Data ExtraInfo;
Input CustID month X1 X2 X3;
cards;
1 2011 30 40 50
1 2012 35 45 55
1 2101 25 20 30
1 2102 15 20 40
1 2103 30 23 43
1 2104 19 21 82
2 2011 12 15 17 
2 2012 32 54 28
2 2101 23 21 98
2 2102 43 54 12
2 2103 12 43 12 
2 2104 43 23 25
3 2011 23 42 98
3 2012 87 67 34
3 2101 34 42 21
3 2102 98 23 65
3 2103 23 24 26
3 2104 43 64 48
4 2011 43 54 38
4 2012 32 54 27
4 2101 23 65 28
4 2102 43 26 87
4 2103 54 87 29
4 2104 67 45 53
;
Run;


data want (rename=(x1=x1_h x2=x2_c x3=x3_2012));
  if _n_=1 then do;

    /* The DATASET option makes this the first reading of HAVE*/
    declare hash have_custid (dataset:'have (keep=custid)');
      have_custid.definekey('custid');
      have_custid.definedone();

    declare hash hx1 ();
      hx1.definekey('custid','month_h');
      hx1.definedata('x1');
      hx1.definedone();
    declare hash hx2 ();
      hx2.definekey('custid','month');
      hx2.definedata('x2');
      hx2.definedone();
    declare hash hx3 ();
      hx3.definekey('custid');
      hx3.definedata('x3');
      hx3.definedone();

    /*Now read EXTRAINFO and populate HX1, HX2, and HX3 */
    do until (end_of_extra);
      set extrainfo end=end_of_extra;
      if have_custid.check()^=0 then continue; /*Skip this iteration of the loop*/
      hx1.add(key:custid,key:month,data:x1);
      hx2.add();
      if month=2012 then hx3.add();
    end;
  end;

  set have;  /*This is the second reading of HAVE*/
  call missing(x1,x2,x3);
  hx1.find();
  hx2.find();
  hx3.find();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

acordes · Posted 10-09-2021 01:59 AM

Thanks to you @mkeintz I've finally understood the hash technique!
I don't think that I'll actually need it because cas actions cas data steps are really fast, but it's nice to have it in the toolset.

Ronein · Posted 10-09-2021 06:33 AM

What is better ?
left join ExtraInfo(where=(month=2012)) on a.CustID=d.CustID

Or
left join ExtraInfo
on a.CustID=d.CustID
And a.month=2012

ChrisNZ · Posted 10-09-2021 06:21 PM

The performance should be the same. The second syntax is fully SQL compliant.

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 10-11-2021 05:01 AM

@mkeintz I would use only one hash; keeping only one search tree should reduce the amount of memory needed.

data want;
if _n_ = 1 then do;
  declare hash have_custid (dataset:'have (keep=custid)');
  have_custid.definekey('custid');
  have_custid.definedone();
  declare hash hx ();
  hx.definekey('custid','month');
  hx.definedata('x1','x2','x3');
  hx.definedone();
  do until (end_of_extra);
    set extrainfo end=end_of_extra;
    if have_custid.check() ^= 0 then continue;
    hx.add();
  end;
end;
set have;
if hx.find(key:custid,key:month_h) = 0 then x1_h = x1;
if hx.find(key:custid,key:month) = 0 then x2_c = x2;
if hx.find(key:custid,key:2012) = 0 then x3_2012 = x3;
drop x1 x2 x3;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 10-11-2021 02:33 PM

@Kurt_Bremser

I think you are right.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away