Hello,
In real life I am trying to join the following data sets:
data set "Have" has 2 million rows
data set "Extrainfo" has 50 million rows
This query of merging taking long long time and I want to ask for offers how to make it more efficient with lower time.
May you please also show the recommended code , thanks
Thanks
Data Have;
format date date9.;
Input CustID month date :date9. W1 W2 W3 month_H;
cards;
1 2101 15JAN2021 10 20 30 2011
1 2101 16JAN2021 11 12 13 2011
1 2102 13FEB2021 15 17 19 2012
2 2102 15FEB2021 18 16 14 2011
2 2103 03MAR2021 13 15 12 2101
3 2101 19JAN2021 11 18 15 2102
4 2103 17MAR2021 13 12 11 2101
4 2104 18APR2021 19 18 13 2102
Run;
Data ExtraInfo;
Input CustID month X1 X2 X3;
cards;
1 2011 30 40 50
1 2012 35 45 55
1 2101 25 20 30
1 2102 15 20 40
1 2103 30 23 43
1 2104 19 21 82
2 2011 12 15 17
2 2012 32 54 28
2 2101 23 21 98
2 2102 43 54 12
2 2103 12 43 12
2 2104 43 23 25
3 2011 23 42 98
3 2012 87 67 34
3 2101 34 42 21
3 2102 98 23 65
3 2103 23 24 26
3 2104 43 64 48
4 2011 43 54 38
4 2012 32 54 27
4 2101 23 65 28
4 2102 43 26 87
4 2103 54 87 29
4 2104 67 45 53
;
Run;
PROC SQL;
create table wanted as
select a.*,
b.X1 as X1_H ,
c.X2 as X2_C,
d.X3 as X3_2012
from Have as a
left join ExtraInfo as b
on a.CustID=b.CustID and a.month_H=b.month
left join ExtraInfo as c
on a.CustID=c.CustID and a.month=c.month
left join (select * FROM ExtraInfo where month=2012) as d
on a.CustID=d.CustID
;
QUIT;
1. Are the tables sorted by CUSTID and MONTH ?
2. Have you tried hash tables?
3. The last join is more likely to use sort order written as:
left join ExtraInfo
on a.CustID=d.CustID and d.month=2012
As I understand from you the folowing changes will help:
1-create an index on CUSID MONTH_H
Should I create Index in both tables?
2-Sort the tables before merging.
May I ask why?
Using proc sql join no need to sort before
3-use the different code related to 2012
May you please tell If I understand correctly and why should sort before using proc sql join
1. MONTH_H is only in one table isn't it? How could you create the index on both?
2. Proc SQL does sort for you, but EXTRAINFO is used 3 times and proc SQL might be tempted to sort 3 times.
Also HAVE needs that sort order for the 3 joins, so you might as well do it once for all.
Same thing for the index: Since you need that sort order for the join, and since the current performance is not satisfying, add the index and be done with it.
3. I don't understand what the issue is. Just do it.
Another syntax is better (as in faster) than yours is
left join ExtraInfo(where=(month=2012))
on a.CustID=d.CustID
It's better because you point to the EXTRAINFO table for the join (and not to a select clause) so can use index or sort order.
Last but not least: the best way to learn is to do.
Get guidance by all means, but try and try again. You have 9 replies; what have what you tried?
If you're on sas viya then proc fedsql is really fast.
here comes an example code:
proc fedsql sessref=mysession _method exec;
create table "mkt"."testa" {options replication=0 REPLACE=true } as select
"T1".*, case when ("T2"."numbasti")='' then 0 else 1 end as renewed
from "dna"."from_ended" {options tableID=1} T1 left join "dna"."from_renewed" t2
on t1."numbasti"=t2."numbasti";
quit ;
May you please show it (proc fedsql ) on my tables?
I suspect you can speed things up reading the small dataset (HAVE) twice, and the large dataset (EXTRAINFO) once. The strategy is to store only the needed subset of extrainfo data (x1,x2, and x3 just for the HAVE-matched) in memory-resident hash objects. Then the second reading of have retrieves data from these hash objects.
I.e.:
I assume that the CUSTIDs in HAVE are a relatively small subset of those in EXTRAINFO. If not, there may be performance benefits in using the HASHEXP option in the hash declaration statements. You could experiment with hashexp values greater than the default 8. See DECLARE Statement: Hash and Hash Iterator Objects and Run Time Effect Of Hash Object HASHEXP Argument Size
More importantly, this also assumes that the observations in EXTRAINFO are unique for each CUSTID/MONTH value pair.
Data Have;
format date date9.;
Input CustID month date :date9. W1 W2 W3 month_H;
cards;
1 2101 15JAN2021 10 20 30 2011
1 2101 16JAN2021 11 12 13 2011
1 2102 13FEB2021 15 17 19 2012
2 2102 15FEB2021 18 16 14 2011
2 2103 03MAR2021 13 15 12 2101
3 2101 19JAN2021 11 18 15 2102
4 2103 17MAR2021 13 12 11 2101
4 2104 18APR2021 19 18 13 2102
Run;
Data ExtraInfo;
Input CustID month X1 X2 X3;
cards;
1 2011 30 40 50
1 2012 35 45 55
1 2101 25 20 30
1 2102 15 20 40
1 2103 30 23 43
1 2104 19 21 82
2 2011 12 15 17
2 2012 32 54 28
2 2101 23 21 98
2 2102 43 54 12
2 2103 12 43 12
2 2104 43 23 25
3 2011 23 42 98
3 2012 87 67 34
3 2101 34 42 21
3 2102 98 23 65
3 2103 23 24 26
3 2104 43 64 48
4 2011 43 54 38
4 2012 32 54 27
4 2101 23 65 28
4 2102 43 26 87
4 2103 54 87 29
4 2104 67 45 53
;
Run;
data want (rename=(x1=x1_h x2=x2_c x3=x3_2012));
if _n_=1 then do;
/* The DATASET option makes this the first reading of HAVE*/
declare hash have_custid (dataset:'have (keep=custid)');
have_custid.definekey('custid');
have_custid.definedone();
declare hash hx1 ();
hx1.definekey('custid','month_h');
hx1.definedata('x1');
hx1.definedone();
declare hash hx2 ();
hx2.definekey('custid','month');
hx2.definedata('x2');
hx2.definedone();
declare hash hx3 ();
hx3.definekey('custid');
hx3.definedata('x3');
hx3.definedone();
/*Now read EXTRAINFO and populate HX1, HX2, and HX3 */
do until (end_of_extra);
set extrainfo end=end_of_extra;
if have_custid.check()^=0 then continue; /*Skip this iteration of the loop*/
hx1.add(key:custid,key:month,data:x1);
hx2.add();
if month=2012 then hx3.add();
end;
end;
set have; /*This is the second reading of HAVE*/
call missing(x1,x2,x3);
hx1.find();
hx2.find();
hx3.find();
run;
The performance should be the same. The second syntax is fully SQL compliant.
@mkeintz I would use only one hash; keeping only one search tree should reduce the amount of memory needed.
data want;
if _n_ = 1 then do;
declare hash have_custid (dataset:'have (keep=custid)');
have_custid.definekey('custid');
have_custid.definedone();
declare hash hx ();
hx.definekey('custid','month');
hx.definedata('x1','x2','x3');
hx.definedone();
do until (end_of_extra);
set extrainfo end=end_of_extra;
if have_custid.check() ^= 0 then continue;
hx.add();
end;
end;
set have;
if hx.find(key:custid,key:month_h) = 0 then x1_h = x1;
if hx.find(key:custid,key:month) = 0 then x2_c = x2;
if hx.find(key:custid,key:2012) = 0 then x3_2012 = x3;
drop x1 x2 x3;
run;
I think you are right.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.