Hello,
I would be gratefull if anyone could tell me what is best and most efficient way for left join same size 20 datasets
with about 2 mln obs and nearly 50 obs in each.
I found in some articles that the best way for left join are Hash tables if the left datasets are smaller, but what if all of them are same size?
Thank you.
If they're all based on the same key, you can use a data step merge with an in statement (sort the data first):
data merged; merge data1(in=in1) data2 data3 ... etc.; by key; if in1 then output; run;
If you have multiple keys that are different from one another, you can use proc sql with something like this:
proc sql: create table merged as select t1.*, t2.* ... etc. from data1 as t1 left join data2 as t2 on (key1=key1 & key2=key2); quit;
Should I create indexes?
I agree. I think sorting would be more efficient. I'm not an expert on the performance of the procedures, but my understanding is that data step merges rely more on I-O of your hard drive and SQL and Hashes are more dependent on memory, as they are done by pulling the data into RAM. One way to significantly reduce hard drive I-O is by using data views. You can create a data view for each of your datasets and then sort the view without outputting any data until the final merge step. Here's some code from this SAS paper:
data ALL / view=ALL; set A B C...; run; proc sort data=ALL; by <keys>; run;
Another advantage of sorts over indexes is that a sort can multithread whereas I don't believe index creation will.
One very important question...are both your data sources regular SAS datasets? If not, ignore everything you've read here, and we're into a whole different discission.
Tom
Okay, the best advice I can think of is this:
For all of your datasets that are in Oracle, use Oracle to join them using SQL, and let the DBMS worry about the most efficient way to do it. Oracle is highly tuned, and will almost certainly do a better job than anything we can suggest.
Once you have all of your Oracle tables joined into one table, copy that dataset into a SAS dataset using SQL with an ORDER BY clause. Again, you'll get the benefit of Oracle processing power, and your resulting SAS dataset will be in the correct sequence.
Then sort each of your other SAS datasets. In my opinion, indexes are most useful when they're being used to retrieve a very small proportion of records, like with a last name or a telephone number. Once you get up to 30% and above of the records in a query result, they tend to be less useful (opinion, I haven't had a chance to benchmark this.)
Once i) all of your datasets are in SAS, and ii) sorted in the correct sequence, be sure that you don't have any many-to-many key situations, and then join them using SQL. I think that the SQL syntax is clearer than the data set merge syntax, and you're less likely to have trouble.
I don't think the volumes you mention will be much of a problem. That volume is pretty much something a well equipped PC could handle.
Tom
Are you sure you need to left join them? If they're all about the same size, I suspect they have the same information, would an APPEND (stack them) instead be an alternative, followed by a TRANSPOSE for the variables you're interested in?
There are so many factors to consider.
Input : 20 tables, 2m observations each, 50 variables each.
So you want to end up with output: 1 table, 2m observations, 1000 variables ?
I suppose not .Do the variables overwite each other? or do you just keep a few from the "left" tables? How many?
Are all matches one to one (i.e. they use a unique key) ?
Are matches equi-joins (only the = sign is used for joining) ?
What's the expected match rate versus the "right" table? one percent? 10 percent? most obs?
I'd try SQL to start with -just because it's simpler to code- and see how that goes. Use the _method option to peek under the hood.
This runs in a couple of minutes:
data T1 (keep=V1: KEY)
T2 (keep=V2: KEY)
T3 (keep=V3: KEY)
T4 (keep=V4: KEY)
T5 (keep=V5: KEY)
T6 (keep=V6: KEY)
T7 (keep=V7: KEY)
T8 (keep=V8: KEY)
T9 (keep=V9: KEY);
retain V1_1-V1_50
V2_1-V2_50
V3_1-V3_50
V4_1-V4_50
V5_1-V5_50
V6_1-V6_50
V7_1-V7_50
V8_1-V8_50
V9_1-V9_50 12345;
do I=1 to 2e6;
KEY=int(ranuni(0)*1e12);
output;
end;
run;
proc sql _method;
create table T as
select T1.*
,T2.V2_1
,T3.V3_1
,T4.V4_1
,T5.V5_1
,T6.V6_1
,T7.V7_1
,T8.V8_1
,T9.V9_1
from T1
left join
T2
on T1.KEY=T2.KEY
left join
T3
on T1.KEY=T3.KEY
left join
T4
on T1.KEY=T4.KEY
left join
T5
on T1.KEY=T5.KEY
left join
T6
on T1.KEY=T6.KEY
left join
T7
on T1.KEY=T7.KEY
left join
T8
on T1.KEY=T8.KEY
left join
T9
on T1.KEY=T9.KEY ;
quit;
Also, if you start building intermediate tables manually, I strongly recommend that you create a SPDE library and that you do all the processing there.
libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary partsize=10G;
Another critical factor is whether you need some of SQL's unique features (ability to fuzzy join, special operators like between or sounds like, ability to transform the keys on the fly, etc) or some of the data step's (known sorted order, arrays, predictable row number, lag function, multiple outputs. etc).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.