08-31-2017 03:49 PM - edited 08-31-2017 03:50 PM
I would be grateful if anyone could tell me the best and most efficient way to left join 20 same-size datasets,
each with about 2 million observations and nearly 50 variables.
I have read in some articles that hash tables are the best way to do a left join when the lookup datasets are small enough to fit in memory, but what if all of them are the same size?
08-31-2017 04:07 PM - edited 08-31-2017 04:09 PM
If they're all based on the same key, you can use a data step merge with the in= dataset option (sort the data first):

data merged;
  merge data1(in=in1) data2 data3 /* ... etc. */;
  by key;
  if in1 then output;
run;
If you have multiple keys that differ from one another, you can use proc sql with something like this:

proc sql;
  create table merged as
  select t1.*, t2.* /* ... etc. */
  from data1 as t1
  left join data2 as t2
    on t1.key1=t2.key1 and t1.key2=t2.key2;
quit;
08-31-2017 04:24 PM
08-31-2017 04:32 PM - edited 08-31-2017 04:50 PM
I agree. I think sorting would be more efficient. I'm not an expert on the performance of the procedures, but my understanding is that data step merges rely more on hard drive I/O, while SQL joins and hash objects are more dependent on memory, as they work by pulling data into RAM. One way to significantly reduce hard drive I/O is to use data step views: you can create a view over your datasets and then sort through the view without writing out any intermediate data until the final merge step. Here's some code adapted from this SAS paper:
data ALL / view=ALL;
  set A B C /* ... */;
run;
proc sort data=ALL out=ALL_sorted; /* a view cannot be sorted in place, so OUT= is needed */
  by <keys>;
run;
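Since the original question mentioned hash tables: a hash-object left join in a data step usually looks something like the sketch below. All names here (LEFT, RIGHT, KEY, VAL) are placeholders, not from this thread, and the right-hand table must fit in memory.

```sas
/* Sketch of a hash-object left join: load the right-hand table
   into memory once, then look up each left-hand row by KEY. */
data joined;
  if _n_ = 1 then do;
    if 0 then set right(keep=key val);  /* defines KEY and VAL in the PDV */
    declare hash h(dataset:'right');
    h.defineKey('key');
    h.defineData('val');
    h.defineDone();
  end;
  set left;                             /* the left table drives the join */
  if h.find() ne 0 then call missing(val); /* no match: keep row, blank VAL */
run;
```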
08-31-2017 06:35 PM
One very important question...are all your data sources regular SAS datasets? If not, ignore everything you've read here, and we're into a whole different discussion.
08-31-2017 06:51 PM
Okay, the best advice I can think of is this:
For all of your datasets that are in Oracle, use Oracle to join them using SQL, and let the DBMS worry about the most efficient way to do it. Oracle is highly tuned, and will almost certainly do a better job than anything we can suggest.
Once you have all of your Oracle tables joined into one table, copy that dataset into a SAS dataset using SQL with an ORDER BY clause. Again, you'll get the benefit of Oracle processing power, and your resulting SAS dataset will be in the correct sequence.
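As a sketch of that approach (the connection options, schema, table, and column names below are all placeholders), explicit pass-through keeps the join and the ORDER BY inside Oracle:

```sas
/* Hedged sketch: let Oracle do the joins via explicit pass-through,
   then pull the result, already sorted, into a SAS dataset. */
proc sql;
  connect to oracle (user=&user password=&pw path=&path);
  create table work.ora_joined as
  select * from connection to oracle
    ( select a.*, b.extra_col
        from schema1.tab_a a
        left join schema1.tab_b b on a.key = b.key
       order by a.key );
  disconnect from oracle;
quit;
```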
Then sort each of your other SAS datasets. In my opinion, indexes are most useful when they retrieve a very small proportion of records, like looking up a last name or a telephone number. Once a query returns 30% or more of the records, they tend to be less useful (an opinion; I haven't had a chance to benchmark this).
Once i) all of your datasets are in SAS, and ii) sorted in the correct sequence, be sure that you don't have any many-to-many key situations, and then join them using SQL. I think that the SQL syntax is clearer than the data set merge syntax, and you're less likely to have trouble.
I don't think the volumes you mention will be much of a problem. That volume is pretty much something a well-equipped PC could handle.
08-31-2017 07:41 PM
Are you sure you need to left join them? If they're all about the same size, I suspect they contain the same kind of information. Would an APPEND (stacking them) be an alternative instead, followed by a TRANSPOSE for the variables you're interested in?
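A sketch of that stack-then-transpose alternative (dataset and variable names T1-T20, KEY, and X are illustrative, not from the thread; INDSNAME= needs SAS 9.2 or later):

```sas
/* Concatenate the 20 tables with a source identifier,
   then pivot one variable of interest into wide form. */
data stacked;
  set T1-T20 indsname=src;        /* src holds e.g. "WORK.T1" */
  source = scan(src, 2, '.');     /* keep just the member name */
run;

proc sort data=stacked;
  by key;
run;

proc transpose data=stacked out=wide(drop=_name_);
  by key;
  id source;                      /* one column per source table */
  var x;
run;
```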
08-31-2017 10:47 PM
There are so many factors to consider.
Input : 20 tables, 2m observations each, 50 variables each.
So you want to end up with output: 1 table, 2 million observations, 1,000 variables?
I suppose not. Do the variables overwrite each other, or do you just keep a few from the "right" tables? How many?
Are all matches one to one (i.e. they use a unique key) ?
Are matches equi-joins (only the = sign is used for joining) ?
What's the expected match rate against the "right" tables? One percent? Ten percent? Most observations?
I'd try SQL to start with, just because it's simpler to code, and see how that goes. Use the _method option to peek under the hood.
This runs in a couple of minutes:
data T1 (keep=V1: KEY) T2 (keep=V2: KEY) T3 (keep=V3: KEY)
     T4 (keep=V4: KEY) T5 (keep=V5: KEY) T6 (keep=V6: KEY)
     T7 (keep=V7: KEY) T8 (keep=V8: KEY) T9 (keep=V9: KEY);
  retain V1_1-V1_50 V2_1-V2_50 V3_1-V3_50
         V4_1-V4_50 V5_1-V5_50 V6_1-V6_50
         V7_1-V7_50 V8_1-V8_50 V9_1-V9_50 12345;
  do I=1 to 2e6;
    KEY=int(ranuni(0)*1e12);
    output;
  end;
run;

proc sql _method;
  create table T as
  select T1.*
       , T2.V2_1, T3.V3_1, T4.V4_1, T5.V5_1
       , T6.V6_1, T7.V7_1, T8.V8_1, T9.V9_1
  from T1
  left join T2 on T1.KEY=T2.KEY
  left join T3 on T1.KEY=T3.KEY
  left join T4 on T1.KEY=T4.KEY
  left join T5 on T1.KEY=T5.KEY
  left join T6 on T1.KEY=T6.KEY
  left join T7 on T1.KEY=T7.KEY
  left join T8 on T1.KEY=T8.KEY
  left join T9 on T1.KEY=T9.KEY;
quit;
Also, if you start building intermediate tables manually, I strongly recommend that you create a SPDE library and that you do all the processing there.
libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary partsize=10G;
08-31-2017 11:59 PM
Another critical factor is whether you need some of SQL's unique features (the ability to fuzzy join, special operators like BETWEEN or sounds-like, the ability to transform the keys on the fly, etc.) or some of the data step's (known sorted order, arrays, a predictable row number, the LAG function, multiple outputs, etc.).
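For illustration only, a few of those data-step-only features in one place (dataset and variable names LEFT, RIGHT, and KEY are placeholders):

```sas
/* Sketch: by-group flags, LAG, and multiple output datasets
   on a pre-sorted merge - none of which plain SQL offers. */
data first_rows other_rows;
  merge left(in=inl) right;
  by key;
  prev_key = lag(key);          /* KEY value from the previous row */
  if inl;                       /* left join semantics             */
  if first.key then output first_rows;
  else output other_rows;
run;
```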