Re: Efficient way for same size large datasets join.

Matt3 · Posted 08-31-2017 03:49 PM

Hello,

I would be gratefull if anyone could tell me what is best and most efficient way for left join same size 20 datasets

with about 2 mln obs and nearly 50 obs in each.

I found in some articles that the best way for left join are Hash tables if the left datasets are smaller, but what if all of them are same size?

Thank you.

ebowen · Posted 08-31-2017 04:07 PM

If they're all based on the same key, you can use a data step merge with an in statement (sort the data first):

data merged;
merge data1(in=in1) data2 data3 ... etc.;
by key;
if in1 then output;
run;

If you have multiple keys that are different from one another, you can use proc sql with something like this:

proc sql:
create table merged as
select t1.*, t2.* ... etc.
from data1 as t1
left join data2 as t2 on (key1=key1 & key2=key2);
quit;

Matt3 · Posted 08-31-2017 04:16 PM

Should I create indexes?

ChrisBrooks · Posted 08-31-2017 04:24 PM

If you're using a data step merge the data sets either have to be pre-sorted by the by group variables or indexed by them. Most people (myself included) would sort rather than create indexes.

ebowen · Posted 08-31-2017 04:32 PM

I agree. I think sorting would be more efficient. I'm not an expert on the performance of the procedures, but my understanding is that data step merges rely more on I-O of your hard drive and SQL and Hashes are more dependent on memory, as they are done by pulling the data into RAM. One way to significantly reduce hard drive I-O is by using data views. You can create a data view for each of your datasets and then sort the view without outputting any data until the final merge step. Here's some code from this SAS paper:

data ALL / view=ALL;
set A B C...;
run;
proc sort data=ALL; by <keys>; run;

ChrisBrooks · Posted 08-31-2017 04:58 PM

Another advantage of sorts over indexes is that a sort can multithread whereas I don't believe index creation will.

TomKari · Posted 08-31-2017 06:35 PM

One very important question...are both your data sources regular SAS datasets? If not, ignore everything you've read here, and we're into a whole different discission.

Tom

Matt3 · Posted 08-31-2017 06:41 PM

Datasets are created in Sas and stored both in sas server library and Oracle DB.

TomKari · Posted 08-31-2017 06:51 PM

Okay, the best advice I can think of is this:

For all of your datasets that are in Oracle, use Oracle to join them using SQL, and let the DBMS worry about the most efficient way to do it. Oracle is highly tuned, and will almost certainly do a better job than anything we can suggest.

Once you have all of your Oracle tables joined into one table, copy that dataset into a SAS dataset using SQL with an ORDER BY clause. Again, you'll get the benefit of Oracle processing power, and your resulting SAS dataset will be in the correct sequence.

Then sort each of your other SAS datasets. In my opinion, indexes are most useful when they're being used to retrieve a very small proportion of records, like with a last name or a telephone number. Once you get up to 30% and above of the records in a query result, they tend to be less useful (opinion, I haven't had a chance to benchmark this.)

Once i) all of your datasets are in SAS, and ii) sorted in the correct sequence, be sure that you don't have any many-to-many key situations, and then join them using SQL. I think that the SQL syntax is clearer than the data set merge syntax, and you're less likely to have trouble.

I don't think the volumes you mention will be much of a problem. That volume is pretty much something a well equipped PC could handle.

Tom

Reeza · Posted 08-31-2017 07:41 PM

Are you sure you need to left join them? If they're all about the same size, I suspect they have the same information, would an APPEND (stack them) instead be an alternative, followed by a TRANSPOSE for the variables you're interested in?

ChrisNZ · Posted 08-31-2017 10:47 PM

There are so many factors to consider.

Input : 20 tables, 2m observations each, 50 variables each.

So you want to end up with output: 1 table, 2m observations, 1000 variables ?

I suppose not .Do the variables overwite each other? or do you just keep a few from the "left" tables? How many?

Are all matches one to one (i.e. they use a unique key) ?

Are matches equi-joins (only the = sign is used for joining) ?

What's the expected match rate versus the "right" table? one percent? 10 percent? most obs?

I'd try SQL to start with -just because it's simpler to code- and see how that goes. Use the _method option to peek under the hood.

This runs in a couple of minutes:

data  T1 (keep=V1: KEY)
      T2 (keep=V2: KEY)
      T3 (keep=V3: KEY)
      T4 (keep=V4: KEY)
      T5 (keep=V5: KEY)
      T6 (keep=V6: KEY)
      T7 (keep=V7: KEY)
      T8 (keep=V8: KEY)
      T9 (keep=V9: KEY);
  retain V1_1-V1_50 
         V2_1-V2_50 
         V3_1-V3_50 
         V4_1-V4_50 
         V5_1-V5_50 
         V6_1-V6_50 
         V7_1-V7_50 
         V8_1-V8_50 
         V9_1-V9_50  12345;
  do I=1 to 2e6;
    KEY=int(ranuni(0)*1e12);
    output;
  end; 
run;               
          
proc sql _method;
  create table T as 
  select T1.* 
        ,T2.V2_1
        ,T3.V3_1
        ,T4.V4_1
        ,T5.V5_1
        ,T6.V6_1
        ,T7.V7_1
        ,T8.V8_1
        ,T9.V9_1
  from T1
         left join 
       T2 
         on T1.KEY=T2.KEY
         left join 
       T3 
         on T1.KEY=T3.KEY
         left join 
       T4 
         on T1.KEY=T4.KEY
         left join 
       T5 
         on T1.KEY=T5.KEY 
         left join 
       T6 
         on T1.KEY=T6.KEY 
         left join 
       T7 
         on T1.KEY=T7.KEY 
         left join 
       T8 
         on T1.KEY=T8.KEY 
         left join 
       T9 
         on T1.KEY=T9.KEY ;
quit;

Also, if you start building intermediate tables manually, I strongly recommend that you create a SPDE library and that you do all the processing there.

 libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary partsize=10G;

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 08-31-2017 11:59 PM

Another critical factor is whether you need some of SQL's unique features (ability to fuzzy join, special operators like between or sounds like, ability to transform the keys on the fly, etc) or some of the data step's (known sorted order, arrays, predictable row number, lag function, multiple outputs. etc).

High-Performance SAS Coding - Third Edition

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away