I am looking for a different way to write the code below. It runs, but very slowly, as the files are large.
It does run with the WHERE statement in a separate step, but before the WHERE the file is 774,000,000 records and the step runs for 2 hours. Any tips would be appreciated. Thanks.
FROM WORK.SELECRCS1 t1
   LEFT JOIN WORK.APPEND_TABLE_0003 t2
      ON  (t1.LEVEL3 = t2.LEVEL3) AND (t1.LEVEL2 = t2.LEVEL2)
      AND (t1.LEVEL1 = t2.LEVEL1) AND (t1.PROMO_CD = t2.PROMO_CODE)
WHERE t2.LEVEL3 IS NOT MISSING;
QUIT;
I did not sort either data set before the SQL step. One file is around 152 million records and the other is around 12 million.
I could do a sort beforehand if that would help. It already runs so long that I did not think sorting would help, but I can give that a try.
PROC SQL will probably sort behind the scenes, so sorting beforehand might save you time, especially if you need that sort order more than once when using these tables.
And if that is the case, adding an ORDER BY might slow down PROC SQL a bit, but might make the next step faster.
Also explore the TAGSORT option if memory is an issue.
Another way to speed up a sort is to put the utility location and the WORK location on separate drives. These are set when SAS starts.
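For example, a pre-sort could look like this (a minimal sketch; the BY variables are taken from the join above, and TAGSORT is only needed if work space is tight):

proc sort data=WORK.SELECRCS1 tagsort;   /* TAGSORT reduces the temporary work space the sort needs, at some speed cost */
   by LEVEL1 LEVEL2 LEVEL3 PROMO_CD;
run;
proc sort data=WORK.APPEND_TABLE_0003;
   by LEVEL1 LEVEL2 LEVEL3 PROMO_CODE;
run;
/* The utility file location is set at SAS start-up, e.g. -utilloc "D:\sasutil" in the config or command line (the path here is an assumption). */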
Another option is to create an index on the larger table. There is no need for one on the small table since you keep all its records.
This is only worthwhile if you keep a small portion of the indexed table (typically less than 5 to 15%).
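A minimal sketch of a composite index on the join keys, assuming WORK.SELECRCS1 is the larger of the two tables (the index name is arbitrary):

proc sql;
   create index joinkeys on WORK.SELECRCS1 (LEVEL1, LEVEL2, LEVEL3, PROMO_CD);
quit;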
Another way is to read the unsorted large table sequentially in a data step, and load the small table into a hash table to perform the match.
Another way is to store the source data using the SPDE engine instead of the base engine: SPDE is very good at sorting on the fly.
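A minimal sketch of moving the large table to an SPDE library (the library path is an assumption):

libname speedy spde "/sasdata/spde";   /* SPDE engine library; path is an assumption        */
data speedy.selecrcs1;                 /* store the large table with the SPDE engine        */
   set work.selecrcs1;
run;
/* SPDE can then sort implicitly when a BY statement is used on the unsorted table. */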
Sorting might be critical, but not for a SQL approach.
Here is what a different approach could look like:
data want;
   set t1 (rename=(promo_cd = promo_code) in=found_in_t1)
       t2 (in=fromt2);
   by level1 level2 level3 promo_code;
   /* within each BY group, t1 records come first, so the first record is from t1 whenever the group exists in t1 */
   if first.promo_code then keepme = found_in_t1;
   if keepme;      /* drop BY groups that do not exist in t1      */
   retain keepme;
   if fromt2;      /* keep only the records contributed by t2     */
run;
It's untested code, but its correctness relies on two points. First, both data sets have to be sorted. Second, the T2 data set has to supply the identifiers only, not additional variables. If additional variables have to come from T2, the task can still be done, but it gets a little longer.
I believe the speed here will be much faster than anything else, since there is only one SET statement. But you need to confirm the conditions we are working under: does T2 need to supply additional variables, and can we sort the data sets?
On sorted data sets, data steps are not necessarily faster than SQL.
data T1(sortedby=I);
   do I=1 to 100e6;
      output;
   end;
run;

data T2(sortedby=I);
   do I=1 to 100e6 by 10;
      output;
   end;
run;

proc sql _method;
   create table T as select T1.I from T1, T2 where T1.I=T2.I;
quit;

data T;
   merge T1 T2(in=B);
   by I;
   if B;
run;
26 proc sql _method;
27 create table T as select T1.I from T1, T2 where T1.I=T2.I;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxjm
sqxsrc( WORK.T2 )
sqxsrc( WORK.T1 )
NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: Table WORK.T created, with 10000000 rows and 1 columns.
28 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 12.88 seconds
cpu time 12.89 seconds
29 data T;
30 merge T1 T2(in=B);
31 by I;
32 if B;
33 run;
NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: There were 100000000 observations read from the data set WORK.T1.
NOTE: There were 10000000 observations read from the data set WORK.T2.
NOTE: The data set WORK.T has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 14.67 seconds
cpu time 14.67 seconds
@ChrisNZ ,
Obviously you are correct about a MERGE statement:
data T;
   merge T1 T2(in=B);
   by I;
   if B;
run;
But try it with a SET statement instead:
data T;
   set T1 (in=A) T2(in=B);
   by I;
   if first.i then keepme=A;
   retain keepme;
   if B and keepme;
run;
I would expect the time to be cut in half (roughly).
In addition, one file has unique combinations of the joining fields and the other does not, which is why the output file ends up being so large. We want those multiple matches to split credit, so that is actually OK in this scenario.
The performance of SAS WORK table joins will be largely limited by the IO performance of the SAS application server's WORK drive. It would be helpful if you could post the full SAS log of your query. If the query's CPU usage is considerably less than the elapsed time, then you can be pretty sure it is IO-constrained. If you are processing a lot of character columns, then setting OPTIONS COMPRESS=YES; may speed things up: it compresses your tables so they take up less space, which improves IO.
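A minimal sketch of those settings; FULLSTIMER is not mentioned above, but it is a standard option that adds fuller resource statistics to the log for exactly this kind of real-time versus CPU-time comparison:

options compress=yes;   /* compress data sets; mainly helps when there are many character columns */
options fullstimer;     /* write detailed real time, CPU time and memory statistics to the log    */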
Try this:
data want;
   set selecrcs1;
   if _n_ = 1 then do;
      /* load the small table into a hash object keyed on the join variables */
      declare hash t2 (dataset:"append_table_0003 (rename=(promo_code=promo_cd))");
      t2.definekey("level1","level2","level3","promo_cd");
      t2.definedone();
   end;
   if t2.check() = 0;   /* keep only the large-table records that have a match */
run;
It might fail with a memory overflow because of the size of the second table, but it's worth a try.
That's the best option if the small table is only used for subsetting the large one, or if the amount of data retrieved is small (up to the memory available) as none of the tables need sorting.
Note that if you only subset, you should specify a data portion for the hash table of 8 bytes maximum in order to reduce memory footprint.
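For example, the declaration above could name a single 8-byte numeric variable as the data portion (a sketch, assuming LEVEL1 is numeric; without DEFINEDATA, the key variables also become the data portion):

declare hash t2 (dataset:"append_table_0003 (rename=(promo_code=promo_cd))");
t2.definekey("level1","level2","level3","promo_cd");
t2.definedata("level1");   /* small 8-byte data portion; assumes LEVEL1 is numeric */
t2.definedone();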
@ChrisNZ wrote:
Note that if you only subset, you should specify a data portion for the hash table of 8 bytes maximum in order to reduce memory footprint.
That is an important tip. I will test that today on the AIX platform, just to see if that effect spans across operating systems.
Did the test; there is a noticeable reduction in memory footprint when I define one of the key variables that has a length of 8 as data.
Yes, memory footprint goes up in increments of 16 bytes per item, so one byte too many and that's 16 bytes added.