BookmarkSubscribeRSS Feed
cbrotz
Pyrite | Level 9

I am looking for a different way to write the code below.  It does run but very slow as the files are large.

It does run with the where statement in a separate step but before the where, the file is 774,000,000 records and the step runs for 2 hours. Any tips would be appreciated.  Thanks

 

FROM WORK.SELECRCS1 t1 LEFT JOIN WORK.APPEND_TABLE_0003 t2 ON (t1.LEVEL3 = t2.LEVEL3) AND (t1.LEVEL2 = t2.LEVEL2)
AND (t1.LEVEL1 = t2.LEVEL1) AND (t1.PROMO_CD = t2.PROMO_CODE)
WHERE t2.LEVEL3 IS NOT MISSING;
QUIT;

14 REPLIES 14
mkeintz
PROC Star

 

 

  1. Are one or both of the datasets sorted on some combination of LEVEL_1/LEVEL_2/LEVEL_3/ (PROMO_CD or PROMO_CODE)?

  2. Do either of the datasets have only unique combinations of  LEVEL_1/LEVEL_2/LEVEL_3/ (PROMO_CD or PROMO_CODE)?

  3. What are the relative sizes of the two datasets?

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
cbrotz
Pyrite | Level 9

I did not sort either data set before the SQL step.  One file is around 152 million and the other is around 12 million

I could do a sort beforehand if that would help.  It already runs so long I did not think sorting would help but I can give that a try?

ChrisNZ
Tourmaline | Level 20

proc sql will probably sort behind the scene, so sorting before hand might save you time, especially if you need that sort order more than once when using these tables.

And if that the case, adding order by  might slow down proc sort a bit, but might make the next step faster. 

Also explore the tagsort option if memory is an issue.

 

Another way to speed a sort is to put the utility location and the work locations on separate drives. These are set when SAS starts.

 

Another option is to create an index on the larger table. No need for the small table since you keep all the records.

This is only good if you keep a small portion of the indexed table (typically less than 5 to 15%).

 

Another way is to read the unsorted large table sequentially in a data step, and load the small table in a hash table to perform the match.

 

Another way is to store the source data using the SPDE engine instead of the base engine: SPDE is very good at sorting on the fly.

Astounding
PROC Star

Sorting might be critical, but not for a SQL approach. 

 

Here is what a different approach could look like:

data want;
   set t1 (rename=(promo_cd = promo_code) in=found_in_t1) t2 (in=fromt2);
   by level1 level2 level3 promo_code;
   if first.promo_code then keepme = found_in_t1;
   if keepme;
   retain keepme;
   if fromt2;
run;

   

It's untested code, but its accuracy relies on two points.  First, the data sets have to be sorted (both of them).  Second, the T2 data set has to be supplying the identifiers only not additional variables.  If additional variables are involved coming from t2, the task can still be done but it gets a little longer.

 

I believe the speed here will be much faster than anything else, since there is only one SET statement.  But you need to confirm the conditions we are working under ... does T2 need to supply additional variables, and can we sort the data sets.

ChrisNZ
Tourmaline | Level 20

On sorted data sets, data steps are not necessarily faster than SQL.

data T1(sortedby=I);
  do I=1 to 100e6;
   output; 
  end;
run;
data T2(sortedby=I);
  do I=1 to 100e6 by 10;
   output; 
  end;
run;
proc sql _method; 
  create table T as select T1.I from T1, T2 where T1.I=T2.I;
quit;
data T;  
  merge T1 T2(in=B); 
  by I;
  if B;
run;
26         proc sql _method;
27           create table T as select T1.I from T1, T2 where T1.I=T2.I;

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxjm
              sqxsrc( WORK.T2 )
              sqxsrc( WORK.T1 )
NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: Table WORK.T created, with 10000000 rows and 1 columns.

28         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           12.88 seconds
      cpu time            12.89 seconds
      

29         data T;
30           merge T1 T2(in=B);
31           by I;
32           if B;
33         run;

NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: There were 100000000 observations read from the data set WORK.T1.
NOTE: There were 10000000 observations read from the data set WORK.T2.
NOTE: The data set WORK.T has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           14.67 seconds
      cpu time            14.67 seconds
Astounding
PROC Star

@ChrisNZ ,

 

Obviously you are correct about a MERGE statement:

data T;  
  merge T1 T2(in=B); 
  by I;
  if B;
run;

But try it with a SET statement instead:

data T;  
  set T1 (in=A) T2(in=B); 
  by I;
  if first.i then keepme=A;
  retain keepme;
  if B and keepme;
run;

I would expect the time to be cut in half (roughly).

cbrotz
Pyrite | Level 9

In addition - one file has unique combinations of the joining fields, the other does not and that is why the file ends  up being so large.  We want those multiple joins to split credit so that is actually ok in this scenario.

SASKiwi
PROC Star

The performance of SAS WORK table joins will be largely limited by the IO performance of the SAS App server WORK drive. It would be helpful if you could post the full SAS log of your query. If the query CPU usage is considerably less than the elapsed time then you can be pretty sure it is IO-constrained. If you are processing a lot of character columns, then setting this option may speed things up - OPTIONS compress = yes; - this will compress your tables so they take up less space and so improve IO. 

Kurt_Bremser
Super User

Try this:

data want;
set selecrcs1;
if _n_ = 1
then do;
  declare hash t2 (dataset:"append_table_003 (rename=(promo_code=promo_cd))");
  t2.definekey("level1","level2","level3","promo_cd");
  t2.definedone();
end;
if t2.check() = 0;
run;

It might fail with a memory overflow because of the size of the second table, but it's worth a try.

ChrisNZ
Tourmaline | Level 20

That's the best option if the small table is only used for subsetting the large one, or if the amount of data retrieved is small (up to the memory available) as none of the tables need sorting.

Note that if you only subset, you should specify a data portion for the hash table of 8 bytes maximum in order to reduce memory footprint.

 

 

Kurt_Bremser
Super User

@ChrisNZ wrote:

 

Note that if you only subset, you should specify a data portion for the hash table of 8 bytes maximum in order to reduce memory footprint.


That is an important tip. I will test that today on the AIX platfirm, just to see if that effect spans across operating systems.

cbrotz
Pyrite | Level 9
Thank you for the tips! I like the SAS code version and do not need to necessarily add any fields in this step. This takes a while to test but I will respond later today.
ChrisNZ
Tourmaline | Level 20

Yes, memory footprint goes up in increments of 16 bytes per item, so one byte too many and that's 16 bytes added. 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 4812 views
  • 8 likes
  • 6 in conversation