shl007
Obsidian | Level 7

Hello - I have a dataset with over 126 million rows. I'm running a PROC SQL join from that table to another table with far fewer rows. The PROC SQL never finishes, even with a composite index. Any ideas for improving the performance? I also looked at the page below, and nothing there seemed to help. Thanks for any tips.

 

https://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/viewer.htm#a001360977.htm

 

ACCEPTED SOLUTION
ChrisNZ
Tourmaline | Level 20

Something like:

 

data WORK.HASH2 ;
  if _N_ = 1 then do;
    /* Pull LOOKUP's variable definitions into the PDV; DATETIME is renamed so
       the lookup value does not overwrite the master table's DATETIME */
    if 0 then set LOOKUP (rename=(DATETIME=_DATETIME));
    /* Load the small LOOKUP table into a hash keyed on POLICY;
       MULTIDATA allows several lookup rows per key */
    declare hash DRGS(dataset:'LOOKUP (rename=(DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey ('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  /* Read the 126-million-row table sequentially */
  set MASTER_BKP ;
  RC = DRGS.find();
  if RC = 0 then do;
    /* Output the row when the inequality holds, then walk any further
       lookup rows with the same key */
    if DATETIME < _DATETIME then output;
    DRGS.has_next(result: R);
    do while(R ne 0);
      RC = DRGS.find_next();
      if DATETIME < _DATETIME then output;
      DRGS.has_next(result: R);
    end;
  end;
run;

or use the iterator object.
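A rough, untested sketch of that variant is below. The key POLICY has to be renamed here as well, because walking the hash repopulates every hash variable, and note that the iterator visits every LOOKUP item for each MASTER_BKP row, so it only pays off when LOOKUP is small:

data WORK.HASH2_ITER ;
  if _N_ = 1 then do;
    if 0 then set LOOKUP (rename=(POLICY=_POLICY DATETIME=_DATETIME));
    declare hash DRGS(dataset:'LOOKUP (rename=(POLICY=_POLICY DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey ('_POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
    /* the iterator walks every item stored in DRGS */
    declare hiter IT('DRGS');
  end;
  set MASTER_BKP ;
  RC = IT.first();
  do while (RC = 0);
    if POLICY = _POLICY and DATETIME < _DATETIME then output;
    RC = IT.next();
  end;
run;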

 


26 REPLIES
ballardw
Super User

It would likely help to at least include the SQL code and indicate which is the larger data set. Also, if any of the data sets involved come from another DBMS such as Oracle or DB2, you should say so and possibly include the options used for the database connection (no password or user name, but other options may be important).

 

And how long have you waited? Depending on the complexity of the query and the type of join, a couple of hours may not be excessive.

 

 

Rajesh3
Obsidian | Level 7

Hello,

 

If you are sure that you are not generating a Cartesian product, you could try an alternative such as arrays or hash objects.

 

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

 

Thanks,

Rajesh.

shl007
Obsidian | Level 7

Here is the SQL - "master_bkp" has the 126 million rows. Using proc datasets in a prior step, I had created a composite index on field1 and field2.

 

proc sql;
  create table test as
  select *
  from master_bkp a, lookup b
  where b.field1 = a.field2 and a.field1 < b.field2
  ;
quit;

Reeza
Super User

@shl007 wrote:

Here is the SQL - "master_bkp" has the 126 million rows. Using proc datasets in a prior step, I had created a composite index on field1 and field2.

 

proc sql;
  create table test as
  select *
  from master_bkp a, lookup b
  where b.field1 = a.field2 and a.field1 < b.field2
  ;
quit;


Looks like a cross join filtered in the WHERE clause rather than an explicit inner join.

None of those tables appear to be on a database server either, so is this work happening on a server or your computer? If it's a desktop with 126 million rows, I'm not surprised it doesn't finish.

 

proc sql;
  create table test as
  select *
  from master_bkp as a
  inner join lookup as b
    on a.field2 = b.field1 and a.field1 < b.field2;
quit;
shl007
Obsidian | Level 7

On the server, not on my desktop. Note: field1 & field2 are key fields in the "lookup" table. Wondering if a hash approach would help.

Reeza
Super User

Did you try the modified query?

 

Try limiting each version to 1,000,000 records with the OBS= dataset option to see which performs better, for example:
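A sketch based on the rewritten query above (OBS= caps how many rows SAS reads from the big table, and FULLSTIMER makes the log timings easier to compare):

options fullstimer;

proc sql;
  create table test_sample as
  select *
  from master_bkp(obs=1000000) as a
  inner join lookup as b
    on a.field2 = b.field1 and a.field1 < b.field2;
quit;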

 

A hash would likely work, but it's not my strong suit, so you'll have to wait for someone else to help with that.

 


@shl007 wrote:

On the server, not on my desktop. Note: field1 & field2 are key fields in the "lookup" table. Wondering if a hash approach would help.


 

shl007
Obsidian | Level 7

Tried hash and got this error - still searching for other options, but this suggests hash is not a viable solution in this case.

 

ERROR: Hash object added 2097136 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

SASKiwi
PROC Star

Are you using your lookup table as the hash object, not the master table? How many rows are in the lookup table?

shl007
Obsidian | Level 7

I'm loading the master table - the one with the 126 million rows - into the hash. Below is the hash code that generated that memory error. Again, this is running on the server. I'm checking with the server admin to see whether increasing MEMSIZE would be viable without adversely affecting anything else. Other than that, at a loss ...

 

data mastertest;
  if _n_=1 then do;
    /* NOTE: this loads the 126-million-row MASTER table into the hash,
       which is what exhausted memory */
    declare hash e(dataset: 'master');
    e.definekey('key');
    e.definedata('datetime', 'detailid', 'datasetid');
    e.definedone();
    call missing(datetime, detailid, datasetid);
  end;
  set lookup (obs=15);
  drop rc;
  rc=e.find();
  if rc=0 then key=catx(' ',datetime, detailid, datasetid);
  else key='** Not Found';
run;
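(For reference, the current memory ceiling can be displayed as shown below; MEMSIZE itself can only be changed at SAS start-up or in the server configuration, so this only reports the value the step ran into.)

proc options option=memsize;
run;

%put Current MEMSIZE: %sysfunc(getoption(memsize));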

 

ChrisNZ
Tourmaline | Level 20

You need to reverse the positions of LOOKUP and MASTER in your code.

 

Read the MASTER table sequentially with the SET statement, and load the LOOKUP table into the hash object.

mkeintz
PROC Star

If your LOOKUP dataset never has duplicates in its key (field1 renamed as _field1 below), then this may be quite simple.  Even if there are duplicates, there wouldn't be much change in the code:

 

data test (drop=_:);
  if _n_=1 then do;
    /* load the LOOKUP table into the hash, with its fields renamed */
    if 0 then set lookup (rename=(field1=_field1 field2=_field2));
    declare hash upd (dataset:'lookup (rename=(field1=_field1 field2=_field2))',hashexp:12);
      upd.definekey('_field1');
      upd.definedata(all:'Y');
      upd.definedone();
  end;
  set master_bkp;
  /* keep a master row only when the key matches and the inequality holds */
  if upd.find(key:field2)=0 and field1<_field2;
run;

I use "hashexp:12" to tell sas to use up to 4096 (=2**12)  "buckets" for hash object UPD.  The default is hashexp:8.  The maximum is hashexp:20.

 

I rename the variables field1 and field2 in the LOOKUP data set, because otherwise a successful FIND method would overwrite the values of field1 and field2 from the master data set, which your SQL code would not do.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ChrisNZ
Tourmaline | Level 20

> Tried hash and got this error - still searching for other options, but this suggests hash is not a viable solution in this case.

 

Maybe not. Several hash merges may well be faster than a single SQL join involving a large sort or overuse of indexes.
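A sketch of what "several hash merges" could look like if the table that has to sit in the hash will not fit in one pass: split it with FIRSTOBS=/OBS=, run one merge per slice, and stack the pieces. The dataset and variable names follow the accepted solution above; the slice boundaries are placeholders and none of this is tested:

%macro merge_slice(first=, last=, out=);
  data &out;
    if _n_ = 1 then do;
      if 0 then set MASTER_BKP (rename=(DATETIME=_DATETIME));
      /* load only one slice of the big table into the hash */
      declare hash H(dataset:
        "MASTER_BKP (firstobs=&first obs=&last rename=(DATETIME=_DATETIME))",
        multidata:'yes');
      H.definekey('POLICY');
      H.definedata(all:'y');
      H.definedone();
    end;
    set LOOKUP;
    RC = H.find();
    do while (RC = 0);
      if _DATETIME < DATETIME then output; /* master date before lookup date */
      RC = H.find_next();
    end;
  run;
%mend;

%merge_slice(first=1,        last=63000000, out=PART1)
%merge_slice(first=63000001, last=max,      out=PART2)

data WANT;
  set PART1 PART2;
run;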

PGStats
Opal | Level 21

I'm not sure about this, but it might help to define your index as (Field2, Field1), in that order, on master_bkp, and as (Field1, Field2), in that order, on lookup, so that the field in your equality join condition comes first in both indexes.
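A sketch of those index definitions with PROC DATASETS (the library and index names are placeholders, and an existing composite index on the same variables may need to be deleted first):

proc datasets library=work nolist;
  modify master_bkp;
    index create m_idx = (field2 field1);
  modify lookup;
    index create l_idx = (field1 field2);
quit;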

PG
s_lassen
Meteorite | Level 14

Here is the SQL - "master_bkp" has the 126 million rows. Using proc datasets in a prior step, I had created a composite index on field1 and field2.

 

proc sql;
  create table test as
  select *
  from master_bkp a, lookup b
  where b.field1 = a.field2 and a.field1 < b.field2
  ;
quit;

 

Rather than using a composite index and SQL, I would go with a simple index on FIELD2 in master_bkp, and use a DATA step, something like:

 

proc sql;
  create index field2 on master_bkp(field2);
quit;


data want;
  set lookup;
  lookup_field2=field2; /* save the lookup value for the inequality test */
  do field2=.,field1; /* dummy missing key first, to reset the index pointer in case of duplicate keys in lookup */
    do until(0); /* do forever, or until LEAVE */
      set master_bkp(rename=(field1=master_field1)) key=field2;
      if _iorc_ then do; /* last matching row read, no more found */
        _error_=0; /* when _iorc_ is set, _error_ is also set, but we do not want an error message */
        leave;
      end;
      if master_field1<lookup_field2 then output; /* a.field1 < b.field2 from the original SQL */
    end;
  end;
run;

I am not sure SAS is that good at using composite indexes for non-equality conditions (like your "<" comparison), but this will probably work (not tested, though - no test data). You can try it out with a single observation from lookup first, to check the syntax and see how it performs.
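One way to check whether SAS actually uses an index for a given step is the MSGLEVEL system option, which writes extra INFO notes about index usage to the log:

options msglevel=i; /* the log then notes whether an index was used for WHERE or KEY= processing */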

 
