shl007
Obsidian | Level 7

Hello - I have a dataset with over 126 million rows. I'm running a PROC SQL join from that table to another table with far fewer rows. The PROC SQL never finishes, even with a composite index. Any ideas for improving the performance? I also looked at the page below, and nothing there seemed to help. Thanks for any tips.

 

https://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/viewer.htm#a001360977.htm

 

ACCEPTED SOLUTION
ChrisNZ
Tourmaline | Level 20

Something like:

 

data WORK.HASH2 ;
  if _N_ = 1 then do;
    /* Pull LOOKUP's variable definitions into the PDV; DATETIME is renamed so
       the lookup value does not overwrite the master table's DATETIME */
    if 0 then set LOOKUP (rename=(DATETIME=_DATETIME));
    /* Load the small LOOKUP table into a hash keyed on POLICY;
       MULTIDATA allows several lookup rows per key */
    declare hash DRGS(dataset:'LOOKUP (rename=(DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey ('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  /* Read the 126-million-row table sequentially */
  set MASTER_BKP ;
  RC = DRGS.find();
  if RC = 0 then do;
    /* Output the row when the inequality holds, then walk any further
       lookup rows with the same key */
    if DATETIME < _DATETIME then output;
    DRGS.has_next(result: R);
    do while(R ne 0);
      RC = DRGS.find_next();
      if DATETIME < _DATETIME then output;
      DRGS.has_next(result: R);
    end;
  end;
run;

or use the iterator object.
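A rough, untested sketch of that variant is below. The key POLICY has to be renamed here as well, because walking the hash repopulates every hash variable, and note that the iterator visits every LOOKUP item for each MASTER_BKP row, so it only pays off when LOOKUP is small:

data WORK.HASH2_ITER ;
  if _N_ = 1 then do;
    if 0 then set LOOKUP (rename=(POLICY=_POLICY DATETIME=_DATETIME));
    declare hash DRGS(dataset:'LOOKUP (rename=(POLICY=_POLICY DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey ('_POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
    /* the iterator walks every item stored in DRGS */
    declare hiter IT('DRGS');
  end;
  set MASTER_BKP ;
  RC = IT.first();
  do while (RC = 0);
    if POLICY = _POLICY and DATETIME < _DATETIME then output;
    RC = IT.next();
  end;
run;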

 


26 REPLIES
ballardw
Super User

It would likely help to at least include the SQL code and indicate which is the larger data set. Also, if any of the data sets involved come from another DBMS such as Oracle or DB2, you should say so and possibly include the options used for the database connection (no password or user name, but other options may be important).

 

And how long have you waited? Depending on the complexity of the query and the type of join, a couple of hours may not be excessive.

 

 

Rajesh3
Obsidian | Level 7

Hello,

 

If you are sure that you are not generating a Cartesian product, you could try an alternative such as arrays or hash objects.

 

http://support.sas.com/resources/papers/proceedings09/071-2009.pdf

 

Thanks,

Rajesh.

shl007
Obsidian | Level 7

Here is the SQL - "master_bkp" has the 126 million rows. Using proc datasets in a prior step, I had created a composite index on field1 and field2.

 

proc sql;
  create table test as
  select *
  from master_bkp a, lookup b
  where b.field1 = a.field2 and a.field1 < b.field2
  ;
quit;

Reeza
Super User

@shl007 wrote:

Here is the SQL - "master_bkp" has the 126 million rows. Using proc datasets in a prior step, I had created a composite index on field1 and field2.

 

proc sql;
  create table test as
  select *
  from master_bkp a, lookup b
  where b.field1 = a.field2 and a.field1 < b.field2
  ;
quit;


Looks like a cross join filtered in the WHERE clause rather than an explicit inner join.

None of those tables appear to be on a database server either, so is this work happening on a server or your computer? If it's a desktop with 126 million rows, I'm not surprised it doesn't finish.

 

proc sql;
  create table test as
  select *
  from master_bkp as a
  inner join lookup as b
    on a.field2 = b.field1 and a.field1 < b.field2;
quit;
shl007
Obsidian | Level 7

On the server, not on my desktop. Note: field1 & field2 are key fields in the "lookup" table. Wondering if a hash approach would help.

Reeza
Super User

Did you try the modified query?

 

Try limiting each version to 1,000,000 records with the OBS= dataset option to see which performs better, for example:
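A sketch based on the rewritten query above (OBS= caps how many rows SAS reads from the big table, and FULLSTIMER makes the log timings easier to compare):

options fullstimer;

proc sql;
  create table test_sample as
  select *
  from master_bkp(obs=1000000) as a
  inner join lookup as b
    on a.field2 = b.field1 and a.field1 < b.field2;
quit;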

 

A hash would likely work, but it's not my strong suit, so you'll have to wait for someone else to help with that.

 


@shl007 wrote:

On the server, not on my desktop. Note: field1 & field2 are key fields in the "lookup" table. Wondering if a hash approach would help.


 

shl007
Obsidian | Level 7

Tried hash and got this error - still searching for other options, but this suggests hash is not a viable solution in this case.

 

ERROR: Hash object added 2097136 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

ERROR: The SAS System stopped processing this step because of insufficient memory.

SASKiwi
PROC Star

Are you using your lookup table as the hash object, not the master table? How many rows are in the lookup table?

shl007
Obsidian | Level 7

I'm loading the master table - the one with the 126 million rows - into the hash. Below is the hash code that generated that memory error. Again, this is running on the server. I'm checking with the server admin to see whether increasing MEMSIZE would be viable without adversely affecting anything else. Other than that, at a loss ...

 

data mastertest;
  if _n_=1 then do;
    /* NOTE: this loads the 126-million-row MASTER table into the hash,
       which is what exhausted memory */
    declare hash e(dataset: 'master');
    e.definekey('key');
    e.definedata('datetime', 'detailid', 'datasetid');
    e.definedone();
    call missing(datetime, detailid, datasetid);
  end;
  set lookup (obs=15);
  drop rc;
  rc=e.find();
  if rc=0 then key=catx(' ',datetime, detailid, datasetid);
  else key='** Not Found';
run;
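(For reference, the current memory ceiling can be displayed as shown below; MEMSIZE itself can only be changed at SAS start-up or in the server configuration, so this only reports the value the step ran into.)

proc options option=memsize;
run;

%put Current MEMSIZE: %sysfunc(getoption(memsize));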

 

ChrisNZ
Tourmaline | Level 20

You need to reverse the positions of LOOKUP and MASTER in your code.

 

Read the MASTER table sequentially with the SET statement, and load the LOOKUP table into the hash object.

mkeintz
PROC Star

If your LOOKUP dataset never has duplicates in its key (field1 renamed as _field1 below), then this may be quite simple.  Even if there are duplicates, there wouldn't be much change in the code:

 

data test (drop=_:);
  if _n_=1 then do;
    /* load the LOOKUP table into the hash, with its fields renamed */
    if 0 then set lookup (rename=(field1=_field1 field2=_field2));
    declare hash upd (dataset:'lookup (rename=(field1=_field1 field2=_field2))',hashexp:12);
      upd.definekey('_field1');
      upd.definedata(all:'Y');
      upd.definedone();
  end;
  set master_bkp;
  /* keep a master row only when the key matches and the inequality holds */
  if upd.find(key:field2)=0 and field1<_field2;
run;

I use "hashexp:12" to tell sas to use up to 4096 (=2**12)  "buckets" for hash object UPD.  The default is hashexp:8.  The maximum is hashexp:20.

 

I rename the variables field1 and field2 in the LOOKUP data set, because otherwise a successful FIND method would overwrite the values of field1 and field2 from the master data set, which your SQL code would not do.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ChrisNZ
Tourmaline | Level 20

> Tried hash and got this error - still searching for other options, but this suggests hash is not a viable solution in this case.

 

Maybe not. Several hash merges may well be faster than a single SQL join involving a large sort or overuse of indexes.
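A sketch of what "several hash merges" could look like if the table that has to sit in the hash will not fit in one pass: split it with FIRSTOBS=/OBS=, run one merge per slice, and stack the pieces. The dataset and variable names follow the accepted solution above; the slice boundaries are placeholders and none of this is tested:

%macro merge_slice(first=, last=, out=);
  data &out;
    if _n_ = 1 then do;
      if 0 then set MASTER_BKP (rename=(DATETIME=_DATETIME));
      /* load only one slice of the big table into the hash */
      declare hash H(dataset:
        "MASTER_BKP (firstobs=&first obs=&last rename=(DATETIME=_DATETIME))",
        multidata:'yes');
      H.definekey('POLICY');
      H.definedata(all:'y');
      H.definedone();
    end;
    set LOOKUP;
    RC = H.find();
    do while (RC = 0);
      if _DATETIME < DATETIME then output; /* master date before lookup date */
      RC = H.find_next();
    end;
  run;
%mend;

%merge_slice(first=1,        last=63000000, out=PART1)
%merge_slice(first=63000001, last=max,      out=PART2)

data WANT;
  set PART1 PART2;
run;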

PGStats
Opal | Level 21

I'm not sure about this, but it might help to define your index as (Field2, Field1), in that order, on master_bkp, and as (Field1, Field2), in that order, on lookup, so that the field in your equality join condition comes first in both indexes.
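A sketch of those index definitions with PROC DATASETS (the library and index names are placeholders, and an existing composite index on the same variables may need to be deleted first):

proc datasets library=work nolist;
  modify master_bkp;
    index create m_idx = (field2 field1);
  modify lookup;
    index create l_idx = (field1 field2);
quit;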

PG
s_lassen
Meteorite | Level 14

Here is the SQL - "master_bkp" has the 126 million rows. Using proc datasets in a prior step, I had created a composite index on field1 and field2.

 

proc sql;
  create table test as
  select *
  from master_bkp a, lookup b
  where b.field1 = a.field2 and a.field1 < b.field2
  ;
quit;

 

Rather than using a composite index and SQL, I would go with a simple index on FIELD2 in master_bkp, and use a DATA step, something like:

 

proc sql;
  create index field2 on master_bkp(field2);
quit;


data want;
  set lookup;
  lookup_field2=field2; /* save the lookup value for the inequality test */
  do field2=.,field1; /* dummy missing key first, to reset the index pointer in case of duplicate keys in lookup */
    do until(0); /* do forever, or until LEAVE */
      set master_bkp(rename=(field1=master_field1)) key=field2;
      if _iorc_ then do; /* last matching row read, no more found */
        _error_=0; /* when _iorc_ is set, _error_ is also set, but we do not want an error message */
        leave;
      end;
      if master_field1<lookup_field2 then output; /* a.field1 < b.field2 from the original SQL */
    end;
  end;
run;

I am not sure SAS is that good at using composite indexes for non-equality conditions (like your "<" comparison), but this will probably work (not tested, though - no test data). You can try it out with a single observation from lookup first, to check the syntax and see how it performs.
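One way to check whether SAS actually uses an index for a given step is the MSGLEVEL system option, which writes extra INFO notes about index usage to the log:

options msglevel=i; /* the log then notes whether an index was used for WHERE or KEY= processing */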

 
