ChrisNZ
Tourmaline | Level 20

 

The first thing to do when this happens is add the _method option on proc sql, to see what joins are used.

 

You can then decide on the next step, depending on your expectations vs the reality.

 

Indexes are probably not used, but you don't know at the moment (and maybe they shouldn't be used, depending on the size of the secondary table). You need to know what's happening, and monitor the effect of any change you make, before proceeding. Otherwise you are just stabbing in the dark.
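For example (just a sketch; the table and column names below are placeholders, not anything from this thread):

proc sql _method _tree;
  /* _method writes the chosen access plan to the log, with codes such as */
  /* sqxjm (merge join), sqxjhsh (hash join) or sqxjndx (index join);     */
  /* _tree adds the full query tree                                       */
  create table WANT as
  select a.*, b.DATETIME as LKP_DATETIME
  from MASTER a
       left join LOOKUP b
       on a.POLICY = b.POLICY;
quit;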

 

 

 

shl007
Obsidian | Level 7

Hello - I assume the _method option on proc sql would require that the SQL actually finishes running? That is part of my issue... Maybe I could run it on a subset of the large dataset, but then would that model reality? Thanks for all the tips.

 

Also, on the response mentioning multiple hashes, what do you mean by multiple? Meaning multiple key fields?

ChrisNZ
Tourmaline | Level 20

> I assume the _method option on proc sql would require that the SQL actually finishes running?

The _method option does not require the query to complete. Look it up.

 

> Also, on the response mentioning multiple hashes, what do you mean by multiple? Meaning multiple key fields?

I mean several successive data steps where you only fetch part of the lookup data.

This is in case your lookup data is too large to fit in memory: you look up one half (for example) and then the other half.
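A rough sketch of that idea (BIGTABLE, LOOKUP and KEYVAR are placeholder names, and the mod() split assumes a numeric key; split on whatever carves your lookup table cleanly in two):

data MATCHED1 UNRESOLVED;
  if _N_ = 1 then do;
    if 0 then set LOOKUP;                                           /* define LOOKUP's variables to the PDV */
    declare hash H(dataset: 'LOOKUP (where=(mod(KEYVAR,2)=0))');    /* load the first half only             */
    H.definekey('KEYVAR');
    H.definedata(all:'y');
    H.definedone();
  end;
  set BIGTABLE;
  if H.find() = 0 then output MATCHED1;   /* LOOKUP's variables are only meaningful on matched rows */
  else output UNRESOLVED;                 /* retry these rows against the other half                */
run;

data MATCHED2;
  if _N_ = 1 then do;
    if 0 then set LOOKUP;
    declare hash H(dataset: 'LOOKUP (where=(mod(KEYVAR,2)=1))');    /* load the second half only            */
    H.definekey('KEYVAR');
    H.definedata(all:'y');
    H.definedone();
  end;
  set UNRESOLVED;
  if H.find() = 0 then output;
run;

Each step only holds half of the lookup rows in memory, and the matches from the two passes can be concatenated afterwards.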

 

Patrick
Opal | Level 21

@shl007 

If the lookup table fits into memory then the hash approach sounds promising to me. 

If you can't get things to work, then please provide some representative sample data (created via SAS data step code) for both your master and lookup tables so we can start answering "in code".

The closer the sample data reflects your real data (i.e. variable names and types), the closer the posted code will be to what you need in the end.

shl007
Obsidian | Level 7

Thanks for all the feedback. Using proc sql _method, I determined that indexes were NOT being used, so I abandoned indexes as an option and went to a hash object. Below is the hash code I'm running.

 

The "lookup" and "master_bkp" datasets both share common field names: "policy" and "datetime." I'm only using policy as the key.

 

I'm also comparing "datetime" from the "master_bkp" dataset to "datetime" from the "lookup" dataset (which I've renamed to _datetime) before outputting, because I want one to be before the other.

 

It takes 39 mins to run, and while no errors are generated, I get the note "Variable _datetime is uninitialized". Is my rename option not taking?

 

DATA work.hash2;
  IF _N_ = 0 THEN SET lookup;
  DECLARE hash drgs(dataset:'lookup (rename=(datetime=_datetime))', multidata:'yes');
  drgs.DEFINEKEY('policy');
  drgs.DEFINEDATA(all:'Y');
  drgs.DEFINEDONE();

  DO UNTIL (eof_master);
    SET master_bkp END=eof_master;

    rc = drgs.FIND();

    if drgs.find(key:policy)=0 and datetime < _datetime then output;

    rc = drgs.FIND_NEXT();
  END;
  STOP;
RUN;

ChrisNZ
Tourmaline | Level 20

1. > Using proc sql _method, I determined indexes were NOT being used. 

Did you try using simple indexes to see if they were used?

 

2. Post the code using {i} or the running man, to avoid smileys

 

3. You need to define _datetime to the data step compiler. The data step has no view of what the hash object does with its dataset argument, so the rename inside the declaration never creates a PDV variable.

Add  _datetime=.;  after the call to definedone(),

or add the rename statement in the first SET.

 

4. You need to look again at how to use the find_next() method.

 

ChrisNZ
Tourmaline | Level 20

Something like:

 

data WORK.HASH2 ;
  if _N_ = 1 then do;
    /* compile-time SET: makes _DATETIME (and the other LOOKUP variables) known to the PDV */
    if 0 then set LOOKUP (rename=(DATETIME=_DATETIME));
    declare hash DRGS(dataset:'LOOKUP (rename=(DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey ('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  set MASTER_BKP ;
  RC = DRGS.find();                     /* first lookup item for this POLICY */
  if RC = 0 then do;
    if DATETIME < _DATETIME then output;
    DRGS.has_next(result: R);           /* more items sharing this key?      */
    do while(R ne 0);
      RC = DRGS.find_next();            /* fetch the next duplicate          */
      if DATETIME < _DATETIME then output;
      DRGS.has_next(result: R);
    end;
  end;
run;

or use the iterator object.
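Or, if your SAS release supports the hash object's DO_OVER method, the duplicate-key traversal collapses into a single loop. A sketch along those lines, with the same table and variable names as above:

data WORK.HASH2;
  if _N_ = 1 then do;
    if 0 then set LOOKUP (rename=(DATETIME=_DATETIME));
    declare hash DRGS(dataset:'LOOKUP (rename=(DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  set MASTER_BKP;
  /* do_over() acts like find() on its first call for a key and like find_next() after that */
  do while (DRGS.do_over() = 0);
    if DATETIME < _DATETIME then output;
  end;
run;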

 

ChrisNZ
Tourmaline | Level 20

@shl007  I see you accepted the solution. What's the performance like now?

shl007
Obsidian | Level 7

25 mins, which isn't bad; however, when I compare that runtime to looping over the 17 datasets that feed into the "master" table individually, the runtime drops by more than half (under 10 mins). So I'm considering the loop over the individual datasets as the better solution... But we will definitely still use the hash object if we need the single master table.

 

Thanks to EVERYONE who posted in this thread! The suggestions about proc sql _method, and about inverting the order of the tables, helped in particular as well.

shl007
Obsidian | Level 7

Just a hopefully quick follow-up on the code marked as the solution... Other than doing a separate proc sql or a separate data step merge, is there a way within the hash logic to ensure ALL rows are pulled from the LOOKUP dataset even if there aren't matching rows in the MASTER_BKP dataset?

 

I know that is a different kind of spin on the LOOKUP dataset than what the hash object expects or is intended to handle. So I'm thinking of a separate data step merge or proc sql to show the observations in LOOKUP that are not in MASTER_BKP. A sample use case: we want to see rows in the LOOKUP table that were not in MASTER_BKP (as that would indicate a possible anomaly).

 

If there is a more efficient way other than a separate proc sql or DATA step outside the hash object, happy to hear it.

 

Thanks All!

ChrisNZ
Tourmaline | Level 20

> If there is a more efficient way other than a separate proc sql or DATA step outside the hash object, happy to hear it.

Hash tables are mainly meant to be used for left joins.

The way to do what you want would be to alter the hash table as you go:

- either remove key entries as they are matched (if you know you only need them once)

- or (more probably) update an extra flag field each time a key is matched 

You can then export the hash table to a SAS table to get the unmatched records (using the OUTPUT method).
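A rough sketch of the flag-field route, keeping the table names from this thread. The view V_LOOKUP, the flag MATCHED and the output table UNMATCHED_LOOKUP are names I'm making up here, and the where= data set option on the OUTPUT method assumes your release accepts data set options there:

/* a view adds the MATCHED flag so that it can live inside the hash */
data V_LOOKUP / view=V_LOOKUP;
  set LOOKUP (rename=(DATETIME=_DATETIME));
  MATCHED = 0;
run;

data WORK.HASH2;
  if _N_ = 1 then do;
    if 0 then set V_LOOKUP;
    declare hash DRGS(dataset:'V_LOOKUP', multidata:'yes');
    DRGS.definekey('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  set MASTER_BKP end=EOF;
  RC = DRGS.find();
  do while (RC = 0);
    if DATETIME < _DATETIME then output;
    MATCHED = 1;
    DRGS.replacedup();                 /* flag the current lookup item as matched */
    RC = DRGS.find_next();
  end;
  /* after the last master row, export the lookup rows that never matched */
  if EOF then DRGS.output(dataset: 'UNMATCHED_LOOKUP (where=(MATCHED=0))');
run;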

LinusH
Tourmaline | Level 20
Adding
Options msglevel=i;
will let you know if the index is being used.
Also adding
PROC SQL _method;
will let you know what query strategy is being used.
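For example (MASTER, LOOKUP, POLICY and the literal value are placeholders for your own names):

options msglevel=i;             /* INFO: notes in the log then report index usage      */

data SUBSET;
  set MASTER;
  where POLICY = '12345';       /* a WHERE on an indexed variable triggers such a note */
run;

proc sql _method;               /* the chosen join strategy is written to the log      */
  select count(*)
  from MASTER a, LOOKUP b
  where a.POLICY = b.POLICY;
quit;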
Data never sleeps

