ChrisNZ
Tourmaline | Level 20

 

The first thing to do when this happens is add the _method option on proc sql, to see what joins are used.

 

You can then decide on the next step, depending on your expectations vs the reality.

 

Indexes are probably not used, but you don't know at the moment (and maybe they shouldn't be used, depending on the size of the secondary table). You need to know what's happening, and monitor the effect of any change you make, before proceeding. Otherwise you are just stabbing in the dark.
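For example (just a sketch; the table and column names below are placeholders, not anything from this thread):

proc sql _method _tree;
  /* _method writes the chosen access plan to the log, with codes such as */
  /* sqxjm (merge join), sqxjhsh (hash join) or sqxjndx (index join);     */
  /* _tree adds the full query tree                                       */
  create table WANT as
  select a.*, b.DATETIME as LKP_DATETIME
  from MASTER a
       left join LOOKUP b
       on a.POLICY = b.POLICY;
quit;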

 

 

 

shl007
Obsidian | Level 7

Hello - I assume the _method option on proc sql would require that the SQL actually finishes running? That is part of my issue... Maybe I could run it on a subset of the large dataset, but then would that model reality? Thanks for all the tips.

 

Also, on the response mentioning multiple hashes, what do you mean by multiple? Meaning multiple key fields?

ChrisNZ
Tourmaline | Level 20

> I assume the _method option on proc sql would require that the SQL actually finishes running?

The _method option does not require the query to complete. Look it up.

 

> Also, on the response mentioning multiple hashes, what do you mean by multiple? Meaning multiple key fields?

I mean several successive data steps where you only fetch part of the lookup data.

This is in case your lookup data is too large to fit in memory: you look up one half (for example) and then the other half.
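A rough sketch of that idea (BIGTABLE, LOOKUP and KEYVAR are placeholder names, and the mod() split assumes a numeric key; split on whatever carves your lookup table cleanly in two):

data MATCHED1 UNRESOLVED;
  if _N_ = 1 then do;
    if 0 then set LOOKUP;                                           /* define LOOKUP's variables to the PDV */
    declare hash H(dataset: 'LOOKUP (where=(mod(KEYVAR,2)=0))');    /* load the first half only             */
    H.definekey('KEYVAR');
    H.definedata(all:'y');
    H.definedone();
  end;
  set BIGTABLE;
  if H.find() = 0 then output MATCHED1;   /* LOOKUP's variables are only meaningful on matched rows */
  else output UNRESOLVED;                 /* retry these rows against the other half                */
run;

data MATCHED2;
  if _N_ = 1 then do;
    if 0 then set LOOKUP;
    declare hash H(dataset: 'LOOKUP (where=(mod(KEYVAR,2)=1))');    /* load the second half only            */
    H.definekey('KEYVAR');
    H.definedata(all:'y');
    H.definedone();
  end;
  set UNRESOLVED;
  if H.find() = 0 then output;
run;

Each step only holds half of the lookup rows in memory, and the matches from the two passes can be concatenated afterwards.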

 

Patrick
Opal | Level 21

@shl007 

If the lookup table fits into memory then the hash approach sounds promising to me. 

If you can't get things to work, then please provide some representative sample data (created via SAS data step code) for both your master and lookup tables so we can start answering "in code".

The closer the sample data reflects your real data (i.e. variable names and types), the closer the posted code will be to what you need in the end.

shl007
Obsidian | Level 7

Thanks for all the feedback. Using proc sql _method, I determined that indexes were NOT being used, so I abandoned indexes as an option and went to a hash object. Below is the hash code I'm running.

 

The "lookup" and "master_bkp" datasets both share common field names: "policy" and "datetime." I'm only using policy as the key.

 

I'm also comparing "datetime" from the "master_bkp" dataset to "datetime" from the "lookup" dataset (which I've renamed to _datetime) before outputting, because I want one to be before the other.

 

It takes 39 mins to run, and while no errors are generated, I get the note "Variable _datetime is uninitialized". Is my rename option not taking?

 

DATA work.hash2;
  IF _N_ = 0 THEN SET lookup;
  DECLARE hash drgs(dataset:'lookup (rename=(datetime=_datetime))', multidata:'yes');
  drgs.DEFINEKEY('policy');
  drgs.DEFINEDATA(all:'Y');
  drgs.DEFINEDONE();

  DO UNTIL (eof_master);
    SET master_bkp END=eof_master;

    rc = drgs.FIND();

    if drgs.find(key:policy)=0 and datetime < _datetime then output;

    rc = drgs.FIND_NEXT();
  END;
  STOP;
RUN;

ChrisNZ
Tourmaline | Level 20

1. > Using proc sql _method, I determined indexes were NOT being used. 

Did you try using simple indexes to see if they were used?

 

2. Post the code using {i} or the running man, to avoid smileys

 

3. You need to define _datetime to the data step compiler. The data step has no view of what the hash object does with its dataset argument, so the rename inside the declaration never creates a PDV variable.

Add  _datetime=.;  after the call to definedone(),

or add the rename statement in the first SET.

 

4. You need to look again at how to use the find_next() method.

 

ChrisNZ
Tourmaline | Level 20

Something like:

 

data WORK.HASH2 ;
  if _N_ = 1 then do;
    /* compile-time SET: makes _DATETIME (and the other LOOKUP variables) known to the PDV */
    if 0 then set LOOKUP (rename=(DATETIME=_DATETIME));
    declare hash DRGS(dataset:'LOOKUP (rename=(DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey ('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  set MASTER_BKP ;
  RC = DRGS.find();                     /* first lookup item for this POLICY */
  if RC = 0 then do;
    if DATETIME < _DATETIME then output;
    DRGS.has_next(result: R);           /* more items sharing this key?      */
    do while(R ne 0);
      RC = DRGS.find_next();            /* fetch the next duplicate          */
      if DATETIME < _DATETIME then output;
      DRGS.has_next(result: R);
    end;
  end;
run;

or use the iterator object.
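Or, if your SAS release supports the hash object's DO_OVER method, the duplicate-key traversal collapses into a single loop. A sketch along those lines, with the same table and variable names as above:

data WORK.HASH2;
  if _N_ = 1 then do;
    if 0 then set LOOKUP (rename=(DATETIME=_DATETIME));
    declare hash DRGS(dataset:'LOOKUP (rename=(DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  set MASTER_BKP;
  /* do_over() acts like find() on its first call for a key and like find_next() after that */
  do while (DRGS.do_over() = 0);
    if DATETIME < _DATETIME then output;
  end;
run;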

 

ChrisNZ
Tourmaline | Level 20

@shl007  I see you accepted the solution. What's the performance like now?

shl007
Obsidian | Level 7

25 mins, which isn't bad; however, when I compare that runtime to looping over the 17 datasets that feed into the "master" table individually, the runtime drops by more than half (under 10 mins). So I'm considering the loop over the individual datasets as the better solution... But we will definitely still use the hash object if we need the single master table.

 

Thanks to EVERYONE who posted in this thread! The suggestions about proc sql _method, and about inverting the order of the tables, helped in particular as well.

shl007
Obsidian | Level 7

Just a hopefully quick follow-up on the code marked as the solution... Other than doing a separate proc sql or a separate data step merge, is there a way within the hash logic to ensure ALL rows are pulled from the LOOKUP dataset even if there aren't matching rows in the MASTER_BKP dataset?

 

I know that is a different kind of spin on the LOOKUP dataset than what the hash object expects or is intended to handle. So I'm thinking of a separate data step merge or proc sql to show the observations in LOOKUP that are not in MASTER_BKP. A sample use case: we want to see rows in the LOOKUP table that were not in MASTER_BKP (as that would indicate a possible anomaly).

 

If there is a more efficient way other than a separate proc sql or DATA step outside the hash object, happy to hear it.

 

Thanks All!

ChrisNZ
Tourmaline | Level 20

> If there is a more efficient way other than a separate proc sql or DATA step outside the hash object, happy to hear it.

Hash tables are mainly meant to be used for left joins.

The way to do what you want would be to alter the hash table as you go:

- either remove key entries as they are matched (if you know you only need them once)

- or (more probably) update an extra flag field each time a key is matched 

You can then export the hash table to a SAS table to get the unmatched records (using the OUTPUT method).
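A rough sketch of the flag-field route, keeping the table names from this thread. The view V_LOOKUP, the flag MATCHED and the output table UNMATCHED_LOOKUP are names I'm making up here, and the where= data set option on the OUTPUT method assumes your release accepts data set options there:

/* a view adds the MATCHED flag so that it can live inside the hash */
data V_LOOKUP / view=V_LOOKUP;
  set LOOKUP (rename=(DATETIME=_DATETIME));
  MATCHED = 0;
run;

data WORK.HASH2;
  if _N_ = 1 then do;
    if 0 then set V_LOOKUP;
    declare hash DRGS(dataset:'V_LOOKUP', multidata:'yes');
    DRGS.definekey('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  set MASTER_BKP end=EOF;
  RC = DRGS.find();
  do while (RC = 0);
    if DATETIME < _DATETIME then output;
    MATCHED = 1;
    DRGS.replacedup();                 /* flag the current lookup item as matched */
    RC = DRGS.find_next();
  end;
  /* after the last master row, export the lookup rows that never matched */
  if EOF then DRGS.output(dataset: 'UNMATCHED_LOOKUP (where=(MATCHED=0))');
run;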

LinusH
Tourmaline | Level 20
Adding
Options msglevel=i;
will let you know if the index is being used.
Also adding
PROC SQL _method;
will let you know what query strategy is being used.
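For example (MASTER, LOOKUP, POLICY and the literal value are placeholders for your own names):

options msglevel=i;             /* INFO: notes in the log then report index usage      */

data SUBSET;
  set MASTER;
  where POLICY = '12345';       /* a WHERE on an indexed variable triggers such a note */
run;

proc sql _method;               /* the chosen join strategy is written to the log      */
  select count(*)
  from MASTER a, LOOKUP b
  where a.POLICY = b.POLICY;
quit;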
Data never sleeps

