The first thing to do when this happens is add the _method option on proc sql, to see what joins are used.
You can then decide on the next step, depending on your expectations vs the reality.
Indexes are probably not used, but you don't know at the moment (and maybe they shouldn't be used, depending on the size of the secondary table). You need to know what's happening, and monitor the effects of any change you make, before proceeding. Otherwise you are just stabbing in the dark.
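For reference, a minimal sketch of turning on _method (the table and variable names here are assumptions for illustration, not taken from your job):

proc sql _method;
   create table work.joined as
   select a.*, b.datetime as lookup_datetime
   from work.master_bkp as a
        inner join work.lookup as b
        on a.policy = b.policy;
quit;

The log then shows codes such as sqxjsl (step loop join), sqxjm (sort-merge join), sqxjhsh (hash join) or sqxjndx (index join), telling you which join algorithm the planner chose.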
Hello - I assume the _method option on proc sql would require that the SQL actually finishes running? That is part of my issue ... Maybe I run it on a subset of the large dataset, but then is that modeling reality? Thanks for all the tips.
Also, on the response mentioning multiple hashes, what do you mean by multiple? Meaning multiple key fields?
> I assume the _method option on proc sql would require that the SQL actually finishes running?
The _method option does not require the query to complete. Look it up.
> Also, on the response mentioning multiple hashes, what do you mean by multiple? Meaning multiple key fields?
I mean several successive data steps where you only fetch part of the lookup data.
This is in case your lookup data is too large to fit in memory: you look up one half (for example) and then the other half.
If the lookup table fits into memory then the hash approach sounds promising to me.
If you can't get things to work, then eventually provide some representative sample data (created via SAS data step code) for both your master and lookup tables so we can start answering "in code".
The closer the sample data reflects your real data (i.e. variable names and types), the closer the posted code will be to what you need in the end.
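The "multiple hashes" idea mentioned earlier could be sketched roughly like this, splitting the lookup data in two halves by key range. The dataset names, the split point, and the WHERE= values are assumptions; multidata handling is omitted for brevity:

/* Pass 1: look up against the first half of the lookup data */
data work.pass1;
   if _N_ = 1 then do;
      if 0 then set lookup (rename=(datetime=_datetime));
      declare hash drgs(dataset:
         'lookup (rename=(datetime=_datetime) where=(policy <= 5000000))');
      drgs.definekey('policy');
      drgs.definedata(all:'y');
      drgs.definedone();
   end;
   set master_bkp;
   if drgs.find() = 0 and datetime < _datetime then output;
run;

/* Pass 2: repeat the same step with where=(policy > 5000000)
   and append the results */

Each pass only holds half the lookup data in memory, at the cost of reading the master table twice.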
Thanks for all the feedback. Using proc sql _method, I determined indexes were NOT being used. So I abandoned indexes as an option and went to hash. Further below is the hash code I'm running.
The "lookup" and "master_bkp" datasets both share common field names: "policy" and "datetime." I'm only using policy as the key.
I'm also comparing "datetime" from the "master_bkp" dataset (which I've renamed to _datetime) to "datetime" from the "lookup" dataset before outputting because I want one to be before the other.
It takes 39 mins. to run, and while no errors are generated, I get the message "Variable _datetime is uninitialized." Is my rename option not taking?
DATA work.hash2;
  IF _N_ = 0 THEN SET lookup;
  DECLARE hash drgs(dataset:'lookup (rename=(datetime=_datetime))', multidata:'yes');
  drgs.DEFINEKEY('policy');
  drgs.DEFINEDATA(all:'Y');
  drgs.DEFINEDONE();
  DO UNTIL (eof_master);
    SET master_bkp END=eof_master;
    rc = drgs.FIND();
    if drgs.find(key:policy)=0 and datetime < _datetime then output;
    rc = drgs.FIND_NEXT();
  END;
  STOP;
RUN;
1. > Using proc sql _method, I determined indexes were NOT being used.
Did you try using simple indexes to see if they were used?
2. Post the code using {i} or the running man, to avoid smileys
3. You need to define _datetime to the data step; the data step compiler has no view of what the hash object contains.
Add _datetime=.; after the call to definedone(),
or add the rename= option to the first SET statement.
4. You need to look again at how to use the find_next() method.
Something like:
data WORK.HASH2;
  if _N_ = 1 then do;
    if 0 then set LOOKUP (rename=(DATETIME=_DATETIME));
    declare hash DRGS(dataset:'LOOKUP (rename=(DATETIME=_DATETIME))', multidata:'yes');
    DRGS.definekey('POLICY');
    DRGS.definedata(all:'y');
    DRGS.definedone();
  end;
  set MASTER_BKP;
  RC = DRGS.find();
  if RC = 0 then do;
    if DATETIME < _DATETIME then output;
    DRGS.has_next(result: R);
    do while(R ne 0);
      RC = DRGS.find_next();
      if DATETIME < _DATETIME then output;
      DRGS.has_next(result: R);
    end;
  end;
run;
or use the iterator object.
@shl007 I see you accepted the solution. What's the performance like now?
25 mins, which isn't bad; however, when I compare that runtime to looping over the 17 datasets that feed into the "master" table individually, the runtime drops by half (under 10 mins). So I'm considering the loop over individual datasets as the more optimal solution ... But we will definitely still use the hash object if we need the single master table.
Thanks to EVERYONE who posted in this thread! The user who mentioned the proc sql _method option, and the user who mentioned inverting the order of the tables, those tips in particular helped as well.
Just a hopefully quick follow-up on the code marked as the solution ... Other than doing a separate proc sql or separate data step merge, is there a way within the hash logic to ensure ALL rows are pulled from the LOOKUP dataset even if there aren't matching rows in the MASTER_BKP dataset?
I know that is a different kind of spin to the LOOKUP dataset than maybe what the hash object expects or is intended to handle. So I'm thinking a separate data step merge or proc sql to show the observations in LOOKUP not in MASTER_BKP. Sample use case for this would be, we want to see rows in the LOOKUP table that were not in the MASTER_BKP (as that would indicate a possible anomaly).
If there is a more efficient way other than a separate proc sql or DATA step outside the hash object, happy to hear it.
Thanks All!
> If there is a more efficient way other than a separate proc sql or DATA step outside the hash object, happy to hear it.
Hash tables are mainly meant to be used for left joins.
The way to do what you want would be to alter the hash table as you go:
- either remove key entries as they are matched (if you know they are only needed once)
- or (more probably) update an extra flag field each time a key is matched
You can then export the hash table to a SAS table (using the OUTPUT method) to get the unmatched records.
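The flag-field idea could be sketched like this. The dataset names and the MATCHED variable are assumptions, and for simplicity this assumes one lookup row per key (a multidata version would use replacedup() after find()/find_next() instead of replace()):

data work.matched;
   if _N_ = 1 then do;
      if 0 then set lookup (rename=(datetime=_datetime));
      matched = 0;  /* extra flag carried in each hash entry */
      declare hash drgs();
      drgs.definekey('policy');
      drgs.definedata('policy', '_datetime', 'matched');
      drgs.definedone();
      /* explicit load so the flag can ride along with each entry */
      do until (eof_l);
         set lookup (rename=(datetime=_datetime)) end=eof_l;
         rc = drgs.add();
      end;
   end;
   set master_bkp end=eof_m;
   if drgs.find() = 0 then do;
      if datetime < _datetime then output;
      matched = 1;
      rc = drgs.replace();   /* write the flag back into the hash entry */
   end;
   /* after the last master row, export the never-matched lookup entries */
   if eof_m then rc = drgs.output(dataset:'work.lookup_only (where=(matched=0))');
run;

One data step then produces both the matched output and a WORK.LOOKUP_ONLY table of possible anomalies.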