Solved: Re: Hash

Ranjeeta · Posted 10-02-2019 12:51 PM

I have a dataset labelled as Final1 of patients with 84 records which has INST_NO i.e. the institution where they received some care

I want to compare the above dataset with another dataset (Inst_dsc) in which I have a list of Institution Nos labelled as INST_AT

How would i write a code to check if the INST_No in the 1st dataset is present in the 2nd dataset

Basically want to flag the cases or records

data match (Keep=INST_No INST_AT);

1227 length inst_AT 8 ;

1228 if _N_ then do;

1229 declare hash h (dataset:"work.inst_dsc");

1230 h.definekey ('INST_AT');

1231 call missing (INST_AT);

1232 end;

1233 set Final1;

1234 rc=h.find(INST_AT);

1235 if rc=0 then output;

1236 run;

ERROR: Uninitialized keys for hash object at line 1234 column 4.

ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.

NOTE: The SAS System stopped processing this step because of errors.

NOTE: There were 1 observations read from the data set Final1.

WARNING: The data set WORK.MATCH may be incomplete. When this step was stopped there were 0 observations and 2 variables.

WARNING: Data set WORK.MATCH was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

1237 proc print ;

1238 run;

FreelanceReinh · Posted 10-02-2019 01:34 PM

Moreover, in the FIND method you forgot the "key:" argument tag and used the wrong variable name. It should read

rc=h.find(key:INST_No);

Or simplify the code by combining this with the subsequent IF-THEN statement:

if h.find(key:INST_No)=0 then output;

Once you've corrected this and the issues KurtBremser mentioned (definedata(...) is not required, though, only optional) your program should work. Of course, it doesn't really flag the matching cases, but selects them from dataset Final1. This could be changed easily by replacing the OUTPUT statement with an assignment statement such as flag=1 (and adding flag to the KEEP list).

View solution in original post

Kurt_Bremser · Posted 10-02-2019 01:06 PM

if _N_ then do;

means that you redefine the hash in every data step iteration (all numeric values except missing and zero are true).

Change to

if _N_ = 1 then do;

You are also missing a call to the definedata() (if you have data columns) and definedone() methods.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ranjeeta · Posted 10-02-2019 02:10 PM

Thankyou Appreciate it

FreelanceReinh · Posted 10-02-2019 02:41 PM

I would like to point out that the primary mistake was the omission of definedone();. This caused the error messages in the log. Had it been inserted, the first error message would have changed to

ERROR: Unknown argument tag at line ... column 4. Tag must be key.

which might have helped you to find the solution. So, the most important response was actually @Kurt_Bremser's.

FreelanceReinh · Posted 10-02-2019 01:34 PM

Moreover, in the FIND method you forgot the "key:" argument tag and used the wrong variable name. It should read

rc=h.find(key:INST_No);

Or simplify the code by combining this with the subsequent IF-THEN statement:

if h.find(key:INST_No)=0 then output;

Once you've corrected this and the issues KurtBremser mentioned (definedata(...) is not required, though, only optional) your program should work. Of course, it doesn't really flag the matching cases, but selects them from dataset Final1. This could be changed easily by replacing the OUTPUT statement with an assignment statement such as flag=1 (and adding flag to the KEEP list).

hashman · Posted 10-03-2019 12:18 AM

@Ranjeeta:

Your entire step can be condensed to this:

data match ;                           
  if _n_ = 1 then do ;                 
    dcl hash h (dataset:"inst_dsc") ;  
    h.definekey ("inst_at") ;          
    h.definedone () ;                  
  end ;                                
  set final1 (keep = inst_no inst_at) ;
  if h.check() = 0 ;                    
run ;

Note:

You need neither LENGTH nor CALL MISSING, as the hash key is on file FINAL1, and parameter type matching (i.e. setting the hash host PDV variables) occurs when the compiler sees the SET reference to FINAL1. Most of the time, LENGTH and CALL MISSING aren't needed - they are just leftovers from the maladroitness in the documentation.
As @FreelanceReinh and @Kurt_Bremser have pointed out, _N_=1 is the condition you need to avoid (a) creating a new hash object instant at every observation read from FINAL1 and (b) loading the hash table at every observation.
Keep the only variables you need on input, not on output, to prevent polluting the PDV with extraneous stuff and degrading performance.
There's no need to use an assigned (i.e. RC=) method call because a call embedded in the IF statement prevents generating KEY NOT FOUND errors if there's no match.
Likewise, there's no need for the OUTPUT statement since you can just use the subsetting IF.
Since you don't need to retrieve any satellite information from the data portion of the hash table, you needn't use the FIND method. You're only checking if there's a match; so the CHECK method will suffice, not to mention that it's faster than FIND for the obvious reason that it doesn't entail the extra burden of moving data from the hash table to the PDV host variables.

Kind regards

Paul D.

Kurt_Bremser · Posted 10-03-2019 01:25 AM

One thing I might change:

if h.check() = 0 ;

to

if not h.check();

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

hashman · Posted 10-03-2019 02:27 AM

@Kurt_Bremser:

Yes, it can be coded this way. However, unfortunately, the way SAS designed the return codes is counter-intuitive from the standpoint of the normal SAS Boolean conventions. Hence, for a SAS programmer not quite familiar with the way the hash methods' return codes are set up, IF NOT H.FIND() is semantically perceived as "if key not found", while it's just the opposite.

This incongruence has caused quite a bit of costernation to my coauthor @DonH when he was loading himself with the SAS hash knowledge for the same reason as to me: Naturally, he and I would much rather prefer to have RC=1 if the call is successful and RC=0 otherwise. If that were the case, IF H.FIND() would mean "found" and IF NOT H.FIND() would mean "not found", and no confusion would arise.

More unfortunately, some hash object methods offer argument tags assigned a variable name whose value is set to 1 in case of success and to 0 in case of failure, just the way the things ought to be. For all these reasons, I prefer to code return codes explicitly rather than to turn them into Booleans, just to avoid disorientation both for myself and those reading my hash code.

Kind regards

Paul D.

FreelanceReinh · Posted 10-03-2019 04:49 AM

@hashman: Very clear explanations, as always. I may be wrong, but my understanding was that variable inst_at is in dataset inst_dsc, but not in dataset Final1. In this case I would insert

  inst_at=inst_no;

before the subsetting IF statement to support the simplifications you made to the code (and, of course, remove inst_at from the KEEP list).

hashman · Posted 10-03-2019 02:06 PM

@FreelanceReinh:

You may very well be correct on that head.

SAS Training: Just a Click Away