Solved: question about hash method, SAS PG3

dxiao2017 · Posted 07-04-2025 07:21 AM

Hi, can anyone who is learning this answer my question? Thanks in advance! In the SAS PG3 material on the hash method, the code always contains one or both of these statements: if _n_=1 then do; and if 0 then set;. I have two questions: (1) What do these statements mean and what do they do? (2) Do I always need to write these two statements when using the hash method?

Many thanks!

Tom · Posted 07-04-2025 08:08 AM

Let's answer in reverse. The answer to (2) is NO. There are many times where such things are not needed. It depends on what you are doing.

In the first example you need to know how the data step sets the value of the _N_ automatic variable. At the start of each iteration of the data step, _N_ is set to the count of how many times the data step has iterated. So testing if _N_=1 checks whether this is the FIRST iteration of the data step. For working with HASH objects that can be important because you generally only want to DECLARE or define the hash object once.
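As a minimal sketch of how _N_ behaves, assuming the SASHELP.CLASS sample dataset is available in your installation, the step below writes one message on the first iteration only and then prints _N_ for each observation read:

```sas
data _null_;
   set sashelp.class(obs=3);          /* read only the first three observations */
   if _n_ = 1 then put 'This line is written once, on the first iteration';
   put _n_= name=;                    /* written on every iteration */
run;
```

The same test is what keeps a DECLARE HASH block from running again for every observation.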

In the second example you need to know how SAS treats numeric values as BOOLEAN values. It treats 0 and missing values as FALSE and any other value as TRUE. That means in this example the SET statement will never EXECUTE. But during the compilation of the data step the SET statement will be seen and the variables that exist in the referenced dataset(s) will be created. Yet no values will actually be read in by the SET statement since it never executes. Also, since the SET never executes, it avoids the problem that a dataset with zero observations might cause. Remember that most data steps end when they read past the end of the input data. So actually executing a SET statement even once on a dataset with zero observations would stop the data step immediately. (Note that since the SET statement never executes there is no reason the IF 0 statement needs to be inside the IF _N_=1 DO block.)
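A small sketch of that compile-time effect, again assuming SASHELP.CLASS is available (the output name WORK.EMPTY_SHELL is made up for illustration). The SET never executes, yet the output dataset gets the variables of the input dataset:

```sas
data work.empty_shell;
   if 0 then set sashelp.class;   /* never executes; at compile time it defines
                                     NAME, SEX, AGE, HEIGHT and WEIGHT */
   stop;                          /* end the step before any observation is written */
run;
```

The STOP is there because no statement in this step ever reads data, so nothing else would end it cleanly.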

One reason why it is necessary to use this IF 0 THEN SET ....; trick when using HASH objects is so you can reference the variables from that dataset in the code. Remember that the data step compiler defines variables as it sees them while compiling the code. It defines each variable's type (and storage length for character variables) based on the context where it first sees the variable. When it cannot guess from the context, the variable defaults to numeric.

In most places where you reference variable names in HASH object methods you are using character expressions (h.definekey('id') for example), so the data step compiler does not see those as variable references, just strings. There is therefore no indication that such a variable should exist in the data step. Code referencing the ID variable might then end up accidentally defining ID as numeric in the data step when it was character in the dataset being read. Placing the SET statement at the top of the program to define all of the variables avoids this. And the IF 0 THEN avoids actually reading the data at that point.
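Putting the two pieces together, here is one possible sketch. The datasets LOOKUP and TRANS and the variables ID, DESCR and AMOUNT are hypothetical, created here only so the step can run on its own:

```sas
/* Hypothetical lookup table with a character key */
data lookup;
   input id $ descr $;
   datalines;
A Alpha
B Beta
;

/* Hypothetical transactions to be matched against LOOKUP */
data trans;
   input id $ amount;
   datalines;
A 10
C 30
;

data combined;
   if 0 then set lookup;              /* compile time only: ID and DESCR become character */
   if _n_ = 1 then do;                /* build the hash object once */
      declare hash h(dataset: 'lookup');
      h.defineKey('id');
      h.defineData('descr');
      h.defineDone();
   end;
   set trans;
   if h.find() = 0;                   /* keep only rows whose ID exists in LOOKUP */
run;
```

Without the IF 0 THEN SET line, the compiler would have no definition for DESCR when the hash object is defined.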

dxiao2017 · Posted 07-04-2025 04:22 PM

Thanks a lot for the reply, @Tom! So this is what I understand now:

(1) The if _n_=1 then do; statement is ALWAYS necessary in the hash method, i.e., the basic syntax of the hash method could be as follows (the scenario is that I have two datasets, dsn1 and dsn2, and I would like to combine them based on a key column, column1). Am I right or wrong? Another question: for the rc=h.find(key: column1); statement, I see there are examples such as rc=h.find();, i.e., no key column is specified inside the brackets. When do I need to specify key: column in this statement?

data combine;
   if _n_=1 then do;
      declare hash h(dataset: 'dsn1');
      h.definekey('column1');
      h.definedata('column2','column3');
      h.definedone();
   end;
   set dsn2;
      rc=h.find(key: column1);
      if rc=0;
run;

(2) The if 0 then set; statement is NOT always necessary, but in a lot of cases I need to write it. And it can be replaced by a call missing(column1,column2,column3); statement after the h.definedone(); statement. Am I right?

(3) When I write hash code and do NOT write if 0 then set;, I sometimes get error messages such as: (a) ERROR: Undeclared data symbol columnxxx for hash object, and (b) ERROR: Variable columnxxx has been defined as both character and numeric. Does this mean I should always add the if 0 then set; statement to my hash code if I do not know whether I need it?

Tom · Posted 07-04-2025 05:47 PM

A word of advice. Do not look for rules that you always apply. Instead UNDERSTAND what the code is doing. Then you can invent new patterns for using the commands.

(1) The IF _N_=1 THEN DO is only needed if there are statements you want to execute only once. That is, you don't want them executing on every iteration of the data step. Remember that a normal data step (data...;set...;...;run;) will iterate once for each observation read by the SET statement. Note that the IF _N_=1 construct is also useful when NOT using HASH objects, so make sure you understand what it is doing.

When reading the documentation make sure to check what the DEFAULTS are. For the .FIND() method of the HASH object the default is to use the same variable name(s) as were set by the .DEFINEKEY() method to find the values of the KEYs to search for in the HASH object. So the only time you NEED to pass inputs to the .FIND() method is when you want to search for a different value.
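A minimal sketch of the two call styles, using a hypothetical LOOKUP dataset with a character key ID and a data variable DESCR (none of these names come from the course material):

```sas
/* Hypothetical lookup table */
data lookup;
   input id $ descr $;
   datalines;
A Alpha
B Beta
;

data _null_;
   if 0 then set lookup;          /* defines ID and DESCR without reading anything */
   declare hash h(dataset: 'lookup');
   h.defineKey('id');
   h.defineData('descr');
   h.defineDone();

   id = 'A';
   rc = h.find();                 /* default: searches for the current value of ID */
   put rc= id= descr=;

   rc = h.find(key: 'B');         /* explicit key: searches for 'B'; ID itself is not changed */
   put rc= id= descr=;

   stop;                          /* no input is read, so end the step explicitly */
run;
```

In the asker's step, column1 is both the key variable and the value being searched for, so h.find() and h.find(key: column1) look up the same value.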

(2) CALL MISSING() is completely different from a SET statement. It just sets the values of the listed variables to missing. To understand when you need to use it you need to understand which variables a data step will reset to missing when a new iteration begins. Any variable that is sourced from a dataset (such as the one referenced in a SET statement) does NOT have its values reset to missing. So to avoid having the values remembered from the last time you used the .FIND() method to look for a matching KEY, it is frequently valuable to use CALL MISSING() before the .FIND() method call. Or execute it conditionally when the .FIND() method fails to find the value in the HASH object. But there are other situations where you would want the previous value remembered (retained). You need to understand what you are trying to do to know when to use that statement.
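Here is a sketch of the conditional form, with hypothetical LOOKUP and TRANS datasets built only for illustration. Because DESCR is defined through a SET statement it is retained, so without the CALL MISSING the unmatched row for ID C would carry over the DESCR value found for ID A:

```sas
/* Hypothetical lookup table with a single key */
data lookup;
   input id $ descr $;
   datalines;
A Alpha
;

/* Hypothetical transactions; ID C has no match in LOOKUP */
data trans;
   input id $ amount;
   datalines;
A 10
C 30
;

data combined;
   if 0 then set lookup;              /* DESCR comes from a dataset, so it is retained */
   if _n_ = 1 then do;
      declare hash h(dataset: 'lookup');
      h.defineKey('id');
      h.defineData('descr');
      h.defineDone();
   end;
   set trans;
   /* No match: clear the value left over from the previous iteration */
   if h.find() ne 0 then call missing(descr);
run;
```

Dropping the CALL MISSING line would instead keep the last matched value, which is the "retained" behaviour mentioned above.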

And placing it immediately after the .DEFINEDONE() method call is not normally needed since you typically only run that statement once.

(3) Yes. If you define a HASH object to use a variable that your data step never defines, it will cause that first error. You do NOT have to use a SET statement to define the variables, however. It depends on what you are trying to do.
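For instance, a LENGTH statement can define the variables instead. This is only a sketch: LOOKUP, TRANS, ID and DESCR are hypothetical names, and the lengths are assumed to match those in LOOKUP:

```sas
/* Hypothetical lookup table */
data lookup;
   input id $ descr $;
   datalines;
A Alpha
B Beta
;

/* Hypothetical transactions */
data trans;
   input id $ amount;
   datalines;
A 10
C 30
;

data combined;
   length id $ 8 descr $ 8;           /* defines type and length; assumed to match LOOKUP */
   if _n_ = 1 then do;
      declare hash h(dataset: 'lookup');
      h.defineKey('id');
      h.defineData('descr');
      h.defineDone();
      call missing(descr);            /* avoids the "uninitialized" note for DESCR */
   end;
   set trans;
   rc = h.find();                     /* rc = 0 when ID was found in LOOKUP */
run;
```

Here DESCR is not sourced from a dataset, so it is reset to missing at the start of each iteration, unlike in the IF 0 THEN SET version.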

That second error message is caused by trying to define the same variable as both numeric and character. You can get that error message in many other situations that have nothing to do with HASH objects. For example, when using two datasets that have a variable with the same name but defined with different types.
