In the following minimal example of declaring and using a hash object inside a data step, my understanding is that:
You have to include both "if 0 then set A;
" and "call missing(k);
" , if you don't want SAS to complain with "ERROR: Undeclared key symbol for hash object at line 31 column 7."
Can someone please confirm my understanding? Or am I confusing two different concepts? Thanks in advance.
@PeterClemmensen Would highly appreciate your input since this post was a great help!
SAS Version: SAS Enterprise Guide 8.3 Update 8 (64-bit)
data _null_;
if 0 then set A
if _N_ = 1 then do;
declare hash h(dataset:'A');
h.defineKey('k');
call missing(k);
end;
set B;
if h.find()=0 then put "Key exists in A: " k=;
run;
No. Both are not required. And for that particular example neither is needed since the variable K used in the HASH object must already exist in dataset B for the code to make any sense.
81 data a b ; 82 do k=1 to 2; output a; end; 83 do k=1 to 4; output b; end; 84 stop; 85 run; NOTE: The data set WORK.A has 2 observations and 1 variables. NOTE: The data set WORK.B has 4 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 86 87 data _null_; 88 if _N_ = 1 then do; 89 declare hash h(dataset:'A'); 90 h.defineKey('k'); 91 h.defineDone(); 92 end; 93 94 set B; 95 if h.find()=0 then put "Key exists in A: " k=; 96 else put "Key does NOT exist in A: " k=; 97 run; NOTE: There were 2 observations read from the data set WORK.A. Key exists in A: k=1 Key exists in A: k=2 Key does NOT exist in A: k=3 Key does NOT exist in A: k=4 NOTE: There were 4 observations read from the data set WORK.B. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
The purpose of the unexecuted SET statement is to let the data step compiler see the variables you are using from that dataset. That is needed because any variables you are referencing in the DEFINEKEY() and DEFINEDATA() method calls are "invisible" to the data step compiler. They just look like sting literals. So if those variables are never defined some where in the data step then you get a RUN TIME error when the hash object is created.
4 data test; 5 set sashelp.class(obs=1); 6 declare hash h(); 7 h.definekey('fred'); 8 h.definedone(); 9 run; ERROR: Undeclared key symbol fred for hash object at line 8 column 3. ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase. NOTE: The SAS System stopped processing this step because of errors. NOTE: There were 1 observations read from the data set SASHELP.CLASS. WARNING: The data set WORK.TEST may be incomplete. When this step was stopped there were 0 observations and 5 variables. WARNING: Data set WORK.TEST was not replaced because this step was stopped. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds
You don't need to use the SET statement to define the variables. You could actual define the variables using LENGTH or ATTRIB statements.
One use of the CALL MISSING() statement is to avoid a message about a variable not be initiated. There are other ways to do that. You could initiate the variable with an assignment statement instead. Or you could include the variable in an array definition. The use of CALL MISSING() just make it easier.
But another use of CALL MISSING() statement is important when using this type of HASH() lookup. And that is using CALL MISSING() for its original purpose, to set variable to missing values. You should use that when the FIND() method did not find an entry. That is especially important when using the unexecuted SET statement to define the variable because variables that appear in source datasets are RETAINED (that is not set to missing at start of each iteration of the dataset loop). So to prevent lookup values from being carried forward it is important to clear those values when the lookup fails.
Try this example:
data newclass ;
length name $8 ;
do name='Alfred','Sam';
output;
end;
run;
data want;
if 0 then set sashelp.class(keep=name age);
if _N_ = 1 then do;
declare hash h(dataset:'sashelp.class');
h.defineKey('name');
h.definedata('age');
h.defineDone();
end;
set newclass;
put name= @;
if h.find()=0 then put age=;
else do;
call missing(age);
put age= " Could not find age";
end;
run;
proc print;
run;
Now try running it without the CALL MISSING() statement. What value of AGE did you get for NAME="Sam"?
Now add remove the IF 0 ... statement and add a LENGTH AGE 8; statement after the SET NEWCLASS statement. What value do you get now?
Your code won't work because of a missing semicolon 😉
The purpose of
if 0 then set a;
is to make the data step compiler fetch the dataset's metadata and include the variables in the PDV; this statement automates the definition of the variables for the rest of the DATA step. You therefore do not need a LENGTH/FORMAT statement.
CALL MISSING, OTOH, avoids the NOTE about uninitialized variables.
Maxim 4: try your code with and without these statements to see what happens.
Thank you so much, @Kurt_Bremser !!!
I will certainly take a look at those maxiums and try to apply them.
And yes! the forsaken semi colon 🙂 bane of my existence
No. Both are not required. And for that particular example neither is needed since the variable K used in the HASH object must already exist in dataset B for the code to make any sense.
81 data a b ; 82 do k=1 to 2; output a; end; 83 do k=1 to 4; output b; end; 84 stop; 85 run; NOTE: The data set WORK.A has 2 observations and 1 variables. NOTE: The data set WORK.B has 4 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 86 87 data _null_; 88 if _N_ = 1 then do; 89 declare hash h(dataset:'A'); 90 h.defineKey('k'); 91 h.defineDone(); 92 end; 93 94 set B; 95 if h.find()=0 then put "Key exists in A: " k=; 96 else put "Key does NOT exist in A: " k=; 97 run; NOTE: There were 2 observations read from the data set WORK.A. Key exists in A: k=1 Key exists in A: k=2 Key does NOT exist in A: k=3 Key does NOT exist in A: k=4 NOTE: There were 4 observations read from the data set WORK.B. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
The purpose of the unexecuted SET statement is to let the data step compiler see the variables you are using from that dataset. That is needed because any variables you are referencing in the DEFINEKEY() and DEFINEDATA() method calls are "invisible" to the data step compiler. They just look like sting literals. So if those variables are never defined some where in the data step then you get a RUN TIME error when the hash object is created.
4 data test; 5 set sashelp.class(obs=1); 6 declare hash h(); 7 h.definekey('fred'); 8 h.definedone(); 9 run; ERROR: Undeclared key symbol fred for hash object at line 8 column 3. ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase. NOTE: The SAS System stopped processing this step because of errors. NOTE: There were 1 observations read from the data set SASHELP.CLASS. WARNING: The data set WORK.TEST may be incomplete. When this step was stopped there were 0 observations and 5 variables. WARNING: Data set WORK.TEST was not replaced because this step was stopped. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds
You don't need to use the SET statement to define the variables. You could actual define the variables using LENGTH or ATTRIB statements.
One use of the CALL MISSING() statement is to avoid a message about a variable not be initiated. There are other ways to do that. You could initiate the variable with an assignment statement instead. Or you could include the variable in an array definition. The use of CALL MISSING() just make it easier.
But another use of CALL MISSING() statement is important when using this type of HASH() lookup. And that is using CALL MISSING() for its original purpose, to set variable to missing values. You should use that when the FIND() method did not find an entry. That is especially important when using the unexecuted SET statement to define the variable because variables that appear in source datasets are RETAINED (that is not set to missing at start of each iteration of the dataset loop). So to prevent lookup values from being carried forward it is important to clear those values when the lookup fails.
Try this example:
data newclass ;
length name $8 ;
do name='Alfred','Sam';
output;
end;
run;
data want;
if 0 then set sashelp.class(keep=name age);
if _N_ = 1 then do;
declare hash h(dataset:'sashelp.class');
h.defineKey('name');
h.definedata('age');
h.defineDone();
end;
set newclass;
put name= @;
if h.find()=0 then put age=;
else do;
call missing(age);
put age= " Could not find age";
end;
run;
proc print;
run;
Now try running it without the CALL MISSING() statement. What value of AGE did you get for NAME="Sam"?
Now add remove the IF 0 ... statement and add a LENGTH AGE 8; statement after the SET NEWCLASS statement. What value do you get now?
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.