Debugging logic errors in SAS data step programs can be challenging. However, there are lots of well-known and established techniques, including, but not limited to:
using the PUT statement to display the values of key variables in the SAS log
using the SAS debugger
Unfortunately the contents of a SAS hash object (aka a hash table) can not be displayed to the log using a PUT statement; likewise the debugger does not recognize the SAS hash object - the ex(amine) debugger statement displays this message when used with a SAS hash object pointer: Cannot print object type
As Paul (aka @hashman) and I (aka @DonH) worked on our SAS Press Book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study, we used some tips and tricks to facilitate the debugging of the examples in the book. We would like to share a couple of those in this article.
So let us first create an example program that is not producing the expected results. As an aside, it is not an easy task to create a program that is wrong on purpose in a subtle way. So you will have to bear with our example program whose error may be obvious to you.
We want to use the hash object to create aggregates using a class variable(s) and several analysis variables. The wrinkle is that the data are not sorted by the class variable. And yes, we know that there are lots of SAS procedures that do this out of the box (but see the immediately preceding comment about creating programs that have non-obvious bugs). Specifically we want to summarize the SASHELP.CLASS data set to create an output dataset with the sum of the height and weight variables for each value of the variable Sex. The following program is our first attempt to do this:
data _null_;
if _n_ = 1 then
do;
dcl hash genderSum(ordered:"A");
genderSum.defineKey("Sex");
genderSum.defineData("Sex","Count","Height_Sum","Weight_Sum");
genderSum.defineDone();
end;
do until(last.sex);
set sashelp.class end=lr;
by sex notsorted;
if first.sex then call missing(Count,Height_Sum,Weight_Sum);
Count + 1;
Height_Sum + Height;
Weight_Sum + Weight;
end;
genderSum.replace();
if lr then genderSum.output(dataset:"Sums");
run;
We define the hash object on the first execution of the DATA step and then use a DoW loop to reach each group of data with the same value for the variable Sex. As is typical for first. and last. processing, we make sure to initialize our sum variables to missing when encountering a new value for the variable Sex and use SUM statements to create the needed aggregates. At the end of each BY group we use the REPLACE method to update our hash table. The REPLACE method will add a row if the key (in this case SEX) is not found, otherwise it will update the current row with the values of the PDV host variables.
Upon running this program we can see that the results seem to be wrong:
What we would like to do in order to debug this is to examine the contents of the hash table pointed to by genderSum after each BY group. The question is how? One approach is to use the OUTPUT method with a unique name for each data set. Since the DoW loop executes once for each BY group, the variable _N_ (the DATA step execution counter) does that. So we add an OUTPUT method call after the call to the REPLACE method. Note that the name of the output data set is defined as a character expression - thus, allowing us to create separate data sets:
genderSum.replace();
genderSum.output(dataset:cats("Sums",_n_));
if lr then genderSum.output(dataset:"Sums");
This will cause data sets Sums1, Sums2, . . . . to be created.
As a result of reviewing this output, the first sign of trouble is the Sums4 data set. The value of Count should be increasing. The last output data set (obviously) matches our original output. Our next step is to examine the source data and discover that the last F group has 4 rows and the last M group has 5. Thus the problem is an issue with initialization. Upon reviewing our logic we recognize that resetting the totals at FIRST.SEX does not work as expected. Upon encountering a group already in the hash table, we want the find method to load the current cumulative values into the PDV host varialbles.
The issue is that since the data are grouped, but not sorted, the FIRST. check is resetting the values to missing with each new group. We only want to initialize the values to missing the first time each group (e.g., value of SEX is encountered). And if we are reading data for a different value of Sex we need to copy the values from the hash table to the PDV host variables Count, Height_Sum and Weight_Sum.
So we replace the IF statement that is invoking the MISSING function as follows:
if first.sex and genderSum.find() ne 0 then call missing(Count,Height_Sum,Weight_Sum);
We immediately discover that the data are still off:
So at this point we decide to use the debugger to check the values of the PDV host variables and discover that due to the ordering of the FIND, REPLACE and OUTPUT methods we are still overwriting the values for Count, Height_Sum and Weight_Sum. The issue is that genderSum.find() is executed as each observation is read. So our sum variables do not retain the values from the previous row if it had the same value for the variable Sex.
We need to initialize the sum variables to missing on the first execution of the DATA step for each value of Sex, but we don't want the FIND method to overwrite the sum PDV host variables on other rows. That is where the CHECK method comes into play. Use the CHECK method to determine if the values should be set to missing because we are encountering a new value (i.e., not in the hash table) of our class variable Sex; and the FIND method otherwise to load the current cumulative values into the PDV host variables. Our final (and correct) program follows:
data _null_;
if _n_ = 1 then
do;
dcl hash genderSum(ordered:"A");
genderSum.defineKey("Sex");
genderSum.defineData("Sex","Count","Height_Sum","Weight_Sum");
genderSum.defineDone();
end;
do until(last.sex);
set sashelp.class end=lr;
by sex notsorted;
if first.sex then
if genderSum.check() ne 0 then call missing(Count,Height_Sum,Weight_Sum);
else genderSum.find();
Count + 1;
Height_Sum + Height;
Weight_Sum + Weight;
end;
genderSum.replace();
if lr then genderSum.output(dataset:"Sums");
run;
Before closing, one last tip on using the OUTPUT method when debugging. Recall the code where we used a distinct name for the name of the output data set so we could see the contents of the hash object after each group of data was processed:
genderSum.replace();
genderSum.output(dataset:cats("Sums",_n_));
if lr then genderSum.output(dataset:"Sums");
Instead of a unique name you can use the OUTPUT method with a fixed name and use the debugger to pause after each group of data are processed. Since there is one execution of the DATA step per BY group, you can WATCH _N_ and when the debugger caused the program to pause, just double click on the data set name in the explorer window to see the contents.
genderSum.replace();
genderSum.output(dataset:"Interim")
if lr then genderSum.output(dataset:"Sums");
This screenshot of the Log, Debugger and Explorer Window highlights how the Interim data set is output after each BY group.
... View more