Quentin
Super User

I'm working my way through Dorfman & Henderson's excellent hash treatise, https://www.amazon.com/Management-Solutions-Using-Table-Operations/dp/1629601438.

 

They make the point that when a hash method throws an error, it will happily throw an unlimited number of errors until there are no more errors to throw (or perhaps the log has filled up).  Seems to me it would be much better if it honored the ERRORS option, which is designed to limit the number of errors thrown.  

 

So currently, the code below throws 20 duplicate-key errors, even though the ERRORS option is set to 3:

 

option errors=3 ;

data _null_ ;
  dcl hash h() ;
  h.definekey("id") ;
  h.definedone() ;
  id=1 ;
  do i=1 to 21 ;
    h.add() ;     
  end ;
run ;

Wouldn't it be better to honor the ERRORS option, i.e. throw three errors and then throw "WARNING: Limit set by ERRORS= option reached. Further errors of this type will not be printed." ?  

 

Is there any benefit to throwing unlimited error messages? 

 

Yes, I know I can prevent any error messages from being thrown by using rc=h.add()  instead of just h.add(), but since I'm a fan of offensive programming, I see it as a great feature that h.add() will throw an error if it can't add a record.

novinosrin
Tourmaline | Level 20

Good afternoon @Quentin, Sir, and I hope you are enjoying your Sunday afternoon.

"I'm working my way through Dorfman & Henderson's excellent hash treatise" -   That makes two of us among the many unknown

Sir, nice point. However, my intuition is that this situation is exactly why those two gurus make us pay attention to the unassigned vs. assigned call.

A plain call 

 

 h.add() ; 

 

does cause such problems, whereas the same call used in an IF condition, like if h.add(), is considered assigned.

Therefore, the safest bet, at least for ordinary blokes like me (unlike you), is to always use the assigned call. And so,

 

 rc=h.add() ; 

 

works well, with no duplicate-key error messages in the log, as in:

 

data _null_ ;
  dcl hash h() ;
  h.definekey("id") ;
  h.definedone() ;
  id=1 ;
  do i=1 to 21 ;
    rc=h.add() ;     
  end ;
h.output(dataset:'want');
stop;
run ;

 

 

 

 

 

 

Quentin
Super User

Thanks @novinosrin, but I really like the unassigned call.  I like that it throws an error when it fails, because it provides automatic built in error-detection.  I like that the accidental duplicate below throws an error.

data _null_ ;
  if _n_=1 then do ;
    dcl hash h() ;
    h.definekey("id") ;
    h.definedone() ;
  end ;
  input id ;
  h.add() ; 
  put _n_= id= ;
  if id=10 then h.output(dataset:'want');
cards ;
1
2
3
3
5
6
7
8
9
10
; 
run ;

If I write a step with an unassigned add() and the step doesn't error, I know that every add() succeeded. I'd much rather get an error than let an add() silently fail.

 

 

It's just that I don't like the possibility of flooding my logs with hundreds (or thousands) of errors.  

 

With just about any other error, SAS protects you from flooding the log.  For example below, you only get three error messages even though every iteration of the DATA step could throw an error:

 

1    options errors=3 ;
2
3    data _null_ ;
4      set sashelp.prdsale ;
5      y=country*1 ;
6    run ;

NOTE: Character values have been converted to numeric values at the places given by:
      (Line):(Column).
      5:5
NOTE: Invalid numeric data, COUNTRY='CANADA' , at line 5 column 5.
ACTUAL=$925.00 PREDICT=$850.00 COUNTRY=CANADA REGION=EAST DIVISION=EDUCATION
PRODTYPE=FURNITURE PRODUCT=SOFA QUARTER=1 YEAR=1993 MONTH=Jan y=. _ERROR_=1 _N_=1
NOTE: Invalid numeric data, COUNTRY='CANADA' , at line 5 column 5.
ACTUAL=$999.00 PREDICT=$297.00 COUNTRY=CANADA REGION=EAST DIVISION=EDUCATION
PRODTYPE=FURNITURE PRODUCT=SOFA QUARTER=1 YEAR=1993 MONTH=Feb y=. _ERROR_=1 _N_=2
NOTE: Invalid numeric data, COUNTRY='CANADA' , at line 5 column 5.
WARNING: Limit set by ERRORS= option reached.  Further errors of this type will not be
         printed.
ACTUAL=$608.00 PREDICT=$846.00 COUNTRY=CANADA REGION=EAST DIVISION=EDUCATION
PRODTYPE=FURNITURE PRODUCT=SOFA QUARTER=1 YEAR=1993 MONTH=Mar y=. _ERROR_=1 _N_=3
NOTE: Missing values were generated as a result of performing an operation on missing
      values.
      Each place is given by: (Number of times) at (Line):(Column).
      1440 at 5:12
NOTE: There were 1440 observations read from the data set SASHELP.PRDSALE.

In fact, I've been playing around trying to flood the log with error messages or bad notes some other way, and couldn't do it.  Even something like:

data _null_ ;
  do i=1 to 10000 ;
    x='A'+1 ;
  end ;
run ;

will automatically stop throwing notes after it has thrown 100 of them.  

Quentin
Super User

Also, it looks to me like maybe SAS is trying to stop the step when it throws an error?  Below is the log from my step where there are duplicates for ID=3, so the add() throws an error:

 

1    data _null_ ;
2      if _n_=1 then do ;
3        dcl hash h() ;
4        h.definekey("id") ;
5        h.definedone() ;
6      end ;
7      input id ;
8      h.add() ;
9      put _n_= id= ;
10     if id=10 then h.output(dataset:'want');
11   cards ;

_N_=1 id=1
_N_=2 id=2
_N_=3 id=3
ERROR: Duplicate key.
_N_=4 id=3
_N_=5 id=5
_N_=6 id=6
_N_=7 id=7
_N_=8 id=8
_N_=9 id=9
_N_=10 id=10
NOTE: The data set WORK.WANT has 9 observations and 1 variables.
NOTE: The SAS System stopped processing this step because of errors.
22   ;

 

There is a note there claiming that the step stopped early because of errors.  That would be great, if it were true, because it would protect you from a flood of errors.  But it's also clear from the log that the step did NOT stop, it kept iterating despite the errors.  Even though the error happens on _N_=4, the step keeps iterating, and executing all the code on each iteration, including writing the hash table to the output dataset.

Tom
Super User

I suspect that whoever at SAS added the hash object functionality to the SAS DATA step did not really understand how SAS DATA steps work and the meaning and usage of the ERRORS option.  Note that it also does not set the _ERROR_ automatic variable. 

 

Why not just trap and count the errors yourself?

188   %let save=%sysfunc(getoption(error));
189   options error=3;
190   data _null_ ;
191     dcl hash h() ;
192     h.definekey('id') ;
193     h.definedone() ;
194     id=1 ;
195     do i=1 to 21 ;
196       rc=h.add();
197       if rc ^=0 then do;
198          put 'ERROR: Unable to add to hash';
199          n_error + 1;
200          if n_error > input(getoption('error'),32.) then do;
201             put 'Note: Stopping due to H.ADD() errors';
202            stop;
203          end;
204       end;
205     end ;
206     stop;
207   run ;

ERROR: Unable to add to hash
ERROR: Unable to add to hash
ERROR: Unable to add to hash
ERROR: Unable to add to hash
Note: Stopping due to H.ADD() errors
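
A variation on the trap-and-count idea, as a rough sketch only: instead of stopping the step, it prints at most ERRORS= messages, then a single warning, and keeps counting. The variable names limit and n_error and the message texts are assumptions made for illustration; they are not anything SAS produces on its own.

```sas
options errors=3 ;

data _null_ ;
  dcl hash h() ;
  h.definekey("id") ;
  h.definedone() ;
  /* read the current ERRORS= setting once (assumed to be numeric) */
  limit = input(getoption("errors"), 32.) ;
  id=1 ;
  do i=1 to 21 ;
    rc=h.add() ;                      /* assigned call: no automatic error */
    if rc ne 0 then do ;
      n_error + 1 ;                   /* count every failed add() */
      if n_error <= limit then
        put "ERROR: Unable to add to hash. " id= i= ;
      else if n_error = limit + 1 then
        put "WARNING: Limit reached. Further add() failures will not be printed." ;
    end ;
  end ;
  put "NOTE: " n_error "add() failures in total." ;
run ;
```

With ERRORS=3 this should write three ERROR lines and one WARNING, which is roughly the behavior the original question asks the unassigned call to have.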

Perhaps it would be easier for them to just make it a syntax error to call H.ADD() without giving the return code a place to go?  Just like I cannot call other functions as if they were statements.

208   data x;
209    mean(1,2,3);
                  -
                  22
                  76
ERROR: Undeclared array referenced: mean.
ERROR 22-322: Syntax error, expecting one of the following: +, =.

ERROR 76-322: Syntax error, statement will be ignored.

210   run;

NOTE: The SAS System stopped processing this step because of errors.

 

 

Quentin
Super User

Good point about it not setting _ERROR_, @Tom.  Maybe there is hope that SAS will improve the error handling in time.

 

Yes, you could use the assigned method call, and trap and count the errors yourself.  So basically you code an assertion that the return code is 0.  I already have an %assert macro that limits the number of error messages thrown when an assertion fails, so could do it like:

 

  id=1 ;
  do i=1 to 21 ;
    rc=h.add() ;
    %assert((rc=0))
  end ;

Or slightly more compact but perhaps less readable:

 

  id=1 ;
  do i=1 to 21 ;
    %assert(h.add()=0)
  end ;

And that's fine.  But when I first learned about the unassigned method call, i.e. H.ADD() without a place for the return code to go, I thought "cool, I don't have to code an assertion myself, because if the method fails, I'll get an error automatically."  And I happily use the unassigned method call a lot, whenever I expect that a method should never fail.  But I didn't realize I was risking generating an unlimited flood of error messages.  I had just assumed it would honor the errors option, until I read in Paul and Don's book that it didn't, and I thought "that can't be true..."

ChrisNZ
Tourmaline | Level 20

The reason is that, as for all object tasks, the data step is blissfully unaware of what the object does, including throwing errors.

data _null_ ;
  dcl hash h() ;
  h.definekey("id") ;
  h.definedone() ;
  id=1 ;
  do i=1 to 21 ;
    h.add() ; 
    putlog _ERROR_=;   
  end ;
run ;

_ERROR_ is never set to 1.

Using RC= is the recommended way, and you can set _ERROR_ manually if needed.
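
A minimal sketch of that approach, assuming the same small input used earlier in the thread. The message text is made up for illustration. As far as I understand it, once _ERROR_ is 1 at the end of an iteration the DATA step dumps the current record and variables to the log, and that dump is what the ERRORS= option limits.

```sas
data _null_ ;
  if _n_=1 then do ;
    dcl hash h() ;
    h.definekey("id") ;
    h.definedone() ;
  end ;
  input id ;
  rc=h.add() ;                 /* assigned call: failure comes back in RC */
  if rc ne 0 then do ;
    put "ERROR: Duplicate key, add() failed. " id= ;
    _error_=1 ;                /* flag the iteration as an error ourselves */
  end ;
cards ;
1
2
3
3
;
run ;
```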

 

Quentin
Super User

@ChrisNZ, do you think it's a good thing that _ERROR_ is not set and the ERRORS option is not honored?  I think I'm inclined to agree with @Tom's suggestion that this may be an oversight.  I can't see any benefit to this behavior.  I suppose it wouldn't be a problem if the step actually stopped when it errored.    But I've seen many cases where despite the fact that the log states that a step stopped early due to errors, it actually ran to completion.  That seems like a bug to me.

 

1    data _null_ ;
2      dcl hash h() ;
3      h.definekey("id") ;
4      h.definedone() ;
5      id=1 ;
6      do i=1 to 3 ;
7        h.add() ;
8        putlog i= _ERROR_=;
9      end ;
10     put "I ran all the way" ;
11   run ;

i=1 _ERROR_=0
ERROR: Duplicate key.
i=2 _ERROR_=0
ERROR: Duplicate key.
i=3 _ERROR_=0
I ran all the way
NOTE: The SAS System stopped processing this step because of errors.

I'm tempted to put in a ballotware item to suggest that hash methods should set _ERROR_, and should honor the ERRORS system option.  Would people disagree with that suggestion?

 

 

ChrisNZ
Tourmaline | Level 20

It is not an oversight.

_ERROR_ is not set for the same reason you have to call missing() on the hash variables: the data step doesn't see what the object does.

 

Whether it should be different is another discussion. But we are talking about the separation between object and data step here, not just about the _ERROR_ variable. You raise a much wider matter, of which the _ERROR_ behaviour is just a small symptom.

 

 

Quentin
Super User

@ChrisNZ wrote:

It is not an oversight.

_ERROR_ is not set for the same reason you have to call missing() on the hash variables: the data step doesn't see what the object does.

 

Whether it should be different is another discussion. But we are talking about the separation between object and data step here, not just about the _ERROR_ variable. You raise a much wider matter, of which the _ERROR_ behaviour is just a small symptom.

 

 


I don't think that's right.  You have to initialize the host variables in the PDV (with call missing or whatever other way) because the arguments to the hash methods are not seen at data step compile time, they are seen at data step execution time.  This design decision had a lot of benefits, e.g. it allows you to use any character expression to define an argument.  So you don't have to hard code the names of keys, output data sets, etc.
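
To make the run-time evaluation concrete, here is a small sketch. The variable names keyvar and outds and the data set name work.want1 are invented for illustration; the point is only that the method arguments are ordinary character expressions resolved when the step executes.

```sas
data _null_ ;
  length id 8 ;                       /* host variable must exist in the PDV */
  keyvar = "id" ;                     /* key name held in a variable */
  outds  = cats("work.want", 1) ;     /* data set name built at run time */
  dcl hash h() ;
  h.definekey(keyvar) ;               /* resolved at execution time, not compile time */
  h.definedone() ;
  id=1 ;
  rc=h.add() ;
  rc=h.output(dataset: outds) ;       /* writes the table to WORK.WANT1 */
run ;
```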

 

The hash methods are obviously capable of writing values to variables in the PDV, as this is one of their main features.  Since _ERROR_ is just another variable in the PDV, I would think it would be straightforward for a hash method to set _ERROR_=1 when the method throws an error. 
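
A short sketch of a method writing into the PDV, using a made-up data variable called name: find() overwrites the blanked-out PDV value with the one stored in the table.

```sas
data _null_ ;
  length id 8 name $10 ;
  dcl hash h() ;
  h.definekey("id") ;
  h.definedata("name") ;
  h.definedone() ;
  id=1 ; name="first" ;
  rc=h.add() ;            /* store (1, "first") in the table */
  name=" " ;              /* blank out the PDV copy */
  rc=h.find() ;           /* find() writes the stored value back into NAME */
  put id= name= rc= ;     /* NAME is "first" again, RC is 0 */
run ;
```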

 

And how to understand the misleading note "NOTE: The SAS System stopped processing this step because of errors." when in fact the step was not stopped?  Is this a sign that the developer was hoping the error would stop the step? Or is this a note that was accidentally triggered somewhere within the bowels of SAS error handling?

 

ChrisNZ
Tourmaline | Level 20

Fair points. Maybe we are saying the same thing:

- The only thing the data step can do with objects is call methods. The rest is totally invisible.

- Likewise the only way the hash object can interact back is by setting values in the PDV. I don't think the odsout object does even that.

 

In that sense, it would indeed make sense that _ERROR_ is set in case of errors. You convinced me. Bring on the ballot entry! 🙂

 

 
