BookmarkSubscribeRSS Feed

When I make a format, about 99.9% of the time I expect it to be exhaustive of the values that will be encountered.  It would be nice to be able to define a format that would throw an error to the log if a value was not found.

 

For example:

 

proc format;
  value sex
  1='F'
  2='M'
  other=_ERROR_  /*pseudo code*/
;
run;

data have;
  input sex ;
  cards;
1
1
3
;
run;

 

When the format is used, it would throw an error to the log.  So below PROC step and DATA step would both throw errors, ideally listing the invalid value:

 

proc freq data=have;
  tables sex;
format sex sex.; run; data want; set have; sexC=put(sex,sex.); run;

 

I know there are alternatives, e.g.,

data want;
  set have;
  sexC=put(sex,sex.);
  if sexC="_ERROR_" then put "ERROR: invalid value " sex=;
run;

But it would be nice to have a way to say when a format is created, "if any value is outside the domain of values defined in the format, throw an error [or warning or note]."

 

9 Comments
RW9
Diamond | Level 26
Diamond | Level 26

But then wouldn't the reverse then be a problem?  If you have a variable where most formats are known but, others might not, take lab data:

Test a

Test b

Unscheduled Test c

 

Now I have a format which formats test a and test b, but c is not a scheduled one, I wouldn't necessarily want errors/warnings pushed out to the log for this as I would want the ones that match to code, the ones that don't to be as is.  So then there would need to be an option to hide warnings/erros which I am not fond of.

 

Generally speaking however, since the 32bit vs 64bit debacle I avoid the use of formats and any other proprietary binary files as much as possible, so would have a code variable and decode variable and select() or case block the new variable in (i.e. to have both)

Quentin
Super User

Thanks @RW9 for the thoughts.  I agree that backwards compatibility would be key.  I would not expect other=_error_ to become a new default.  Instead, it would just be a new option for people to add other=_error_ to format definitions when they want this sort of error detection.

 

Or perhaps an alternative would be a system option, FormatMissingValue=Error|Warning|Note|Nonote.  Since honestly, usually I would want to have an error any time there is a value that does not fit into one of my format categories.

 

Agree that permanent format catalogs are often too much of a hassle to maintain than I find worthwhile.  But I still rely on formats a LOT.  I just simply keep a .sas file with the format definitions to be loaded whenever needed, or better yet a SAS dataset that can be used to create a format using PROC FORMAT cntlin. 

ballardw
Super User

Maybe turn the problem inside-out: use a custom informat (invalue statement in proc format) to verify the data when reading it. Then other=_error_ is legal.

 

Or possibly this somewhat kludgy approach would work for you:

proc format library=work;
  value sex
  1='F'
  2='M'
;
  invalue sex
  1,2=_same_
  other=_ERROR_ 
;
run;

data have;
  input sex ;
  cards;
1
1
3
;
run;
data want;
  set have;
  sexC=put(input(sex,sex.),sex.);
run;

This generates 

NOTE: Numeric values have been converted to character values at the places given by:
      (Line):(Column).
      680:18
NOTE: Invalid argument to function INPUT at line 680 column 12.
sex=3 sexC=. _ERROR_=1 _N_=3
NOTE: Mathematical operations could not be performed at the following places. The results of the
      operations have been set to missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      1 at 680:12

because the Input implies a character value.

 

Quentin
Super User

Thanks @ballardw that's a nifty approach, but I'm not really thinking of this as a data validation problem (where my goal is to validate that all the data are in the expected domain). 

 

Instead I'm just thinking of this as an opportunity for enhanced error detection at run time, without needing to modify code that uses the format.  If I write an if-then-else block or CASE statement, I always end with ELSE PUT "ERROR: INCOMPLETE LOGIC"; or some-such. So you get real-time error detection as the code run, at the cost of adding one statement.

 

It would be nice to have some "insurance" at run time that if I left a value out of a format definition (or if a new unexpected value somehow snuck in), I would get an error from my code, rather than the value formatted as $best or whatever.  Without having to change the code itself.

ChrisHemedinger
Community Manager

@Quentin I can't speak to the feasibility of your suggestion.  But what would you expect when OPTIONS NOFMTERR is set?  Currently, an error is generated when you reference a format that isn't in the search path.  But EG and other clients set NOFMTERR because the default setting (throwing the error) prevents you from opening data if not all formats are present.

 

Put another way, what would you expect a data viewer (VIEWTABLE, EG, SAS Studio) to do if showing data with an "error" value in the table view?

Quentin
Super User

Thanks @ChrisHemedinger.

 

If NOFMTERR is set, and the format doesn't exist in the search path, I would not expect the other=_error_ to do anything.  That is, since the format isn't known to the SAS session, it can't be used at all, and you end up with a default best. format.  I don't think that is an issue.

 

What would I expect a data viewer to show if an "invalid" value is displayed?  I think I would want it to display whatever it displays now (using whatever default best. format that is).  I don't think I would want it to display "ERROR".  If it is possible for  these viewers to write log messages when they open a table, I would want them to write ERROR: s to the log each time an invalid value is encountered (until the ERRORS limiit is reached).

 

Simarly, I would not want put(3, sex.) to return "_ERROR_" (as it does now, because PROC format assumes I just forgot quote marks.  I would want it (I think) to still return "3" (just like it would if there were no other= defined in the format), I would just want the ERROR message thrown to the log.

 

So for the sake of backwards compatibility, I woul not want to have other=_error_ change how invalid values are actually displayed/formatted.  It would simply be a way to generate an error in the log when invalid values are encountered.  "invalid" meaning undefined in the format.

 

Make sense?

ChrisHemedinger
Community Manager

@Quentin Yes, your explanation makes sense.  It's just that you have to be careful what you ask for, as the act of "generating an error in the log" can have more implications than just providing information.  An ERROR can cause the current step to stop what it's doing, can throw an exception that a client app would need to handle, can set various error-related flags that influence downstream processing, etc.

 

It sounds like you're looking for PROC FORMAT to help fill the gap for data quality issues.  There are other practices (range checking, data characteristic checks, etc) that can be used for that.  I know you're aware of these and are looking for something that's a little more "built in", but I'm not sure this particular enhancement would be enough to cover all such gaps.

Quentin
Super User

Agree @ChrisHemedinger it coulnd't catch everything.  But since other=_error_ is available for informats, seems like it would be equally useful for formats as well.  Could catch not only invalid data, but also the occasional "oops, I forgot to include a value in my format."  I guess it could be easier for informats, since they are only used when reading values, not every time a value is displayed.  If it were available, I would include it in the toolbox of "defensive programming" techniques.

 

I guess as an alternative, if there were a system option to make SAS throw even a NOTE any time it looked up a value in a format and could not find it (so resorted to best.), that would probably suffice for my use case.  So similar to the note "NOTE: At least one W.D format was too small for the number to be printed. The decimal may be shifted by the "BEST" format" which is thrown when data don't fit the proscribed format, and SAS chooses BEST format instead.  Could be "NOTE: Value not found in format lookup.  BEST format used instead."  In both cases, I've told SAS to apply a format, and the format I've specified cannot be applied for a certain value, so SAS chooses BEST instead, and it would be nice to be told in the log that the format I specified cannot be used.

ChrisNZ
Tourmaline | Level 20

> a system option to make SAS throw even a NOTE any time it looked up a value in a format and could not find it

 

That's the way to go I reckon. Most people don't want more notes, when they only formatted the values that mattered to them (though I do like the idea of leaving no gap, hence this suggestion: https://communities.sas.com/t5/SASware-Ballot-Ideas/Add-the-missing-keyword-in-proc-format/idi-p/226...).

 

On a side note, since we are mentionning options to control the log messages, we need more of these options, so that title font height issues can be made in to a note rather than a warning, and MERGE statement has more than one data set with repeats of BY values can be made into a warning rather than a note. There's just a handful usch options at the moment.