03-20-2014 02:46 PM
After a lot of fumbling around, I came to the following guess: when building a table on a formatted variable, Proc Freq treats every value that formats the same as a missing value as if it were missing. The following example illustrates this:
proc format; value test low-0 = "LOW" OTHER = "HIGH"; run;
data test; output; do x = -3 to 3; output; end; format x test.; run;
proc print; run;
proc freq data=test; table x; run;
Notice the absence of the HIGH category in Freq output. Remove the first OUTPUT statement in the datastep (thereby removing the missing x value from the dataset) and the HIGH category reappears in Freq output.
Has anybody else noticed this?
03-20-2014 03:40 PM
Not the same, but a similar oddity: if you run it through Proc Summary, the missing turns into a -3.
proc summary data=test nway;
output out=testsum n=count;
format x f4.;
Proc summary returns the smallest non-missing value as the value of the formatted class variable.
That is why my custom formats pretty much always have an explicit missing category whenever I use the OTHER option.
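For what it's worth, here is a minimal sketch of such a format, reusing the TEST format and data from the first post. The label "MISSING" is only an illustrative choice and does not come from the original example.

```sas
/* Sketch: give missing its own category so OTHER no longer absorbs it. */
proc format;
  value test
    .     = "MISSING"   /* explicit missing category (label is an assumption) */
    low-0 = "LOW"
    other = "HIGH";
run;

/* Same test data as in the first post: one missing x, then -3 to 3. */
data test;
  output;
  do x = -3 to 3;
    output;
  end;
  format x test.;
run;

proc freq data=test;
  table x;
run;
```

With the missing value in its own category, HIGH should only group the values 1 to 3, so it no longer shares a formatted value with missing.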
03-20-2014 03:48 PM
When PROC FREQ counts all the HIGH and LOW values, it only stores one numeric value for each category. It stores the lowest value that actually appears in the data set. So here are some results I would expect from your test.
1. If PROC FREQ were to create an output data set, the actual unformatted values for X would be missing and -3.
2. If you were to add the MISSING option when creating the table, HIGH would appear first and LOW would appear second because missing is less than -3.
3. If you were to remove the first OUTPUT statement, but still include the MISSING option, the order would switch and LOW would appear before HIGH because -3 is less than 1.
In every case, though, the unformatted values in the output data set would clarify what PROC FREQ is doing.
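A quick sketch of how to check this, assuming the TEST data set and format from the first post. The data set name FREQOUT is made up for illustration; MISSING and OUT= are options of the TABLES statement.

```sas
/* Sketch: keep the missing category in the table and save the counts. */
proc freq data=test;
  tables x / missing out=freqout;
run;

/* Print the saved counts with the format removed from x, */
/* so the stored (unformatted) value of each category is visible. */
proc print data=freqout;
  format x;
run;
```

Dropping the first OUTPUT statement from the data step and rerunning this covers the third case as well.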
03-20-2014 04:55 PM
Thank you, that helps me understand what's going on. There is some logic in storing the lowest value represented in a category. The logic fails when it extends to missing values because the special treatment given almost everywhere to missing values is de facto extended to non-missing values. This can be very confusing; it was for me.
I will try to always remember Ballardw's suggestion of always including an explicit missing category when defining user formats.
I wish I had read this in the SAS documentation.