BookmarkSubscribeRSS Feed

PROC FREQ is a core procedure.  It ought to be possible to display MISSING values at the bottom of the output rather than at the top (without creating a NOTSORTED format for every single variable!).

 

I frequently want to see -- and report -- cumulative frequencies of the non-missing values before seeing the impact of the MISSING values on the distribution of my nominal variables.  It is a big pain-in-the-bottom to not be able to put MISSING values at the bottom.

 

I am suggesting a BOTTOMMISS option -- or perhaps a NOPAININTHE option -- for the FREQ (and TABULATE) procedures.

 

Example:

PROC FREQ DATA=DATA1 ; TABLE V / MISSING NOPAININTHE ; RUN ;

 

v1 Freq % CumFreq Cum%
CAT1 274,948 19.31% 274,948 19.31%
CAT2 6,663 0.47% 281,611 19.78%
CAT3 196,687 13.82% 478,298 33.60%
CAT4 560,603 39.38% 1,038,901 72.97%
CAT5 299,029 21.00% 1,337,930 93.98%
Missing 85,753 6.02% 1,423,683 100.00%
8 Comments
ballardw
Super User

Your concern relates to the ORDER of values. MISSING is always the smallest value.

For one-way freqs as you show it is not hard to work around as needed:

 

data example;
   set sashelp.class;
   if name in ('Alice' 'John') then call missing(age);
run;


proc freq data=example;
  tables age/missing out=work.count;
run;

data need;
   set work.count(where=(not missing(age)))
       work.count(where=(missing(age)))
   ;
   retain cumcount cumperc;
   cumcount= sum(cumcount, count);
   cumperc = sum(cumperc, percent);
run;

 

 

Reeza
Super User

I'd be more inclined to create a macro that automatically created a format and applied it to the variable. I think an option for this type of ordering would be useful though.

bbenbaruch
Quartz | Level 8

From a practical point of view, when I need to deliver a report I actually find it faster to copy my output to Excel and rearrange the data there. 

 

Another quick work-around is to convert missing numeric data to 9999999999 and missing character data to '999999999999' in EBCDIC or to '~~~~~~' in ascii.

 

I am not aware, however, of any "work-arounds" that are practical for large datasets where one has to deal with hundreds of variables.

 

That there are "work-arounds" is beside the point.  My point is that we should not have to concoct work-arounds.

ballardw
Super User

The example I provided could be turned into a very generic macro with 3 parameters in about 5-10 minutes of coding. Though it should probably include a Proc print call which would add 3 lines of code.

 

And will work for as large of a data set as you may have for one-wary frequency tables.

 

There are many examples on this forum of finding generating code to do the same thing will all of the variables in a data set. Create a small macro to call for one variable and data set. Then use one of the mentioned approaches to call the macro for each variable.

Done.

 

 

I have to say that I find the combination of statements "I  actually find it faster to copy my output to Excel and rearrange the data there. " and "we should not have to concoct work-arounds" very entertaining.

Kathryn_SAS
SAS Employee

Some other options to consider are:

- use the MISSPRINT option on the TABLES statement so that missing values are included in the table but not the statistics

- use the DESCENDING option on the CLASS statement in PROC TABULATE

data test;

input x;

cards;

 

1

2

.

;

run;

proc freq data=test ;

tables x / missprint ;

run;

proc tabulate data=test;

class x / descending missing;

table x, n;

run;

 

PaigeMiller
Diamond | Level 26

I think the point of the original suggestion was so that we don't have @ballardw programming his own solution and @Reeza programming another solution and @Kathryn_SAS programming another solution and person ABC not knowing how to do this programming so person ABC doesn't have a solution — why should everyone have to program this, why not have an option that does this?

 

On the other hand, probably many people wouldn't use this option ... I know I have never needed it ... so what is the threshold at which an option becomes important enough to the general SAS population to be important enough for SAS to program (knowing full well that they are not going to program everything that anyone might ever want). I don't know the answer.

 

bbenbaruch
Quartz | Level 8

Ballardw inspired and challenged me to stop concocting an Excel solution -- as practical as it may be when under the deadline to deliver a report in Excel.  (My apologies to her/him if this means that s/he will no longer be entertained by me.)

 

So I developed the following MACRO. For variables with many levels or values, it is impractical without using FORMAT statements to reduce them.  Hence my CONTINU and DISCRETE formats.

 

NOTE: My MACRO assumes that I am operating in an environment using the ASCII collating sequence in which the tilde (~) is the highest value.

 

%MACRO NOPAININTHE ; 
     ARRAY Vn(*) _NUMERIC_ ;
     ARRAY Vc(*) _CHAR_ ;
     DO i = 1 TO DIM(Vn) ;
          IF MISSING(Vn{i}) THEN Vn{i} = 9999999999 ;
     END ;
     DO i = 1 TO DIM(Vc) ;
          IF MISSING(Vc{i}) THEN Vc{i} = '~MISSING' ;
    END ;
%MEND ;


PROC FORMAT ;
VALUE CONTINU LOW-<0="Neg" 0="0" 0-<1="Betw 0-1" 1="1" 1-<2="Betw 1-2" 2="2" 2-<3="2-<3" 3="3" 3-<4="3-<4" 4="4" 4-<5="4-<5" 5="5" 5-<6="5-<6"
6="6" 6-<7="6-<7" 7="7" 7-<8="7-<8" 8="8" 8-<9="8-<9" 9="9" 9-<10="9-<10" 10="10"
10-20="10-20" 20-100="20-100" 100-500="100-500" 500-1000="500-1K" 1000-2500="1K-2.5K" 2500-5000="2.5K-5K"
5000-10000="5K-10K" 10000-25000="10K-25K" 25000-HIGH=">25K" ;


VALUE DISCRETE LOW-<0="Neg" 0="0" 10-20="10-20" 20-100="20-100" 100-500="100-500" 500-1000="500-1K" 1000-2500="1K-2.5K" 2500-5000="2.5K-5K"
5000-10000="5K-10K" 10000-25000="10K-25K" 25000-HIGH=">25K" ;
RUN ;


DATA TEST ;
     SET DATASET ;
     %NOPAININTHE ;
     FORMAT V3. CONTINU. ;
RUN ;


PROC FREQ DATA=TEST ;
     TABLE _ALL_ / MISSING ;
RUN ;

 

SAMPLE RESULTS:

V3
  Freq % CumFreq Cum%
QXQ 73 0.30% 73 0.30%
YQY 1,608 6.50% 1,681 6.80%
Z34 3,917 15.80% 5,598 22.60%
X00 18,935 76.30% 24,533 98.80%
~ 286 1.20% 24,819 100.00%

 

V4
  Freq % CumFreq Cum%
8 8 0.00% 8 0.00%
9 1 0.00% 9 0.00%
10-19 6 0.00% 15 0.10%
20-100 107 0.40% 122 0.50%
100-500 18,487 74.50% 18,609 75.00%
500-1K 3,109 12.50% 21,718 87.50%
1K-2.5K 360 1.50% 22,078 89.00%
2.5K-5K 12 0.00% 22,090 89.00%
5K-10K 1 0.00% 22,091 89.00%
9999999999 2,728 11.00% 24,819 100.00%
ballardw
Super User

I was thinking some more along these lines. First a small macro to do a single variable, then a driver to call multiple times.

Just like proc freq this does ALL the variables if not limited. An exercise for the interested read is to add a list of variables such as a specific table statement might use and work with that.

 

HINT: if the added parameter is provided with values then you don't use the Proc SQL to get the variable list.

 

%macro bottommissing(dsname=, var=);
proc freq data=&dsname. NOPRINT;
  tables &var. /missing out=work.__count;
run;

data work.__need;
   set work.__count(where=(not missing(&var.)))
       work.__count(where=(missing(&var.)))
   ;
   retain __cumcount __cumperc;
   __cumcount= sum(__cumcount, count);
   __cumperc = sum(__cumperc, percent);
   label 
      __cumcount = 'Cumulative Frequency'
      __cumperc  = 'Cumulative Percent'
   ;

run;

Proc print data=work.__need noobs label;
run;
%mend;


%macro bmdriver (data= );
   %if %sysfunc(countw(&data.))= 1 %then %do;
      %let lib=WORK;
      %let mem=%upcase(&data.);
   %end;
   %else %do;
      %let lib= %upcase(%scan(&data.,1));
      %let mem= %upcase(%scan(&data.,2));
   %end;
   options missing=' ';
   proc sql noprint;
      select name into : varlist separated by ' '
      from dictionary.columns 
      where libname="&lib." and memname="&mem."
      ;
   quit;
   %do i = 1 %to %sysfunc(countw(&varlist.));
      %let v = %scan(&varlist., &i);
       %bottommissing(dsname=&data., var=&v.)
   %end;
   options missing='.';
%mend;

/* create some data with missing values*/
data work.example;
   set sashelp.class;
   if name in ('Alice' 'John') then call missing(name, age, height);
   if sex = 'F' and age=14 then call missing(sex, weight);
run;



%bmdriver (data=work.example)