Solved: How to count non-missing rows for *all* columns and store results in a...

adam_t · Posted 10-13-2017 11:34 AM

Hi everyone,

I am trying to find an elegant way to collect some metadata for all columns in a data set. What I want to know is, for each and every variable in a data set, how many rows are populated? I would like the result of this query to be output into a single data set which contains one row per variable in the original data, and a column giving the number of non-missing rows for that variable. Here's some example data and output:

data demo_data;
  infile datalines dsd truncover;
  input Make:$13. Model:$40. Type:$8. Origin:$6. DriveTrain:$5. MSRP:DOLLAR8. Invoice:DOLLAR8. EngineSize:32. Cylinders:32. Horsepower:32. MPG_City:32. MPG_Highway:32. Weight:32. Wheelbase:32. Length:32.;
datalines4;
Acura,,SUV,Asia,All,"$36,945","$33,337",3.5,6,,17,23,4451,106,189
Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2,4,,24,31,2778,101,172
Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4,,22,29,3230,105,183
Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",,6,270,20,28,3575,108,186
Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",,6,225,18,24,3880,115,197
;;;;

Desired output:

name	num_not_missing
Cylinders	5
DriveTrain	5
EngineSize	3
Horsepower	2
Invoice	5
Length	5
MPG_City	5
MPG_Highway	5
MSRP	5
Make	5
Model	4
Origin	5
Type	5
Weight	5
Wheelbase	5

... where I have inserted missing values in the demo dataset and the resulting metadata shows that some columns contain missing values.

I do have a solution to this problem already, but I would like to know whether there is a more elegant or SAS-like way to get the same results (if only to satisfy my own curiosity!). My current (inelegant) solution is as follows:

proc contents data=demo_data out=cont_demo_data(keep=name varnum type length) noprint; run;
proc sql noprint;
	select catx(" ", "count(", name, ") as ", name)
	into :list separated by ', '
	from cont_demo_data;
quit;
proc sql;
	create table counts_demo_data as
	select
		&list.
	from demo_data;
quit;
proc transpose 
	data = counts_demo_data
	out = counts_demo_data_t(rename=(_NAME_=name COL1=num_not_missing));
run;

So as you can see, it's somewhat inelegant but it does at least get to the correct answer.

Other approaches I've tried that don't give me the result I need:

Using proc means as this will only work for numeric variables
proc freq with _NUMERIC_ or _CHAR_ in the tables statement - this gives a separate table for each variable, not all in the same table

Thanks!

Adam

p.s. I'm using base SAS 9.2

mkeintz · Posted 10-15-2017 02:11 PM

First, it is possible to use PROC FREQ to generate a single dataset with all the variables and their frequencies. But instead of the OUTPUT statement, you have to use ODS OUTPUT:

proc format;
  value nmis .='.' other='x';
  value $cmis ' '='.' other='x';
run;

ods output onewayfreqs=freq (keep=table frequency f_:);
ods listing close;
proc freq data=demo_data ;
  tables _all_ / missing ;
  format _numeric_ nmis.  _character_ $cmis.;
run;
ods listing;
ods output close;

But I think your program is more compact. And you can make it even more compact by dropping the PROC CONTENTS, and having the first proc sql read from dictionary.columns:

proc sql noprint;
	select catx(" ", "count(", name, ") as ", name)
	into :list separated by ', '
	from dictionary.columns 
          where libname='WORK' and memname='DEMO_DATA';
quit;

And I suppose you can put the CREATE TABLE statement in the first PROC SQL step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 10-13-2017 11:39 AM

In SAS 9.2 that's probably as elegant as you're going to get, and even in 9.4.

The solution I was going to suggest, isn't a succinct. The only risk is for big data sets, where the macro variable may be longer than the 32K character limit.

https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111

Full non macro solution

Peter_C · Posted 10-13-2017 03:31 PM

proc summary data= your_data ;
var _numeric_ ;
output nmiss= ;
proc transpose data= _last_( drop= _FREQ_ _TYPE_ )
         out= nmiss_counts(rename= col1 = nmiss ) ;
run ;

should work in any old release

unless you have columns in your_data named _type_ or _freq_ --> then I wouldn't know

Peter_C · Posted 10-13-2017 03:37 PM

proc summary data= your_data ;
var _numeric_ ;
output n= ;
proc transpose data= _last_( drop= _FREQ_ _TYPE_ )
       out= non_miss_counts(rename= col1 = n ) ;
run ;

revised to provide counts of non-missing (proc means / summary statistic N) rather than the NMISS statistic

linkinsean · Posted 10-13-2017 05:04 PM

Based on your data demo_data, I built an 'elegant' macro to give you the desired output dataset.

The basic idea is to 1) use sas sql to count the number of non-missing values; 2) use union to concatenate the names and the non-missing values; 3) use macro and do-loop to go through the variable list and repeat the step1 and 2.

So, as shown in the codes, I first created the variable list, and then calculate the length of the list that will be used for do-loop based on the position of each variable. With the help of do-loop, I combined each variable name and its non-missing value into one dataset, starting from the second variable (that's why I used 'if then' to control the loop). And voila. below is the output dataset.

Non-missing value dataset

%let varlist=Make#Model#Type#Origin#DriveTrain#MSRP#Invoice#EngineSize#Cylinders#Horsepower#MPG_City#MPG_Highway#Weight#Wheelbase#Length;                                                                                         
                                                                                                                                                                                                                                  
%macro nmiss;                                                                                                                                                                                                                     
%let varnum=%eval(1+(%length(%sysfunc(compbl(&varlist))) - %length(%sysfunc(compress(&varlist,'#')))));                                                                                                                           
proc sql;                                                                                                                                                                                                                         
create table nmiss as                                                                                                                                                                                                             
%do i= 1 %to &varnum.;                                                                                                                                                                                                            
%if &i ne 1 %then union;                                                                                                                                                                                                          
select "%scan(&varlist.,&i,'#')" as name, count(%scan(&varlist.,&i,'#'))  as num_not_missing                                                                                                                                      
from demo_data                                                                                                                                                                                                                    
%end;                                                                                                                                                                                                                             
;                                                                                                                                                                                                                                 
quit;                                                                                                                                                                                                                             
%mend;                                                                                                                                                                                                                            
%nmiss;

Well, if you are not comfortable with macro, you can also use the following codes in sas sql. The coding could be long depending on the number of the variables, but the idea is really straightforward.

For example:

select 'Make' as name, count(Make) as num_not_missing from demo_data

union

select 'Model' as name, count(Model) as num_not_missing from demo_data

union

...

Patrick · Posted 10-14-2017 11:35 PM

@adam_t

Here a data step approach which should work for SAS 9.2 and for any number of variables in your source data set.

data demo_data;
  infile datalines dsd truncover;
  input Make:$13. Model:$40. Type:$8. Origin:$6. DriveTrain:$5. MSRP:DOLLAR8. Invoice:DOLLAR8. EngineSize:32. Cylinders:32. Horsepower:32. MPG_City:32. MPG_Highway:32. Weight:32. Wheelbase:32. Length:32.;
datalines4;
Acura,,SUV,Asia,All,"$36,945","$33,337",3.5,6,,17,23,4451,106,189
Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2,4,,24,31,2778,101,172
Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4,,22,29,3230,105,183
Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",,6,270,20,28,3575,108,186
Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",,6,225,18,24,3880,115,197
;;;;
run;

data _null_;
  set demo_data end=last;
  array _c_var {*} _character_;
  array _n_var {*} _numeric_;

  /* hash to collect variable stats */
  if _n_=1 then 
    do;
      length name $32 num_not_missing 8;
      dcl hash h1(ordered:'y');
      h1.defineKey('name');
      h1.defineData('name','num_not_missing');
      h1.defineDone();
    end;

  /* stats for char vars */
  do i=1 to dim(_c_var);
    name=vname(_c_var[i]);
    num_not_missing=0;
    if h1.find() ne 0 then h1.add(); 
    if not missing(_c_var[i]) then 
      do;
        num_not_missing= sum(num_not_missing, 1);
        _rc=h1.replace();
      end;
  end;

  /* stats for num vars */
  do i=1 to dim(_n_var);
    name=vname(_n_var[i]);
    num_not_missing=0;
    if h1.find() ne 0 then h1.add(); 
    if not missing(_n_var[i]) then 
      do;
        num_not_missing= sum(num_not_missing, 1);
        _rc=h1.replace();
      end;
  end;

  /* write stats to table want */
  if last then h1.output(dataset:'want');

run;

proc print data=want noobs;
run;

mkeintz · Posted 10-15-2017 02:11 PM

First, it is possible to use PROC FREQ to generate a single dataset with all the variables and their frequencies. But instead of the OUTPUT statement, you have to use ODS OUTPUT:

proc format;
  value nmis .='.' other='x';
  value $cmis ' '='.' other='x';
run;

ods output onewayfreqs=freq (keep=table frequency f_:);
ods listing close;
proc freq data=demo_data ;
  tables _all_ / missing ;
  format _numeric_ nmis.  _character_ $cmis.;
run;
ods listing;
ods output close;

But I think your program is more compact. And you can make it even more compact by dropping the PROC CONTENTS, and having the first proc sql read from dictionary.columns:

proc sql noprint;
	select catx(" ", "count(", name, ") as ", name)
	into :list separated by ', '
	from dictionary.columns 
          where libname='WORK' and memname='DEMO_DATA';
quit;

And I suppose you can put the CREATE TABLE statement in the first PROC SQL step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

adam_t · Posted 10-17-2017 07:32 AM

Thank you everyone for your responses, this was really helpful! I've some great ideas to go away and try now.

Adam

How to count non-missing rows for all columns and store results in a single data set

Re: How to count non-missing rows for all columns and store results in a single data set

Re: How to count non-missing rows for all columns and store results in a single data set

Re: How to count non-missing rows for all columns and store results in a single data set

Re: How to count non-missing rows for all columns and store results in a single data set

Re: How to count non-missing rows for all columns and store results in a single data set

Re: How to count non-missing rows for all columns and store results in a single data set

Re: How to count non-missing rows for all columns and store results in a single data set

Re: How to count non-missing rows for all columns and store results in a single data set

SAS Innovate 2025: Call for Content

Classroom Training Available!