BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
adam_t
Calcite | Level 5

Hi everyone,

 

I am trying to find an elegant way to collect some metadata for all columns in a data set. What I want to know is, for each and every variable in a data set, how many rows are populated? I would like the result of this query to be output into a single data set which contains one row per variable in the original data, and a column giving the number of non-missing rows for that variable. Here's some example data and output:

 

data demo_data;
  infile datalines dsd truncover;
  input Make:$13. Model:$40. Type:$8. Origin:$6. DriveTrain:$5. MSRP:DOLLAR8. Invoice:DOLLAR8. EngineSize:32. Cylinders:32. Horsepower:32. MPG_City:32. MPG_Highway:32. Weight:32. Wheelbase:32. Length:32.;
datalines4;
Acura,,SUV,Asia,All,"$36,945","$33,337",3.5,6,,17,23,4451,106,189
Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2,4,,24,31,2778,101,172
Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4,,22,29,3230,105,183
Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",,6,270,20,28,3575,108,186
Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",,6,225,18,24,3880,115,197
;;;;

Desired output:

 

namenum_not_missing
Cylinders5
DriveTrain5
EngineSize3
Horsepower2
Invoice5
Length5
MPG_City5
MPG_Highway5
MSRP5
Make5
Model4
Origin5
Type5
Weight5
Wheelbase5

 

... where I have inserted missing values in the demo dataset and the resulting metadata shows that some columns contain missing values.

 

I do have a solution to this problem already, but I would like to know whether there is a more elegant or SAS-like way to get the same results (if only to satisfy my own curiosity!). My current (inelegant) solution is as follows:

 

proc contents data=demo_data out=cont_demo_data(keep=name varnum type length) noprint; run;
proc sql noprint;
	select catx(" ", "count(", name, ") as ", name)
	into :list separated by ', '
	from cont_demo_data;
quit;
proc sql;
	create table counts_demo_data as
	select
		&list.
	from demo_data;
quit;
proc transpose 
	data = counts_demo_data
	out = counts_demo_data_t(rename=(_NAME_=name COL1=num_not_missing));
run;

So as you can see, it's somewhat inelegant but it does at least get to the correct answer.

 

Other approaches I've tried that don't give me the result I need:

  • Using proc means as this will only work for numeric variables
  • proc freq with _NUMERIC_ or _CHAR_ in the tables statement - this gives a separate table for each variable, not all in the same table

Thanks!

 

Adam

 

p.s. I'm using base SAS 9.2

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

First, it is possible to use PROC FREQ to generate a single dataset with all the variables and their frequencies.  But instead of the OUTPUT statement, you have to use ODS OUTPUT:

 

proc format;
  value nmis .='.' other='x';
  value $cmis ' '='.' other='x';
run;

ods output onewayfreqs=freq (keep=table frequency f_:);
ods listing close;
proc freq data=demo_data ;
  tables _all_ / missing ;
  format _numeric_ nmis.  _character_ $cmis.;
run;
ods listing;
ods output close;

 

But I think your program is more compact.  And you can make it even more compact by dropping the PROC CONTENTS, and having the first proc sql read from dictionary.columns:

 

proc sql noprint;
	select catx(" ", "count(", name, ") as ", name)
	into :list separated by ', '
	from dictionary.columns 
          where libname='WORK' and memname='DEMO_DATA';
quit;

 

And I suppose you can put the CREATE TABLE statement in the first PROC SQL step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

7 REPLIES 7
Reeza
Super User

In SAS 9.2 that's probably as elegant as you're going to get, and even in 9.4. 

 

The solution I was going to suggest, isn't a succinct. The only risk is for big data sets, where the macro variable may be longer than the 32K character limit. 

 

https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111

 

Full non macro solution

Peter_C
Rhodochrosite | Level 12
proc summary data= your_data ;
var _numeric_ ;
output nmiss= ;
proc transpose data= _last_( drop= _FREQ_ _TYPE_ )
         out= nmiss_counts(rename= col1 = nmiss ) ;
run ;

should work in any old release

unless you have columns in your_data  named _type_ or _freq_  --> then I wouldn't know

Peter_C
Rhodochrosite | Level 12
proc summary data= your_data ;
var _numeric_ ;
output n= ;
proc transpose data= _last_( drop= _FREQ_ _TYPE_ )
       out= non_miss_counts(rename= col1 = n ) ;
run ;

revised to provide counts of non-missing (proc means / summary statistic N) rather than the NMISS statistic

linkinsean
Calcite | Level 5

Based on your data demo_data, I built an 'elegant' macro to give you the desired output dataset.

 

The basic idea is to 1) use sas sql to count the number of non-missing values; 2) use union to concatenate the names and the non-missing values; 3) use macro and do-loop to go through the variable list and repeat the step1 and 2.

 

So, as shown in the codes, I first created the variable list, and then calculate the length of the list that will be used for do-loop based on the position of each variable. With the help of do-loop, I combined each variable name and its non-missing value into one dataset, starting from the second variable (that's why I used 'if then' to control the loop). And voila. below is the output dataset.

 

 

 Non-missing value datasetNon-missing value dataset

%let varlist=Make#Model#Type#Origin#DriveTrain#MSRP#Invoice#EngineSize#Cylinders#Horsepower#MPG_City#MPG_Highway#Weight#Wheelbase#Length;                                                                                         
                                                                                                                                                                                                                                  
%macro nmiss;                                                                                                                                                                                                                     
%let varnum=%eval(1+(%length(%sysfunc(compbl(&varlist))) - %length(%sysfunc(compress(&varlist,'#')))));                                                                                                                           
proc sql;                                                                                                                                                                                                                         
create table nmiss as                                                                                                                                                                                                             
%do i= 1 %to &varnum.;                                                                                                                                                                                                            
%if &i ne 1 %then union;                                                                                                                                                                                                          
select "%scan(&varlist.,&i,'#')" as name, count(%scan(&varlist.,&i,'#'))  as num_not_missing                                                                                                                                      
from demo_data                                                                                                                                                                                                                    
%end;                                                                                                                                                                                                                             
;                                                                                                                                                                                                                                 
quit;                                                                                                                                                                                                                             
%mend;                                                                                                                                                                                                                            
%nmiss; 

 

Well, if you are not comfortable with macro,  you can also use the following codes in sas sql. The coding could be long depending on the number of the variables, but the idea is really straightforward.

 

For example:

select 'Make' as name, count(Make) as num_not_missing from demo_data

union

select 'Model' as name, count(Model) as num_not_missing from demo_data

union

...

Patrick
Opal | Level 21

@adam_t

Here a data step approach which should work for SAS 9.2 and for any number of variables in your source data set.

data demo_data;
  infile datalines dsd truncover;
  input Make:$13. Model:$40. Type:$8. Origin:$6. DriveTrain:$5. MSRP:DOLLAR8. Invoice:DOLLAR8. EngineSize:32. Cylinders:32. Horsepower:32. MPG_City:32. MPG_Highway:32. Weight:32. Wheelbase:32. Length:32.;
datalines4;
Acura,,SUV,Asia,All,"$36,945","$33,337",3.5,6,,17,23,4451,106,189
Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2,4,,24,31,2778,101,172
Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4,,22,29,3230,105,183
Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",,6,270,20,28,3575,108,186
Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",,6,225,18,24,3880,115,197
;;;;
run;

data _null_;
  set demo_data end=last;
  array _c_var {*} _character_;
  array _n_var {*} _numeric_;

  /* hash to collect variable stats */
  if _n_=1 then 
    do;
      length name $32 num_not_missing 8;
      dcl hash h1(ordered:'y');
      h1.defineKey('name');
      h1.defineData('name','num_not_missing');
      h1.defineDone();
    end;

  /* stats for char vars */
  do i=1 to dim(_c_var);
    name=vname(_c_var[i]);
    num_not_missing=0;
    if h1.find() ne 0 then h1.add(); 
    if not missing(_c_var[i]) then 
      do;
        num_not_missing= sum(num_not_missing, 1);
        _rc=h1.replace();
      end;
  end;

  /* stats for num vars */
  do i=1 to dim(_n_var);
    name=vname(_n_var[i]);
    num_not_missing=0;
    if h1.find() ne 0 then h1.add(); 
    if not missing(_n_var[i]) then 
      do;
        num_not_missing= sum(num_not_missing, 1);
        _rc=h1.replace();
      end;
  end;

  /* write stats to table want */
  if last then h1.output(dataset:'want');

run;

proc print data=want noobs;
run;

 

mkeintz
PROC Star

First, it is possible to use PROC FREQ to generate a single dataset with all the variables and their frequencies.  But instead of the OUTPUT statement, you have to use ODS OUTPUT:

 

proc format;
  value nmis .='.' other='x';
  value $cmis ' '='.' other='x';
run;

ods output onewayfreqs=freq (keep=table frequency f_:);
ods listing close;
proc freq data=demo_data ;
  tables _all_ / missing ;
  format _numeric_ nmis.  _character_ $cmis.;
run;
ods listing;
ods output close;

 

But I think your program is more compact.  And you can make it even more compact by dropping the PROC CONTENTS, and having the first proc sql read from dictionary.columns:

 

proc sql noprint;
	select catx(" ", "count(", name, ") as ", name)
	into :list separated by ', '
	from dictionary.columns 
          where libname='WORK' and memname='DEMO_DATA';
quit;

 

And I suppose you can put the CREATE TABLE statement in the first PROC SQL step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
adam_t
Calcite | Level 5

Thank you everyone for your responses, this was really helpful! I've some great ideas to go away and try now.

 

Adam

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 9706 views
  • 7 likes
  • 6 in conversation