Hi everyone,
I am trying to find an elegant way to collect some metadata for all columns in a data set. What I want to know is, for each and every variable in a data set, how many rows are populated? I would like the result of this query to be output into a single data set which contains one row per variable in the original data, and a column giving the number of non-missing rows for that variable. Here's some example data and output:
data demo_data;
infile datalines dsd truncover;
input Make:$13. Model:$40. Type:$8. Origin:$6. DriveTrain:$5. MSRP:DOLLAR8. Invoice:DOLLAR8. EngineSize:32. Cylinders:32. Horsepower:32. MPG_City:32. MPG_Highway:32. Weight:32. Wheelbase:32. Length:32.;
datalines4;
Acura,,SUV,Asia,All,"$36,945","$33,337",3.5,6,,17,23,4451,106,189
Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2,4,,24,31,2778,101,172
Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4,,22,29,3230,105,183
Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",,6,270,20,28,3575,108,186
Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",,6,225,18,24,3880,115,197
;;;;
Desired output:
name | num_not_missing |
Cylinders | 5 |
DriveTrain | 5 |
EngineSize | 3 |
Horsepower | 2 |
Invoice | 5 |
Length | 5 |
MPG_City | 5 |
MPG_Highway | 5 |
MSRP | 5 |
Make | 5 |
Model | 4 |
Origin | 5 |
Type | 5 |
Weight | 5 |
Wheelbase | 5 |
... where I have inserted missing values in the demo dataset and the resulting metadata shows that some columns contain missing values.
I do have a solution to this problem already, but I would like to know whether there is a more elegant or SAS-like way to get the same results (if only to satisfy my own curiosity!). My current (inelegant) solution is as follows:
proc contents data=demo_data out=cont_demo_data(keep=name varnum type length) noprint; run;
proc sql noprint;
select catx(" ", "count(", name, ") as ", name)
into :list separated by ', '
from cont_demo_data;
quit;
proc sql;
create table counts_demo_data as
select
&list.
from demo_data;
quit;
proc transpose
data = counts_demo_data
out = counts_demo_data_t(rename=(_NAME_=name COL1=num_not_missing));
run;
So as you can see, it's somewhat inelegant but it does at least get to the correct answer.
Other approaches I've tried that don't give me the result I need:
Thanks!
Adam
p.s. I'm using base SAS 9.2
First, it is possible to use PROC FREQ to generate a single dataset with all the variables and their frequencies. But instead of the OUTPUT statement, you have to use ODS OUTPUT:
proc format;
value nmis .='.' other='x';
value $cmis ' '='.' other='x';
run;
ods output onewayfreqs=freq (keep=table frequency f_:);
ods listing close;
proc freq data=demo_data ;
tables _all_ / missing ;
format _numeric_ nmis. _character_ $cmis.;
run;
ods listing;
ods output close;
But I think your program is more compact. And you can make it even more compact by dropping the PROC CONTENTS, and having the first proc sql read from dictionary.columns:
proc sql noprint;
select catx(" ", "count(", name, ") as ", name)
into :list separated by ', '
from dictionary.columns
where libname='WORK' and memname='DEMO_DATA';
quit;
And I suppose you can put the CREATE TABLE statement in the first PROC SQL step.
In SAS 9.2 that's probably as elegant as you're going to get, and even in 9.4.
The solution I was going to suggest, isn't a succinct. The only risk is for big data sets, where the macro variable may be longer than the 32K character limit.
https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111
Full non macro solution
proc summary data= your_data ;
var _numeric_ ;
output nmiss= ;
proc transpose data= _last_( drop= _FREQ_ _TYPE_ )
out= nmiss_counts(rename= col1 = nmiss ) ;
run ;
should work in any old release
unless you have columns in your_data named _type_ or _freq_ --> then I wouldn't know
proc summary data= your_data ; var _numeric_ ; output n= ; proc transpose data= _last_( drop= _FREQ_ _TYPE_ ) out= non_miss_counts(rename= col1 = n ) ; run ;
revised to provide counts of non-missing (proc means / summary statistic N) rather than the NMISS statistic
Based on your data demo_data, I built an 'elegant' macro to give you the desired output dataset.
The basic idea is to 1) use sas sql to count the number of non-missing values; 2) use union to concatenate the names and the non-missing values; 3) use macro and do-loop to go through the variable list and repeat the step1 and 2.
So, as shown in the codes, I first created the variable list, and then calculate the length of the list that will be used for do-loop based on the position of each variable. With the help of do-loop, I combined each variable name and its non-missing value into one dataset, starting from the second variable (that's why I used 'if then' to control the loop). And voila. below is the output dataset.
%let varlist=Make#Model#Type#Origin#DriveTrain#MSRP#Invoice#EngineSize#Cylinders#Horsepower#MPG_City#MPG_Highway#Weight#Wheelbase#Length;
%macro nmiss;
%let varnum=%eval(1+(%length(%sysfunc(compbl(&varlist))) - %length(%sysfunc(compress(&varlist,'#')))));
proc sql;
create table nmiss as
%do i= 1 %to &varnum.;
%if &i ne 1 %then union;
select "%scan(&varlist.,&i,'#')" as name, count(%scan(&varlist.,&i,'#')) as num_not_missing
from demo_data
%end;
;
quit;
%mend;
%nmiss;
Well, if you are not comfortable with macro, you can also use the following codes in sas sql. The coding could be long depending on the number of the variables, but the idea is really straightforward.
For example:
select 'Make' as name, count(Make) as num_not_missing from demo_data
union
select 'Model' as name, count(Model) as num_not_missing from demo_data
union
...
Here a data step approach which should work for SAS 9.2 and for any number of variables in your source data set.
data demo_data;
infile datalines dsd truncover;
input Make:$13. Model:$40. Type:$8. Origin:$6. DriveTrain:$5. MSRP:DOLLAR8. Invoice:DOLLAR8. EngineSize:32. Cylinders:32. Horsepower:32. MPG_City:32. MPG_Highway:32. Weight:32. Wheelbase:32. Length:32.;
datalines4;
Acura,,SUV,Asia,All,"$36,945","$33,337",3.5,6,,17,23,4451,106,189
Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2,4,,24,31,2778,101,172
Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4,,22,29,3230,105,183
Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",,6,270,20,28,3575,108,186
Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",,6,225,18,24,3880,115,197
;;;;
run;
data _null_;
set demo_data end=last;
array _c_var {*} _character_;
array _n_var {*} _numeric_;
/* hash to collect variable stats */
if _n_=1 then
do;
length name $32 num_not_missing 8;
dcl hash h1(ordered:'y');
h1.defineKey('name');
h1.defineData('name','num_not_missing');
h1.defineDone();
end;
/* stats for char vars */
do i=1 to dim(_c_var);
name=vname(_c_var[i]);
num_not_missing=0;
if h1.find() ne 0 then h1.add();
if not missing(_c_var[i]) then
do;
num_not_missing= sum(num_not_missing, 1);
_rc=h1.replace();
end;
end;
/* stats for num vars */
do i=1 to dim(_n_var);
name=vname(_n_var[i]);
num_not_missing=0;
if h1.find() ne 0 then h1.add();
if not missing(_n_var[i]) then
do;
num_not_missing= sum(num_not_missing, 1);
_rc=h1.replace();
end;
end;
/* write stats to table want */
if last then h1.output(dataset:'want');
run;
proc print data=want noobs;
run;
First, it is possible to use PROC FREQ to generate a single dataset with all the variables and their frequencies. But instead of the OUTPUT statement, you have to use ODS OUTPUT:
proc format;
value nmis .='.' other='x';
value $cmis ' '='.' other='x';
run;
ods output onewayfreqs=freq (keep=table frequency f_:);
ods listing close;
proc freq data=demo_data ;
tables _all_ / missing ;
format _numeric_ nmis. _character_ $cmis.;
run;
ods listing;
ods output close;
But I think your program is more compact. And you can make it even more compact by dropping the PROC CONTENTS, and having the first proc sql read from dictionary.columns:
proc sql noprint;
select catx(" ", "count(", name, ") as ", name)
into :list separated by ', '
from dictionary.columns
where libname='WORK' and memname='DEMO_DATA';
quit;
And I suppose you can put the CREATE TABLE statement in the first PROC SQL step.
Thank you everyone for your responses, this was really helpful! I've some great ideas to go away and try now.
Adam
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.