This seems like it should be quite easy to do, but I haven't found a way yet. I have a large file (millions of records) with names, dates of birth, addresses, etc, and I want to know the missing/non-missing percentages for each of these variables. I don't want to run proc freq, as this will give me a huge list of every name in the dataset.
My work-around so far has been to recode these types of variables into dichotomous variables (ie, 0=missing and 1=non-missing), and then run freqs. But that's time consuming and tedious. Isn't there an option in proc freq or another procedure to do this automatically?
Hello @Wolverine,
@Wolverine wrote:
I don't want to run proc freq, as this will give me a huge list of every name in the dataset.
This doesn't happen if you apply a suitable format to the variables in question such as
proc format;
value miss
._-.z = 'missing'
other = 'non-missing';
value $miss
' ' = 'missing'
other = 'non-missing';
run;
proc freq data=sashelp.heart;
format _numeric_ miss. _char_ $miss.;
tables _all_ / missing;
run;
Hello @Wolverine,
@Wolverine wrote:
I don't want to run proc freq, as this will give me a huge list of every name in the dataset.
This doesn't happen if you apply a suitable format to the variables in question such as
proc format;
value miss
._-.z = 'missing'
other = 'non-missing';
value $miss
' ' = 'missing'
other = 'non-missing';
run;
proc freq data=sashelp.heart;
format _numeric_ miss. _char_ $miss.;
tables _all_ / missing;
run;
Here's a variant on the program from @FreelanceReinh that cleans up the display into a few nice options.
/*This program creates a report with the number and percent of
missing data for each variable in the data set.
The ony change should be to the macro variable, INPUT_DSN.
Author: F. Khurshed
Date: 2019-01-04
*/
*create sample data to work with;
data class;
set sashelp.class;
if age=14 then
call missing(height, weight, sex);
if name='Alfred' then
call missing(sex, age, height);
label age="Fancy Age Label";
run;
*set input data set name;
%let INPUT_DSN = class;
%let OUTPUT_DSN = want;
*create format for missing;
proc format;
value $ missfmt ' '="Missing" other="Not Missing";
value nmissfmt .="Missing" other="Not Missing";
run;
*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp;
proc freq data=&INPUT_DSN.;
table _all_ / missing;
format _numeric_ nmissfmt. _character_ $missfmt.;
run;
ods select all;
*Format output;
data long;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);
Variable_Value=strip(trim(vvaluex(variable)));
presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
keep variable variable_value frequency percent cum: presentation;
label variable='Variable' variable_value='Variable Value';
run;
proc sort data=long;
by variable;
run;
*make it a wide data set for presentation, with values as N (Percent);
proc transpose data=long out=wide_presentation (drop=_name_);
by variable;
id variable_value;
var presentation;
run;
*transpose only N;
proc transpose data=long out=wide_N prefix=N_;
by variable;
id variable_value;
var frequency;
run;
*transpose only percents;
proc transpose data=long out=wide_PCT prefix=PCT_;
by variable;
id variable_value;
var percent;
run;
*final output file;
data &Output_DSN.;
merge wide_N wide_PCT wide_presentation;
by variable;
drop _name_;
label N_Missing='# Missing' N_Not_Missing='# Not Missing'
PCT_Missing='% Missing' N_Not_Missing='% Not Missing' Missing='Missing'
Not_missing='Not Missing';
run;
title "Missing Report of &INPUT_DSN.";
proc print data=&output_dsn. noobs label;
run;
Change the macro variables in the code and re-run it on your own data set to get your own results.
It should be pretty fast.
*change these to what you need in code above;
%let INPUT_DSN = class;
%let OUTPUT_DSN = want;
@Wolverine wrote:
This seems like it should be quite easy to do, but I haven't found a way yet. I have a large file (millions of records) with names, dates of birth, addresses, etc, and I want to know the missing/non-missing percentages for each of these variables. I don't want to run proc freq, as this will give me a huge list of every name in the dataset.
My work-around so far has been to recode these types of variables into dichotomous variables (ie, 0=missing and 1=non-missing), and then run freqs. But that's time consuming and tedious. Isn't there an option in proc freq or another procedure to do this automatically?
%let lib=sashelp;
%let dsn=heart;
data _null_;
set sashelp.vcolumn(where=(libname=upcase("&lib") and memname=upcase("&dsn"))) end=last;
if _n_=1 then call execute('proc sql;create table want as select ');
call execute(cat("nmiss(",name,")/(select count(*) from &lib..&dsn ) as ",name," format=percent8.2"));
if last then call execute("from &lib..&dsn ;quit;" );
else call execute(',');
run;
I accepted FreelanceReinhard's solution because it's simple and easy to apply, and it serves my purposes quite well for this particular data. But I will make a note of Reeza's macro, as I anticipate it could be useful in the future for larger/more complex datasets. Thanks!
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.