I have a very large dataset and am trying to get a sense of the missing data and how they are distributed. In particular I want to compare amount of missing data across groups. So I'd like to make a table with my groups as columns and my variables as rows, and then have the number of observations (frequencies and maybe percents also) that have missing variable x/y/z for each group. How can I do this? Ive made table with SQL before but dont know how to get it to extract numbers of missing variables.
Well, it's not entirely clear what you are asking, but PROC FREQ will count missing values in a variable, and it can do this by groups.
There's really no need to use PROC SQL to count missings.
Sorry I wasnt clear. Im shooting for something like this:
| Group a | Group b | Group c |
Variable x | 10% missing | 12% missing | 15% missing |
Variable y | 20% missing | 22% missing | 25% missing |
Variable z | 30% missing | 32% missing | 35% missing |
https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111
The approach above is similar but you need to add in the GROUP to the output. Either way, FREQ is a better option.
Thanks! At which point(s) do I need to add the group? Is it in the proc freq and then it will carry over in the data steps?
You can get a decent table with relatively little effort using proc tabulate:
title "Number missing";
proc tabulate data=sashelp.heart format=5.0;
class sex smoking_status;
var ageatstart--Cholesterol;
table (ageatstart--Cholesterol), nmiss="Number Missing"*smoking_status*sex="";
run;
--------------------------------------------------------------------------------- | | Number Missing | | |-----------------------------------------------------------| | | Smoking Status | | |-----------------------------------------------------------| | | Heavy (16-| |Moderate (6| |Very Heavy | | | 25) |Light (1-5)| 15) |Non-smoker | (> 25) | | |-----------+-----------+-----------+-----------+-----------| | |Fema-| |Fema-| |Fema-| |Fema-| |Fema-| | | | le |Male | le |Male | le |Male | le |Male | le |Male | |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Age at Start | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Height | 0| 2| 2| 0| 0| 0| 2| 0| 0| 0| |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Weight | 0| 0| 1| 0| 1| 0| 1| 1| 1| 1| |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Diastolic | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Systolic | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Metropolitan | | | | | | | | | | | |Relative Weight | 0| 0| 1| 0| 1| 0| 1| 1| 1| 1| |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Smoking | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Age at Death | 232| 371| 311| 81| 236| 127| 1146| 464| 42| 192| |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----| |Cholesterol | 8| 9| 12| 4| 9| 4| 52| 13| 5| 8| ---------------------------------------------------------------------------------
Here's a quick and dirty program. It doesn't give you exactly what you asked for, but it's close. It might be so easy that it's "close enough". It requires that your group variable is character, not numeric.
proc tabulate data=have;
class groupvar / missing;
var _numeric_;
tables _numeric_, (n nmiss) * f=comma9. * groupvar=' ';
run;
How many variables are you concerned with having missing values? Are they all character, numeric or a mix of both (one approach for this works well but only for numeric with missing values).
Since this sounds a lot like doing the exact same thing for many variables then the ARRAY functionality of a data step is much simpler than the repetitive coding in Proc SQL: missing(a.var) = var may not be much but if you have hundreds of variables like some of the folks on this forum... And the proc tabulate approach where we have a variable to hold the variable name, and the value in a 1/0 coding to display is going to be a tad easier.
Here's a generic approach for generic data:
data want; set have; array n <list of numeric variables you are concerned about missing>; array c <list of character variables you are concened about missing>; /* two arrays as you can't have an array mixing character and numeric*/ length varname $ 32.; do i = 1 to dim(n); varname = vname(n[i]); missing = missing(vname(n[i])); output; end; do i = 1 to dim(c); varname = vname(c[i]); missing = missing(vname(c[i])); output; end; keep <group variables> varname missing; run; /* this will create one column of output summarizing the missing status of the variables in the arrays above. If you just want membership of any value in group of variables then you need to create the groups, either with explicit assignments in the data set above or possibly formats on the groups variables. Any thing with a "group" that uses values from two or more variables will need to have an explicit assignment in the data set such as GroupA = var1 in (1,2,3) and var2 = 'B'; */ proc tabulate data=want; class <group variables>/missing; class varname; var missing; tables varname='Variable'*missing=''*mean=''*f=percent8.1, <group variables> ; run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.