BookmarkSubscribeRSS Feed
newsasuser
Calcite | Level 5

I have a very large dataset and am trying to get a sense of the missing data and how they are distributed. In particular I want to compare amount of missing data across groups. So I'd like to make a table with my groups as columns and my variables as rows, and then have the number of observations (frequencies and maybe percents also) that have missing variable x/y/z for each group. How can I do this? Ive made table with SQL before but dont know how to get it to extract numbers of missing variables.

7 REPLIES 7
PaigeMiller
Diamond | Level 26

Well, it's not entirely clear what you are asking, but PROC FREQ will count missing values in a variable, and it can do this by groups.

 

There's really no need to use PROC SQL to count missings.

--
Paige Miller
newsasuser
Calcite | Level 5

Sorry I wasnt clear. Im shooting for something like this:

 

 

Group a

Group b

Group c

Variable x

10% missing

12% missing

15% missing

Variable y

20% missing

22% missing

25% missing

Variable z

30% missing

32% missing

35% missing

Reeza
Super User

https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111

 

The approach above is similar but you need to add in the GROUP to the output. Either way, FREQ is a better option.

newsasuser
Calcite | Level 5

Thanks! At which point(s) do I need to add the group? Is it in the proc freq and then it will carry over in the data steps?

PGStats
Opal | Level 21

You can get a decent table with relatively little effort using proc tabulate:

 

title "Number missing";
proc tabulate data=sashelp.heart format=5.0;
class sex smoking_status;
var ageatstart--Cholesterol;
table (ageatstart--Cholesterol), nmiss="Number Missing"*smoking_status*sex="";
run;
 ---------------------------------------------------------------------------------
 |                   |                      Number Missing                       |
 |                   |-----------------------------------------------------------|
 |                   |                      Smoking Status                       |
 |                   |-----------------------------------------------------------|
 |                   | Heavy (16-|           |Moderate (6|           |Very Heavy |
 |                   |    25)    |Light (1-5)|    15)    |Non-smoker |  (> 25)   |
 |                   |-----------+-----------+-----------+-----------+-----------|
 |                   |Fema-|     |Fema-|     |Fema-|     |Fema-|     |Fema-|     |
 |                   | le  |Male | le  |Male | le  |Male | le  |Male | le  |Male |
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Age at Start       |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Height             |    0|    2|    2|    0|    0|    0|    2|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Weight             |    0|    0|    1|    0|    1|    0|    1|    1|    1|    1|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Diastolic          |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Systolic           |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Metropolitan       |     |     |     |     |     |     |     |     |     |     |
 |Relative Weight    |    0|    0|    1|    0|    1|    0|    1|    1|    1|    1|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Smoking            |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Age at Death       |  232|  371|  311|   81|  236|  127| 1146|  464|   42|  192|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Cholesterol        |    8|    9|   12|    4|    9|    4|   52|   13|    5|    8|
 ---------------------------------------------------------------------------------
PG
Astounding
PROC Star

Here's a quick and dirty program.  It doesn't give you exactly what you asked for, but it's close.  It might be so easy that it's "close enough".  It requires that your group variable is character, not numeric.

 

proc tabulate data=have;

class groupvar / missing;

var _numeric_;

tables _numeric_, (n nmiss) * f=comma9. * groupvar=' ';

run;

 

ballardw
Super User

How many variables are you concerned with having missing values? Are they all character, numeric or a mix of both (one approach for this works well but only for numeric with missing values).

Since this sounds a lot like doing the exact same thing for many variables then the ARRAY functionality of a data step is much simpler than the repetitive coding in Proc SQL: missing(a.var) = var may not be much but if you have hundreds of variables like some of the folks on this forum...  And the proc tabulate approach where we have a variable to hold the variable name, and the value in a 1/0 coding to display is going to be a tad easier.

Here's a generic approach for generic data:

data want;
   set have;
   array n <list of numeric variables you are concerned about missing>;
   array c <list of character variables you are concened about missing>;
   /* two arrays as you can't have an array mixing character and numeric*/
   length varname $ 32.;
   do i = 1 to dim(n);
      varname = vname(n[i]);
      missing = missing(vname(n[i]));
      output;
   end;
   do i = 1 to dim(c);
      varname = vname(c[i]);
      missing = missing(vname(c[i]));
      output;
   end;
   keep <group variables> varname missing;
run;

/* this will create one column of output summarizing the missing status 
   of the variables in the arrays above.
   If you just want membership of any value in group of variables
   then you need to create the groups, either with explicit assignments
   in the data set above or possibly formats on the groups variables.
   Any thing with a "group" that uses values from two or more variables
   will need to have an explicit assignment in the data set such as 
   GroupA = var1 in (1,2,3) and var2 = 'B';
*/
proc tabulate data=want;
   class <group variables>/missing;
   class varname;
   var missing;
   tables varname='Variable'*missing=''*mean=''*f=percent8.1,
          <group variables>
   ;
run;

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1478 views
  • 3 likes
  • 6 in conversation