Re: using SQL to categorize missing data

newsasuser · Posted 02-05-2018 02:43 PM

I have a very large dataset and am trying to get a sense of the missing data and how they are distributed. In particular I want to compare amount of missing data across groups. So I'd like to make a table with my groups as columns and my variables as rows, and then have the number of observations (frequencies and maybe percents also) that have missing variable x/y/z for each group. How can I do this? Ive made table with SQL before but dont know how to get it to extract numbers of missing variables.

PaigeMiller · Posted 02-05-2018 03:10 PM

Well, it's not entirely clear what you are asking, but PROC FREQ will count missing values in a variable, and it can do this by groups.

There's really no need to use PROC SQL to count missings.

--
Paige Miller

newsasuser · Posted 02-05-2018 03:13 PM

Sorry I wasnt clear. Im shooting for something like this:

	Group a	Group b	Group c
Variable x	10% missing	12% missing	15% missing
Variable y	20% missing	22% missing	25% missing
Variable z	30% missing	32% missing	35% missing

Reeza · Posted 02-05-2018 03:40 PM

https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111

The approach above is similar but you need to add in the GROUP to the output. Either way, FREQ is a better option.

newsasuser · Posted 02-05-2018 03:44 PM

Thanks! At which point(s) do I need to add the group? Is it in the proc freq and then it will carry over in the data steps?

PGStats · Posted 02-05-2018 04:09 PM

You can get a decent table with relatively little effort using proc tabulate:

title "Number missing";
proc tabulate data=sashelp.heart format=5.0;
class sex smoking_status;
var ageatstart--Cholesterol;
table (ageatstart--Cholesterol), nmiss="Number Missing"*smoking_status*sex="";
run;

 ---------------------------------------------------------------------------------
 |                   |                      Number Missing                       |
 |                   |-----------------------------------------------------------|
 |                   |                      Smoking Status                       |
 |                   |-----------------------------------------------------------|
 |                   | Heavy (16-|           |Moderate (6|           |Very Heavy |
 |                   |    25)    |Light (1-5)|    15)    |Non-smoker |  (> 25)   |
 |                   |-----------+-----------+-----------+-----------+-----------|
 |                   |Fema-|     |Fema-|     |Fema-|     |Fema-|     |Fema-|     |
 |                   | le  |Male | le  |Male | le  |Male | le  |Male | le  |Male |
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Age at Start       |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Height             |    0|    2|    2|    0|    0|    0|    2|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Weight             |    0|    0|    1|    0|    1|    0|    1|    1|    1|    1|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Diastolic          |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Systolic           |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Metropolitan       |     |     |     |     |     |     |     |     |     |     |
 |Relative Weight    |    0|    0|    1|    0|    1|    0|    1|    1|    1|    1|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Smoking            |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Age at Death       |  232|  371|  311|   81|  236|  127| 1146|  464|   42|  192|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Cholesterol        |    8|    9|   12|    4|    9|    4|   52|   13|    5|    8|
 ---------------------------------------------------------------------------------

PG

Astounding · Posted 02-05-2018 03:55 PM

Here's a quick and dirty program. It doesn't give you exactly what you asked for, but it's close. It might be so easy that it's "close enough". It requires that your group variable is character, not numeric.

proc tabulate data=have;

class groupvar / missing;

var _numeric_;

tables _numeric_, (n nmiss) * f=comma9. * groupvar=' ';

run;

ballardw · Posted 02-05-2018 03:38 PM

How many variables are you concerned with having missing values? Are they all character, numeric or a mix of both (one approach for this works well but only for numeric with missing values).

Since this sounds a lot like doing the exact same thing for many variables then the ARRAY functionality of a data step is much simpler than the repetitive coding in Proc SQL: missing(a.var) = var may not be much but if you have hundreds of variables like some of the folks on this forum... And the proc tabulate approach where we have a variable to hold the variable name, and the value in a 1/0 coding to display is going to be a tad easier.

Here's a generic approach for generic data:

data want;
   set have;
   array n <list of numeric variables you are concerned about missing>;
   array c <list of character variables you are concened about missing>;
   /* two arrays as you can't have an array mixing character and numeric*/
   length varname $ 32.;
   do i = 1 to dim(n);
      varname = vname(n[i]);
      missing = missing(vname(n[i]));
      output;
   end;
   do i = 1 to dim(c);
      varname = vname(c[i]);
      missing = missing(vname(c[i]));
      output;
   end;
   keep <group variables> varname missing;
run;

/* this will create one column of output summarizing the missing status 
   of the variables in the arrays above.
   If you just want membership of any value in group of variables
   then you need to create the groups, either with explicit assignments
   in the data set above or possibly formats on the groups variables.
   Any thing with a "group" that uses values from two or more variables
   will need to have an explicit assignment in the data set such as 
   GroupA = var1 in (1,2,3) and var2 = 'B';
*/
proc tabulate data=want;
   class <group variables>/missing;
   class varname;
   var missing;
   tables varname='Variable'*missing=''*mean=''*f=percent8.1,
          <group variables>
   ;
run;

SAS Innovate 2025: Call for Content

Classroom Training Available!