Help using Base SAS procedures

using SQL to categorize missing data

Reply
Occasional Contributor
Posts: 9

using SQL to categorize missing data

I have a very large dataset and am trying to get a sense of the missing data and how they are distributed. In particular I want to compare amount of missing data across groups. So I'd like to make a table with my groups as columns and my variables as rows, and then have the number of observations (frequencies and maybe percents also) that have missing variable x/y/z for each group. How can I do this? Ive made table with SQL before but dont know how to get it to extract numbers of missing variables.

Respected Advisor
Posts: 2,832

Re: using SQL to categorize missing data

[ Edited ]
Posted in reply to newsasuser

Well, it's not entirely clear what you are asking, but PROC FREQ will count missing values in a variable, and it can do this by groups.

 

There's really no need to use PROC SQL to count missings.

--
Paige Miller
Occasional Contributor
Posts: 9

Re: using SQL to categorize missing data

Posted in reply to PaigeMiller

Sorry I wasnt clear. Im shooting for something like this:

 

 

Group a

Group b

Group c

Variable x

10% missing

12% missing

15% missing

Variable y

20% missing

22% missing

25% missing

Variable z

30% missing

32% missing

35% missing

Super User
Posts: 23,319

Re: using SQL to categorize missing data

Posted in reply to newsasuser

https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111

 

The approach above is similar but you need to add in the GROUP to the output. Either way, FREQ is a better option.

Occasional Contributor
Posts: 9

Re: using SQL to categorize missing data

Thanks! At which point(s) do I need to add the group? Is it in the proc freq and then it will carry over in the data steps?

Highlighted
Esteemed Advisor
Posts: 5,482

Re: using SQL to categorize missing data

Posted in reply to newsasuser

You can get a decent table with relatively little effort using proc tabulate:

 

title "Number missing";
proc tabulate data=sashelp.heart format=5.0;
class sex smoking_status;
var ageatstart--Cholesterol;
table (ageatstart--Cholesterol), nmiss="Number Missing"*smoking_status*sex="";
run;
 ---------------------------------------------------------------------------------
 |                   |                      Number Missing                       |
 |                   |-----------------------------------------------------------|
 |                   |                      Smoking Status                       |
 |                   |-----------------------------------------------------------|
 |                   | Heavy (16-|           |Moderate (6|           |Very Heavy |
 |                   |    25)    |Light (1-5)|    15)    |Non-smoker |  (> 25)   |
 |                   |-----------+-----------+-----------+-----------+-----------|
 |                   |Fema-|     |Fema-|     |Fema-|     |Fema-|     |Fema-|     |
 |                   | le  |Male | le  |Male | le  |Male | le  |Male | le  |Male |
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Age at Start       |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Height             |    0|    2|    2|    0|    0|    0|    2|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Weight             |    0|    0|    1|    0|    1|    0|    1|    1|    1|    1|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Diastolic          |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Systolic           |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Metropolitan       |     |     |     |     |     |     |     |     |     |     |
 |Relative Weight    |    0|    0|    1|    0|    1|    0|    1|    1|    1|    1|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Smoking            |    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Age at Death       |  232|  371|  311|   81|  236|  127| 1146|  464|   42|  192|
 |-------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
 |Cholesterol        |    8|    9|   12|    4|    9|    4|   52|   13|    5|    8|
 ---------------------------------------------------------------------------------
PG
Super User
Posts: 6,637

Re: using SQL to categorize missing data

Posted in reply to PaigeMiller

Here's a quick and dirty program.  It doesn't give you exactly what you asked for, but it's close.  It might be so easy that it's "close enough".  It requires that your group variable is character, not numeric.

 

proc tabulate data=have;

class groupvar / missing;

var _numeric_;

tables _numeric_, (n nmiss) * f=comma9. * groupvar=' ';

run;

 

Super User
Posts: 13,338

Re: using SQL to categorize missing data

[ Edited ]
Posted in reply to newsasuser

How many variables are you concerned with having missing values? Are they all character, numeric or a mix of both (one approach for this works well but only for numeric with missing values).

Since this sounds a lot like doing the exact same thing for many variables then the ARRAY functionality of a data step is much simpler than the repetitive coding in Proc SQL: missing(a.var) = var may not be much but if you have hundreds of variables like some of the folks on this forum...  And the proc tabulate approach where we have a variable to hold the variable name, and the value in a 1/0 coding to display is going to be a tad easier.

Here's a generic approach for generic data:

data want;
   set have;
   array n <list of numeric variables you are concerned about missing>;
   array c <list of character variables you are concened about missing>;
   /* two arrays as you can't have an array mixing character and numeric*/
   length varname $ 32.;
   do i = 1 to dim(n);
      varname = vname(n[i]);
      missing = missing(vname(n[i]));
      output;
   end;
   do i = 1 to dim(c);
      varname = vname(c[i]);
      missing = missing(vname(c[i]));
      output;
   end;
   keep <group variables> varname missing;
run;

/* this will create one column of output summarizing the missing status 
   of the variables in the arrays above.
   If you just want membership of any value in group of variables
   then you need to create the groups, either with explicit assignments
   in the data set above or possibly formats on the groups variables.
   Any thing with a "group" that uses values from two or more variables
   will need to have an explicit assignment in the data set such as 
   GroupA = var1 in (1,2,3) and var2 = 'B';
*/
proc tabulate data=want;
   class <group variables>/missing;
   class varname;
   var missing;
   tables varname='Variable'*missing=''*mean=''*f=percent8.1,
          <group variables>
   ;
run;

 

Ask a Question
Discussion stats
  • 7 replies
  • 266 views
  • 3 likes
  • 6 in conversation