SAS Procedures

bolore · Posted 08-23-2011 06:02 PM

Hi

Hope you can help.

I have a dataset that has about 800 variables and 100,000 rows. I want to find the number of missing values for each variable (column). I don't want a proc frequency table for each of the variables. I would like a table output with one column for variable and another column for the number of missing values for each variable.

eg.

variable name # missing values

var 1 3000

var 2 80

var 3 10000

...

var 800 5

Thanks

Bo

Reeza · Posted 08-23-2011 06:58 PM

In two steps using a proc mean and then transposing the results. This assumes you have no character variables. If you do, then a diff solution is needed.

proc means data=sashelp.class noprint;

var age height weight;

output out=try1 (drop= _type_ _freq_) nmiss=;

run;

proc transpose data=try1 out=try2 prefix=nmiss;

run;

MikeZdeb · Posted 08-23-2011 08:30 PM

Hey Art ... pretty sure PROC MEANS only accepts numeric variables in the VAR statement. I get this with your example ...

183 proc means data=have noprint;

184 var _all_;

ERROR: Variable var2 in list does not match type prescribed for this list.

185 output out=temp (drop= _type_ _freq_) nmiss=;

186 run;

How about (assumes there is a variable that uniquely identifies each observation ... here it's NREC) ...

data x;

input name : $10. age zip : $5. salary;

nrec+1;

datalines;

mike 64 12203 9999

sara . 20001 .

. 24 20001 10000

jessica . 90124 999

;

run;

options missing='';

proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);

by nrec;

var _all_;

run;

options missing='.';

proc freq data=new;

table _name_ / nocum nopercent;

run;

_NAME_ Frequency

age 2

name 1

salary 1

art297 · Posted 08-23-2011 09:10 PM

Mike,

You are absolutely, totally correct and I am deleting my previous response. You would think after all of these years I'd have learned to look at my log before posting

MikeZdeb · Posted 08-23-2011 09:44 PM

Hi ... I guess it's "log before you leap".

MikeZdeb · Posted 08-23-2011 08:54 PM

Hi ... closer to the data set in the original posting ... took about 25 seconds CPU (elapsed about the same) on a not too fast PC.

* fake data ... 100,000 obs / 600 numeric + 200 character vars

data x;

array nn(600);

array cc(200) $2.;

do id=1 to 1e5;

do k=1 to 200;

nn(k) = ifn(ranuni(999) le .05 , . , 10);

cc(k) = ifc(ranuni(999) le .05 , ' ' , '10');

end;

do k=201 to 600;

nn(k) = ifn(ranuni(999) le .05 , . , 10);

end;

output;

end;

keep id nn: cc: ;

run;

options missing='';

proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);

by id;

var _all_;

run;

options missing='.';

proc freq data=new order=freq;

table _name_ / nocum nopercent;

run;

some results ...

_NAME_ Frequency

nn314 5266

cc198 5230

nn124 5213

nn511 5185

cc182 5178

nn392 5169

nn3 5159

DLing · Posted 08-24-2011 03:03 PM

Building off of MikeZdeb's solution, just wanted to see the performance trade-off between PROC TRANSPOSE and the good old DATA STEP. This was run on an old Win XP machine with PC SAS 9.1.3 (stop laughing!) with 10,000 records and 800 variables...

21 options missing='';

22 proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);

23 by id;

24 var _all_;

25 run;

NOTE: Numeric variables in the input data set will be converted to character in the output data set.

NOTE: There were 10000 observations read from the data set WORK.X.

NOTE: The data set WORK.NEW has 399281 observations and 2 variables.

NOTE: PROCEDURE TRANSPOSE used (Total process time):

real time 2.42 seconds

cpu time 2.37 seconds

26

27 data new1(keep=__var);

28 set x;

29 array __num(*) _numeric_;

30 array __chr(*) _character_;

31

32 length __var $ 32;

33

34 do i=1 to hbound(__num,1);

35 if missing(__num(i)) then do;

36 __var = vname(__num(i));

37 output;

38 end;

39 end;

40

41 do i=1 to hbound(__chr,1);

42 if missing(__chr(i)) then do;

43 __var = vname(__chr(i));

44 output;

45 end;

46 end;

47 run;

NOTE: There were 10000 observations read from the data set WORK.X.

NOTE: The data set WORK.NEW1 has 399281 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 0.38 seconds

cpu time 0.39 seconds

DATA step runs considerably faster, but took more time to code. Is it clearer or more confusing than PROC TRANSPOSE is a matter of personal taste.

Howles · Posted 08-24-2011 04:09 PM

Here's a PROC FREQ approach that does only one pass through the data.

proc format ;

value miss_n low-high = 'N' other = 'Y' ;

value $miss_c ' ' = 'Y' other = 'N' ;

run ;

ods listing close ;

ods output OneWayFreqs=owf ;

proc freq data=x nlevels ;

format _numeric_ miss_n.

_character_ $miss_c. ;

tables _all_ / missing ;

run ;

ods output close ;

ods listing ;

data want (keep = Table Frequency) ;

set owf ;

if cats(of F_:) EQ 'Y' ;

table = substr( table , index(table,' ')+1 ) ;

run ;

Ksharp · Posted 08-24-2011 01:40 AM

It is very interesting. Although in the documentation , it said that nmiss() function is counted for numeric value,But I

found nmiss() is validated for both numeric and character in SQL.

data temp;
 set sashelp.class;
 if ranuni(0) le .4 then call missing(weight);
 if ranuni(0) ge .6 then call missing(name);
run;
proc sql noprint;
 select 'nmiss('||strip(name)||') as '||strip(name) 
  into : list separated by ','
   from dictionary.columns
    where libname='WORK' and memname='TEMP';

 create table have as
  select &list
   from temp;
quit;

proc transpose data=have out=want prefix=missing_value;
run;

Ksharp

bolore · Posted 08-24-2011 01:58 PM

Hi All,

Thanks a lot fo the replies, especially MikeZDeb - that was realy useful!

FriedEgg · Posted 08-24-2011 02:40 PM

proc format;

invalue blahfmt (fuzz=.09999999999)

.1 = .A

.2 = .B

.3 = .C

.4 = .D

.5 = .E

other = .;

run;

data _null_;

nmiss=0;

do i=1 to 1000;

blah=input(ranuni(1234),blahfmt.);

if blah=. then nmiss+1;

end;

putlog nmiss=;

run;

art297 · Posted 08-24-2011 10:43 PM

I would have thought that Howard's offering would be the quickest, but it ended up (at least in my tests) to be the slowest. The winner, I think, is Ksharp's proc sql offering and by a fairly large margin.

SAS Procedures

Counting Missing Values

Counting Missing Values

Re: Counting Missing Values

Re: Counting Missing Values

Re: Counting Missing Values

Re: Counting Missing Values

Re: Counting Missing Values

Re: Counting Missing Values

Re: Counting Missing Values

Counting Missing Values

Counting Missing Values

Re: Counting Missing Values

proc sql count non missing observations in variable by group

Swap rows with missing values with the ones without missing

Count number of missing dates within group

Conditionally remove missing values

Missing values after merging

Follow Us

What is...

SAS Procedures

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...