- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi
Hope you can help.
I have a dataset that has about 800 variables and 100,000 rows. I want to find the number of missing values for each variable (column). I don't want a proc frequency table for each of the variables. I would like a table output with one column for variable and another column for the number of missing values for each variable.
eg.
variable name # missing values
var 1 3000
var 2 80
var 3 10000
...
var 800 5
Thanks
Bo
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In two steps using a proc mean and then transposing the results. This assumes you have no character variables. If you do, then a diff solution is needed.
proc means data=sashelp.class noprint;
var age height weight;
output out=try1 (drop= _type_ _freq_) nmiss=;
run;
proc transpose data=try1 out=try2 prefix=nmiss;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hey Art ... pretty sure PROC MEANS only accepts numeric variables in the VAR statement. I get this with your example ...
183 proc means data=have noprint;
184 var _all_;
ERROR: Variable var2 in list does not match type prescribed for this list.
185 output out=temp (drop= _type_ _freq_) nmiss=;
186 run;
How about (assumes there is a variable that uniquely identifies each observation ... here it's NREC) ...
data x;
input name : $10. age zip : $5. salary;
nrec+1;
datalines;
mike 64 12203 9999
sara . 20001 .
. 24 20001 10000
jessica . 90124 999
;
run;
options missing='';
proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);
by nrec;
var _all_;
run;
options missing='.';
proc freq data=new;
table _name_ / nocum nopercent;
run;
_NAME_ Frequency
age 2
name 1
salary 1
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Mike,
You are absolutely, totally correct and I am deleting my previous response. You would think after all of these years I'd have learned to look at my log before posting
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi ... I guess it's "log before you leap".
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi ... closer to the data set in the original posting ... took about 25 seconds CPU (elapsed about the same) on a not too fast PC.
* fake data ... 100,000 obs / 600 numeric + 200 character vars
data x;
array nn(600);
array cc(200) $2.;
do id=1 to 1e5;
do k=1 to 200;
nn(k) = ifn(ranuni(999) le .05 , . , 10);
cc(k) = ifc(ranuni(999) le .05 , ' ' , '10');
end;
do k=201 to 600;
nn(k) = ifn(ranuni(999) le .05 , . , 10);
end;
output;
end;
keep id nn: cc: ;
run;
options missing='';
proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);
by id;
var _all_;
run;
options missing='.';
proc freq data=new order=freq;
table _name_ / nocum nopercent;
run;
some results ...
_NAME_ Frequency
nn314 5266
cc198 5230
nn124 5213
nn511 5185
cc182 5178
nn392 5169
nn3 5159
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Building off of MikeZdeb's solution, just wanted to see the performance trade-off between PROC TRANSPOSE and the good old DATA STEP. This was run on an old Win XP machine with PC SAS 9.1.3 (stop laughing!) with 10,000 records and 800 variables...
21 options missing='';
22 proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);
23 by id;
24 var _all_;
25 run;
NOTE: Numeric variables in the input data set will be converted to character in the output data set.
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.NEW has 399281 observations and 2 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 2.42 seconds
cpu time 2.37 seconds
26
27 data new1(keep=__var);
28 set x;
29 array __num(*) _numeric_;
30 array __chr(*) _character_;
31
32 length __var $ 32;
33
34 do i=1 to hbound(__num,1);
35 if missing(__num(i)) then do;
36 __var = vname(__num(i));
37 output;
38 end;
39 end;
40
41 do i=1 to hbound(__chr,1);
42 if missing(__chr(i)) then do;
43 __var = vname(__chr(i));
44 output;
45 end;
46 end;
47 run;
NOTE: There were 10000 observations read from the data set WORK.X.
NOTE: The data set WORK.NEW1 has 399281 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.38 seconds
cpu time 0.39 seconds
DATA step runs considerably faster, but took more time to code. Is it clearer or more confusing than PROC TRANSPOSE is a matter of personal taste.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's a PROC FREQ approach that does only one pass through the data.
proc format ;
value miss_n low-high = 'N' other = 'Y' ;
value $miss_c ' ' = 'Y' other = 'N' ;
run ;
ods listing close ;
ods output OneWayFreqs=owf ;
proc freq data=x nlevels ;
format _numeric_ miss_n.
_character_ $miss_c. ;
tables _all_ / missing ;
run ;
ods output close ;
ods listing ;
data want (keep = Table Frequency) ;
set owf ;
if cats(of F_:) EQ 'Y' ;
table = substr( table , index(table,' ')+1 ) ;
run ;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It is very interesting. Although in the documentation , it said that nmiss() function is counted for numeric value,But I
found nmiss() is validated for both numeric and character in SQL.
data temp; set sashelp.class; if ranuni(0) le .4 then call missing(weight); if ranuni(0) ge .6 then call missing(name); run; proc sql noprint; select 'nmiss('||strip(name)||') as '||strip(name) into : list separated by ',' from dictionary.columns where libname='WORK' and memname='TEMP'; create table have as select &list from temp; quit; proc transpose data=have out=want prefix=missing_value; run;
Ksharp
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
Thanks a lot fo the replies, especially MikeZDeb - that was realy useful!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc format;
invalue blahfmt (fuzz=.09999999999)
.1 = .A
.2 = .B
.3 = .C
.4 = .D
.5 = .E
other = .;
run;
data _null_;
nmiss=0;
do i=1 to 1000;
blah=input(ranuni(1234),blahfmt.);
if blah=. then nmiss+1;
end;
putlog nmiss=;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I would have thought that Howard's offering would be the quickest, but it ended up (at least in my tests) to be the slowest. The winner, I think, is Ksharp's proc sql offering and by a fairly large margin.