BookmarkSubscribeRSS Feed
bolore
Calcite | Level 5

Hi

Hope you can help.

I have a dataset that has about 800 variables and 100,000 rows.  I want to find the number of missing values for each variable (column).  I don't want a proc frequency table for each of the variables.  I would like a table output with one column for variable and another column for the number of missing values for each variable.

eg.

variable name          # missing values

var 1                               3000

var 2                               80

var 3                              10000

...

var 800                    5

Thanks

Bo

11 REPLIES 11
Reeza
Super User

In two steps using a proc mean and then transposing the results. This assumes you have no character variables. If you do, then a diff solution is needed.

proc means data=sashelp.class noprint;

var age height weight;

output out=try1 (drop= _type_ _freq_) nmiss=;

run;

proc transpose data=try1 out=try2 prefix=nmiss;

run;

MikeZdeb
Rhodochrosite | Level 12

Hey Art ... pretty sure  PROC MEANS only accepts numeric variables in the VAR statement.  I get this with your example ...

183  proc means data=have noprint;

184    var _all_;

ERROR: Variable var2 in list does not match type prescribed for this list.

185    output out=temp (drop= _type_ _freq_) nmiss=;

186  run;

How about (assumes there is a variable that uniquely identifies each observation ... here it's NREC) ...

data x;

input name : $10. age zip : $5. salary;

nrec+1;

datalines;

mike 64 12203 9999

sara . 20001 .

. 24 20001 10000

jessica . 90124 999

;

run;

options missing='';

proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);

by nrec;

var _all_;

run;

options missing='.';

proc freq data=new;

table _name_ / nocum nopercent;

run;

_NAME_    Frequency

age              2

name             1

salary           1


art297
Opal | Level 21

Mike,

You are absolutely, totally correct and I am deleting my previous response.  You would think after all of these years I'd have learned to look at my log before posting Smiley Happy

MikeZdeb
Rhodochrosite | Level 12

Hi ... I guess it's "log before you leap".

MikeZdeb
Rhodochrosite | Level 12

Hi ... closer to the data set in the original posting ... took about 25 seconds CPU (elapsed about the same) on a not too fast PC.

* fake data ... 100,000 obs / 600 numeric + 200 character vars

data x;

array nn(600);

array cc(200) $2.;

do id=1 to 1e5;

do k=1 to 200;

   nn(k) = ifn(ranuni(999) le .05 , . , 10);

   cc(k) = ifc(ranuni(999) le .05 , ' ' , '10');

end;

do k=201 to 600;

   nn(k) = ifn(ranuni(999) le .05 , . , 10);

end;

output;

end;

keep id nn: cc: ;

run;

options missing='';

proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);

by id;

var _all_;

run;

options missing='.';

proc freq data=new order=freq;

table _name_ / nocum nopercent;

run;

some results ...

_NAME_    Frequency

nn314         5266

cc198         5230

nn124         5213

nn511         5185

cc182         5178

nn392         5169

nn3           5159


DLing
Obsidian | Level 7

Building off of MikeZdeb's solution, just wanted to see the performance trade-off between PROC TRANSPOSE and the good old DATA STEP.  This was run on an old Win XP machine with PC SAS 9.1.3 (stop laughing!) with 10,000 records and 800 variables...

21   options missing='';

22   proc transpose data=x out=new (where=(missing(col1)) keep=_name_ col1);

23       by id;

24       var _all_;

25   run;

NOTE: Numeric variables in the input data set will be converted to character in the output data set.

NOTE: There were 10000 observations read from the data set WORK.X.

NOTE: The data set WORK.NEW has 399281 observations and 2 variables.

NOTE: PROCEDURE TRANSPOSE used (Total process time):

      real time           2.42 seconds

      cpu time            2.37 seconds

26

27   data new1(keep=__var);

28       set x;

29       array __num(*) _numeric_;

30       array __chr(*) _character_;

31

32       length __var $ 32;

33

34       do i=1 to hbound(__num,1);

35           if missing(__num(i)) then do;

36               __var = vname(__num(i));

37               output;

38           end;

39       end;

40

41       do i=1 to hbound(__chr,1);

42           if missing(__chr(i)) then do;

43               __var = vname(__chr(i));

44               output;

45           end;

46       end;

47   run;

NOTE: There were 10000 observations read from the data set WORK.X.

NOTE: The data set WORK.NEW1 has 399281 observations and 1 variables.

NOTE: DATA statement used (Total process time):

      real time           0.38 seconds

      cpu time            0.39 seconds

DATA step runs considerably faster, but took more time to code.  Is it clearer or more confusing than PROC TRANSPOSE is a matter of personal taste.

Howles
Quartz | Level 8

Here's a PROC FREQ approach that does only one pass through the data.

proc format ;

value  miss_n low-high = 'N' other = 'Y' ;

value $miss_c      ' ' = 'Y' other = 'N' ;

run ;

ods listing close ;

ods output OneWayFreqs=owf ;

proc freq data=x nlevels ;

format _numeric_    miss_n.

       _character_ $miss_c. ;

tables _all_ / missing ;

run ;

ods output close ;

ods listing ;

data want (keep = Table Frequency) ;

set owf ;

if cats(of F_:) EQ 'Y' ;

table = substr( table , index(table,' ')+1 ) ;

run ;

Ksharp
Super User

It is very interesting. Although in the documentation , it said that nmiss() function is counted for numeric value,But I

found nmiss() is validated for both numeric and character in SQL.

data temp;
 set sashelp.class;
 if ranuni(0) le .4 then call missing(weight);
 if ranuni(0) ge .6 then call missing(name);
run;
proc sql noprint;
 select 'nmiss('||strip(name)||') as '||strip(name) 
  into : list separated by ','
   from dictionary.columns
    where libname='WORK' and memname='TEMP';

 create table have as
  select &list
   from temp;
quit;

proc transpose data=have out=want prefix=missing_value;
run;

Ksharp

bolore
Calcite | Level 5

Hi All,

Thanks a lot fo the replies, especially MikeZDeb - that was realy useful!

FriedEgg
SAS Employee

proc format;

invalue blahfmt (fuzz=.09999999999)

  .1 = .A

  .2 = .B

  .3 = .C

  .4 = .D

  .5 = .E

  other = .;

run;

data _null_;

nmiss=0;

do i=1 to 1000;

  blah=input(ranuni(1234),blahfmt.);

  if blah=. then nmiss+1;

end;

putlog nmiss=;

run;

art297
Opal | Level 21

I would have thought that Howard's offering would be the quickest, but it ended up (at least in my tests) to be the slowest.  The winner, I think, is Ksharp's proc sql offering and by a fairly large margin.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1836 views
  • 3 likes
  • 8 in conversation