Solved: Count missing values of 100+ variables in with column output - Page 2

FreelanceReinh · Posted 11-18-2015 02:51 PM

@Steelers_In_DC's solution could be modified as follows to work also for character variables which do not contain "numeric" values:

Insert options missing=' '; before the PROC TRANSPOSE step.
Replace nmiss by cmiss in the last data step.

However, I have never tried to create a dataset with 14 million variables (as TRAN would be).

Steelers_In_DC · Posted 11-18-2015 02:58 PM

The code I have works with numeic and character fields. In the example I use both.

FreelanceReinh · Posted 11-18-2015 03:44 PM

@Steelers_In_DC: Yes, you have both, but they all contain only "numeric" values, i.e. values which SAS can convert automatically into numeric values when it applies the NMISS function. If you replace one of the values by a text string, such as 'TEST', a note on "Invalid numeric data" will occur in the log and that text string will be incorrectly counted as missing (namely the missing numeric value resulting from the failed attempt to convert 'TEST' into a number).

The CMISS function can handle both numeric and character values. But since the PROC TRANSPOSE step creates only character variables COL1, COL2, ..., it would count the periods '.' from the numeric missings as non-missing (character strings). This is avoided by changing the MISSING system option as suggested. Special missing values would be counted as non-missing, though.

BrunoMueller · Posted 11-18-2015 02:04 PM

The NLEVELS of Proc FREQ does not exactly do what you need. Had another go using the Hash object in the DATA Step.

Have a look

data have;
  infile cards dlm="," dsd;
  input
    RAWREG 
    PVPRISM $
    OVRIDE3 $
    DAYS 
  ;
cards;
01,0939,,408
,0943,,379
09,0029,,170
11,0948,,
,0948,,59
04,6108,,57
06,0993,,59
06,6111,,59
06,0993,,220
04,0901,,59
,,3,
;
data _null_;
  set have end=last;
  array _xchar{*} _character_;
  array _xnum{*} _numeric_;

  if _n_ = 1 then do;
    length 
      _colName $ 32
      _nMiss 8
      _n 8
    ;
    declare hash hmiss( ordered: "Y");
    hmiss.defineKey("_colName");
    hmiss.defineData("_colName", "_n", "_nMiss");
    hmiss.defineDone();
  end;

  do i = 1 to dim(_xchar);
    _nMiss=0;
    _n=0;
    _colName = vname( _xchar{i} );
    _rc = hmiss.find();
    _nMiss + (missing(_xchar{i}) = 1);
    _n + (missing(_xchar{i}) = 0);

    if _rc = 0 then do;
      hmiss.replace();
    end;
    else do;
      hmiss.add();
    end;
  end;

  do i = 1 to dim(_xnum);
    _nMiss=0;
    _n=0;
    _colName = vname( _xnum{i} );
    _rc = hmiss.find();
    _nMiss + (missing(_xnum{i}) = 1);
    _n + (missing(_xnum{i}) = 0);

    if _rc = 0 then do;
      hmiss.replace();
    end;
    else do;
      hmiss.add();
    end;

  end;

  if last = 1 then do;
    hmiss.output(dataset: "hashmiss");
  end;
run;

proc print data=hashmiss;
run;

proc print data=have;
run;

Bruno

jenim514 · Posted 11-18-2015 04:27 PM

Bruno, I have no idea what half your code means, but I'm runing it now. So far no errors! I will let you know how it works out.

Thanks!!

Steelers_In_DC · Posted 11-18-2015 04:32 PM

Here's the code I had before with Reinhard's correction, learn something everyday:

data have;
infile cards dsd;
input RAWREG$ PVPRISM$ OVRIDE3 DAYS;
cards;
01,0939,,408
,0943,,379
09,0029,,170
11,0948,,
,0948,,59
04,6108,,57
06,0993,,59
06,6111,,59
06,0993,,220
04,0901,,59
aa,aaaa,,5
.,.,,5
;

proc sql;
create table get_max as
select *,count(*) as max_obs
from have;

options missing=' ';
proc transpose data=get_max out=tran;by max_obs;var _all_;

data want (keep=Variable missing non_missing);
set tran(rename=(_name_ = Variable));
Missing=cmiss(of col: );
Non_Missing=max_obs - missing;
if variable in ('max_obs') then delete;
run;

proc sort data=want;by Variable;

jenim514 · Posted 11-19-2015 10:45 AM

@BrunoMueller Thank you! Your code worked perfectly and exactly what I needed. I took about 45 min for 858 variables and over 4mill observations. Not bad!!

BrunoMueller · Posted 11-19-2015 11:07 AM

You are welcome

I am sure the code can be optimized to go faster.

The first optimization would be to let Proc MEANS/SUMMARY do the counting for all the numeric variables and have the DATA Step code only for the character variables, this will for sure speed up things. See the code sample provided by @PaigeMiller

Give it a try.

Bruno

Astounding · Posted 11-18-2015 07:03 PM

I feel a need to jump back in, to create something simple and practical here.

First, note that PROC FREQ with a format would be the easiest program. It does create a separate table for each variable (which you didn't really want), but the tradeoff is that it is quick and easy.

Given that you have a large data set, here is a reasonable approach.

First, count how many character variables you have. (That's the complex part.)

data _null_;

set have;

array chars {*} _character_;

call symputx ('n_chars', dim(chars));

stop;

run;

Then use that number in the coding.

data _null_;

set have end=done nobs=_nobs_;

array chars {&n_chars} _character_;

array nonmiss {&n_chars};

do _n_=1 to &n_chars;

if chars{_n_}>' ' then nonmiss{_n_} + 1;

end;

if done;

file print;

put 'Variable' @35 'N missing' @47 'N nonmissing' / 60 * '-';

length varname $32;

do _n_=1 to &n_chars;

varname = vname(chars{_n_});

n_nonmissing = max(0, nonmiss{_n_});

n_missing = _nobs_ - n_nonmissing;

put varname @35 n_missing comma12. @47 n_nonmissing comma12.;

end;

run;

You process your large data set only once, you don't need to do any expensive transformations with it, and you don't even need to save the results ... the report is generated within the same DATA step that does the counting.

The code is untested, so you might want to test it (possibly debug it) with a small number of observations.

Good luck.

Re: Count missing values of 100+ variables in with column output

Re: Count missing values of 100+ variables in with column output

Re: Count missing values of 100+ variables in with column output

Re: Count missing values of 100+ variables in with column output

Re: Count missing values of 100+ variables in with column output

Re: Count missing values of 100+ variables in with column output

Re: Count missing values of 100+ variables in with column output

Re: Count missing values of 100+ variables in with column output

Re: Count missing values of 100+ variables in with column output

Registration is open

SAS Training: Just a Click Away