@Steelers_In_DC's solution could be modified as follows to work also for character variables which do not contain "numeric" values:
However, I have never tried to create a dataset with 14 million variables (as TRAN would be).
The code I have works with numeic and character fields. In the example I use both.
@Steelers_In_DC: Yes, you have both, but they all contain only "numeric" values, i.e. values which SAS can convert automatically into numeric values when it applies the NMISS function. If you replace one of the values by a text string, such as 'TEST', a note on "Invalid numeric data" will occur in the log and that text string will be incorrectly counted as missing (namely the missing numeric value resulting from the failed attempt to convert 'TEST' into a number).
The CMISS function can handle both numeric and character values. But since the PROC TRANSPOSE step creates only character variables COL1, COL2, ..., it would count the periods '.' from the numeric missings as non-missing (character strings). This is avoided by changing the MISSING system option as suggested. Special missing values would be counted as non-missing, though.
The NLEVELS of Proc FREQ does not exactly do what you need. Had another go using the Hash object in the DATA Step.
Have a look
data have;
infile cards dlm="," dsd;
input
RAWREG
PVPRISM $
OVRIDE3 $
DAYS
;
cards;
01,0939,,408
,0943,,379
09,0029,,170
11,0948,,
,0948,,59
04,6108,,57
06,0993,,59
06,6111,,59
06,0993,,220
04,0901,,59
,,3,
;
data _null_;
set have end=last;
array _xchar{*} _character_;
array _xnum{*} _numeric_;
if _n_ = 1 then do;
length
_colName $ 32
_nMiss 8
_n 8
;
declare hash hmiss( ordered: "Y");
hmiss.defineKey("_colName");
hmiss.defineData("_colName", "_n", "_nMiss");
hmiss.defineDone();
end;
do i = 1 to dim(_xchar);
_nMiss=0;
_n=0;
_colName = vname( _xchar{i} );
_rc = hmiss.find();
_nMiss + (missing(_xchar{i}) = 1);
_n + (missing(_xchar{i}) = 0);
if _rc = 0 then do;
hmiss.replace();
end;
else do;
hmiss.add();
end;
end;
do i = 1 to dim(_xnum);
_nMiss=0;
_n=0;
_colName = vname( _xnum{i} );
_rc = hmiss.find();
_nMiss + (missing(_xnum{i}) = 1);
_n + (missing(_xnum{i}) = 0);
if _rc = 0 then do;
hmiss.replace();
end;
else do;
hmiss.add();
end;
end;
if last = 1 then do;
hmiss.output(dataset: "hashmiss");
end;
run;
proc print data=hashmiss;
run;
proc print data=have;
run;
Bruno
Bruno, I have no idea what half your code means, but I'm runing it now. So far no errors! I will let you know how it works out.
Thanks!!
Here's the code I had before with Reinhard's correction, learn something everyday:
data have;
infile cards dsd;
input RAWREG$ PVPRISM$ OVRIDE3 DAYS;
cards;
01,0939,,408
,0943,,379
09,0029,,170
11,0948,,
,0948,,59
04,6108,,57
06,0993,,59
06,6111,,59
06,0993,,220
04,0901,,59
aa,aaaa,,5
.,.,,5
;
proc sql;
create table get_max as
select *,count(*) as max_obs
from have;
options missing=' ';
proc transpose data=get_max out=tran;by max_obs;var _all_;
data want (keep=Variable missing non_missing);
set tran(rename=(_name_ = Variable));
Missing=cmiss(of col: );
Non_Missing=max_obs - missing;
if variable in ('max_obs') then delete;
run;
proc sort data=want;by Variable;
@BrunoMueller Thank you! Your code worked perfectly and exactly what I needed. I took about 45 min for 858 variables and over 4mill observations. Not bad!!
You are welcome
I am sure the code can be optimized to go faster.
The first optimization would be to let Proc MEANS/SUMMARY do the counting for all the numeric variables and have the DATA Step code only for the character variables, this will for sure speed up things. See the code sample provided by @PaigeMiller
Give it a try.
Bruno
I feel a need to jump back in, to create something simple and practical here.
First, note that PROC FREQ with a format would be the easiest program. It does create a separate table for each variable (which you didn't really want), but the tradeoff is that it is quick and easy.
Given that you have a large data set, here is a reasonable approach.
First, count how many character variables you have. (That's the complex part.)
data _null_;
set have;
array chars {*} _character_;
call symputx ('n_chars', dim(chars));
stop;
run;
Then use that number in the coding.
data _null_;
set have end=done nobs=_nobs_;
array chars {&n_chars} _character_;
array nonmiss {&n_chars};
do _n_=1 to &n_chars;
if chars{_n_}>' ' then nonmiss{_n_} + 1;
end;
if done;
file print;
put 'Variable' @35 'N missing' @47 'N nonmissing' / 60 * '-';
length varname $32;
do _n_=1 to &n_chars;
varname = vname(chars{_n_});
n_nonmissing = max(0, nonmiss{_n_});
n_missing = _nobs_ - n_nonmissing;
put varname @35 n_missing comma12. @47 n_nonmissing comma12.;
end;
run;
You process your large data set only once, you don't need to do any expensive transformations with it, and you don't even need to save the results ... the report is generated within the same DATA step that does the counting.
The code is untested, so you might want to test it (possibly debug it) with a small number of observations.
Good luck.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.