Hi SAS friends,
is there a way to run a frequency on a variable without displaying every level for that variable? For example i have a variable called Phone for which i have 1M observations. I don't want to know the number of occurrences of phone i just want to know if its missing. If i run the normal proc freq with the missing option it takes 20+minutes to get results.
I'm running SAS 9.4 with STAT.
Thank you,
data BigDS;
do i = 1 to 1000000;
PhoneNums=INT((RanUni(07022014)*10000000000));
MissRate=RanUni(1234);
if MissRate>=.8 then call missing(PhoneNums);
keep PhoneNums;
output;
end;
run;
proc freq data = BigDS;
table PhoneNums/missing;
run;
Outside of the SQL world, there are two standard ways to accomplish this. First, you could switch to PROC MEANS:
proc means data=have n nmiss;
var phonenums;
run;
The advantage here is that you can easily add many variables to the VAR statement.
To use PROC FREQ, apply a format that groups all nonmissing values:
proc format;
value present low-high='Present' other='Missing';
run;
Then use the format:
proc freq data=have;
tables phonenum / missing;
format phonenum present.;
run;
Only testing will tell you which is fastest.
You could do a proc sort with nodupkey. Or alternatively use:
proc sql;
create table WANT as
select count(1)
from HAVE
where PHONENUMS is null;
quit;
As for freq, I don't use it so don't know.
proc sql;
select nmiss(phonenums) from bigds;
quit;
Outside of the SQL world, there are two standard ways to accomplish this. First, you could switch to PROC MEANS:
proc means data=have n nmiss;
var phonenums;
run;
The advantage here is that you can easily add many variables to the VAR statement.
To use PROC FREQ, apply a format that groups all nonmissing values:
proc format;
value present low-high='Present' other='Missing';
run;
Then use the format:
proc freq data=have;
tables phonenum / missing;
format phonenum present.;
run;
Only testing will tell you which is fastest.
Astounding is there a low-high equivalent for character variables?
For character variables, the range low-high does not help because the hex code for a blank falls in the middle of the range of possible values. Instead, this format would do the trick:
proc format;
value $present ' '='Missing' other='Present';
run;
Your original program might also pay a penalty for creating large ODS output, depending on your settings. If running in SAS Enterprise Guide or in SAS Display Manager with default settings, you're creating a large HTML or SAS Report output. The creation/transfer of that file would slow down the overall process.
has good suggestions for how to reduce the "counting" part of the process, making it much more efficient. But your original approach would run pretty fast if you output just a data set, and not any report.
proc freq data = BigDS noprint;
table PhoneNums/missing out=miss_count;
run;
Or optionally, to declutter the output:
proc freq data = BigDS noprint;
table PhoneNums/missing out=miss_count(where=(PhoneNums=.));
run;
Chris
I have been using the following in my autoexec file for the last 12 years or so:
PROC FORMAT /*LIB=BBBANAL.SEGFORMATS*/ ;
VALUE MISSING .="Missing" Other="Not Missing" ;
VALUE $MISSING ' '="Missing" Other="Not Missing" ;
RUN ;
Thank you everyone for your input. i like asking these questions because i always learn multiple ways to accomplish something.
If you just want to know where there are ANY missing values then use the NLEVELS output from proc FREQ.
proc freq data = BigDS nlevels ;
table PhoneNums / noprint;
run;
Tom that is awesome! can you route the results of a Nlevels table to a dataset?
ODS OUTPUT.
But watch because if there are no missing values then the variable NMissLevels will not even be on the output dataset.
ods output nlevels=nlevels;
ods exclude nlevels;
proc freq data = BigDS nlevels ;
table PhoneNums / noprint;
run;
1722 data _null_; set nlevels; put (_all_) (=/); run;
TableVar=PhoneNums
NLevels=799696
NMissLevels=1
NNonMissLevels=799695
Tom, how does this produce a table? I'm also trying to save the value from running nlevels so I can merge it in with other data. I just need the one count variable produced with nlevels, but when I try to save the output it saves the underlying individual level data, not the nlevels value.
The ODS OUTPUT statement tells it to save the output that the NLEVELS option produces as a dataset.
The ODS EXCLUDE statement tells it to not produce any printed output for the NLEVELS option.
The NOPRINT option on the TABLES statement tells it not to produce any printed output for the variable frequencies.
So all that it produces is the dataset with the number of levels (and number of missing levels if any).
Hi ... if you have a number of variables (NUM, CHAR, or both types), you can try the method in this paper (it uses NLEVELS)...
An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step
http://www.lexjansen.com/nesug/nesug11/ds/ds12.pdf
Using your data step to create 3 numeric variables ...
data x;
do i = 1 to 1000000;
Phone1=INT((RanUni(07022014)*10000000000));
phone2 = phone1;
phone3 = phone2;
if RanUni(1234)>=.8 then call missing(Phone1);
if RanUni(1234)>=.9 then call missing(Phone2);
if RanUni(1234)>=.4 then call missing(Phone3);
output;
end;
keep Phone: ;
run;
CPU time was < 1 second to produce a report ...
Obs var miss p_miss ok p_ok
1 Phone1 199,849 20.0 800,151 80.0
2 phone2 99,747 10.0 900,253 90.0
3 phone3 600,976 60.1 399,024 39.9
You can also use PROC MI to look for patterns in missing numeric data if you have two or more numeric variables (takes longer than the NLEVELS approach) ...
proc mi data=x;
ods select misspattern;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.