12-18-2015 11:06 AM
I do a large amount of research survey analysis. Often (too often), we get responses which are invalid. By invalid, I mean they
show up as raw numbers rather than the formatted value in PROC FREQ.
Here is an example :
proc format;
value fclass 1 = 'Freshman'
2 = 'Sophomore'
3 = 'Junior'
4 = 'Senior';
run;
This format is applied to a numeric variable which should have values between 1 and 4 :
format class fclass.;
What ultimately happens is that respondents enter numbers outside the allowable range defined in the PROC FORMAT step.
When I run PROC FREQ on this variable, those out-of-range values show up as raw, unformatted numbers.
Is there an easy way to automatically set the erroneous unformatted values (i.e. 5 and 6 in this example) to missing numeric values?
What I've been doing is manually programming blocks of code for every question that has one or more invalid values :
if class GT 4 then class=.;
Is there an easier way to do this? Survey respondents sometimes enter copious amounts of invalid answers...
Thanks in advance.
12-18-2015 11:11 AM
You can set an Other in the format, i.e.
proc format;
value fclass
1 = 'Freshman'
2 = 'Sophomore'
3 = 'Junior'
4 = 'Senior'
other = 'Other';
run;
Or you can set the label for OTHER to a blank character string, whichever fits your need best.
12-18-2015 11:16 AM
Two basic approaches, depending on how concerned anyone is about out-of-range values.
The first is to use an informat that matches your expected values. For your example:
1,2,3,4 = _same_
other = .
And read the data with that informat. If you have an Excel file, save to CSV and read that to have control.
One advantage of the informat approach is that surveys often have many questions with the same coding scheme, and you can use the same informat for all of the questions that share that scheme.
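A minimal sketch of the informat approach described above. The informat name (inclass), the data set name (survey), and the variables id and q1-q3 are made up for illustration, and inline DATALINES stand in for the CSV file you would normally read:

```sas
/* Hypothetical informat: valid codes 1-4 pass through unchanged, */
/* anything else is read as a numeric missing value.             */
proc format;
   invalue inclass
      1, 2, 3, 4 = _same_
      other      = .;
run;

/* q1-q3 are assumed to share the same 1-4 coding scheme, */
/* so one INFORMAT statement covers all three questions.  */
data survey;
   informat q1-q3 inclass.;
   input id q1-q3;
   datalines;
101 1 4 5
102 2 6 3
103 3 3 4
;
run;

/* Out-of-range codes (5 and 6 above) should now count as missing. */
proc freq data=survey;
   tables q1-q3 / nocum;
run;
```

Because the cleaning happens as the data is read, no per-question IF statements are needed afterwards. The existing display format (fclass.) can still be attached to the variables for the labels.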
12-18-2015 11:37 AM
This is probably a good time for you to learn about the advanced features of PROC SUMMARY. This example uses the EXCLUSIVE and PRELOADFMT options to achieve the result using only the subset of values implied by the VALUE format. Note also the use of the COMPLETETYPES option. I'll leave it to you to research what that option does when you remove the "*" from line three of the program.
data class;
   input class freq @@;
   *if class eq 2 then delete;
   cards;
1 4 2 11 3 23 4 13 5 1 6 1
;;;;
   run;
proc print;
   run;
proc format;
   value class 1='Freshman' 2='Sophomore' 3='Junior' 4='Senior';
   run;
proc summary data=class nway completetypes;
   class class / exclusive preloadfmt;
   freq freq;
   format class class.;
   output out=counts(drop=_type_);
   run;
proc print;
   run;
proc freq;
   tables class / nocum;
   weight _freq_;
   run;