I have a dataset with several categorical variables. I'd like to produce chi square tests on some of the variables. One of the variables, race, has several levels:
The level 'unknown' cannot be included in the chi square test, but must be included in the total population of interest, which is ~85,000.
I've tried the following code:
proc format;
  value $plan 'var1','var2' = 'var1+2'
              'var3','var4','var5' = 'var3-5';
run;
proc freq data=popstudy;
  tables (race sex age)*(var1) / chisq;
  format var1 $plan.;
  where race not= ('Unknown') ;
run;
This reduces the total population to ~60,000, which is not what I need.
I've also tried subsetting the dataset and producing another dataset that removes all values of 'Unknown' race, but this reduces the population as the code above does and is not correct.
My supervisor says I need to rename the 'Unknown' level as equal to 99999 or something similar, and include a "where" statement in my proc freq statement, but I am not sure what they mean by this.
First of all, what do you think this statement does:
where race not= ('Unknown') ;
You aren't going to get a chi-square based on the full count when you drop a level of a variable or when the variable has missing values. It just does not work that way. Chi-square uses cell counts: a record with no usable value for one of the two variables in a table is not counted in that table.
If your N must be 85,000 in the calculations that use race, then you need to include the 'unknown' records one way or another. The simplest is to keep 'unknown' as a level of race. Another is to "impute" values for race, but with a category missing for nearly 30% of your records I would be very leery of that.
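As a rough sketch of the simplest option, assuming the dataset and variable names from the question (popstudy, race, var1, and the $plan format), dropping the WHERE statement keeps 'Unknown' as an ordinary level of race:

```sas
/* Sketch only: no WHERE statement, so 'Unknown' stays in the table */
/* as its own row and every record with nonmissing race and var1    */
/* is counted.                                                      */
proc freq data=popstudy;
  tables race*var1 / chisq;
  format var1 $plan.;
run;
```

Note that the chi-square then tests association across all levels of race, including 'Unknown', which may or may not be the question you were asked to answer.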
I think you need to discuss this requirement with your supervisor a bit more. The approach you sort of outline only replaces 'unknown' with '99999', so it doesn't help by itself, and any WHERE statement that excludes records is going to reduce your n.
You could set race to a SAS missing value instead of "unknown", but that will still reduce the n for tables involving the race variable. However, it would not reduce the n for the statistics involving age or sex.
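A minimal sketch of the missing-value approach, assuming race is a character variable whose unknown level is spelled exactly 'Unknown' (the comparison is case-sensitive). The names popstudy_recode and race_known are made up for illustration and are not from the original post:

```sas
/* Hypothetical names: popstudy_recode and race_known. */
data popstudy_recode;
  set popstudy;
  race_known = race;                          /* takes type and length from race */
  if race = 'Unknown' then race_known = ' ';  /* blank = character missing value */
run;

proc freq data=popstudy_recode;
  /* MISSPRINT shows missing levels in the displayed tables but leaves */
  /* them out of the percentages and the chi-square statistics.        */
  tables (race_known sex age)*var1 / chisq missprint;
  format var1 $plan.;
run;
```

With this setup the race_known*var1 statistics are computed only from records with a known race, while the sex*var1 and age*var1 tables still use every record that has nonmissing values for those variables. Replacing MISSPRINT with the MISSING option would instead treat the missing level as a valid category and include it in the statistics.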
What specific question are you supposed to be answering with this chi-sqr?