I have about 20 categorical variables (X1 to X20) and one binary outcome variable (Y). First, I used chi-square tests to check the association between each of X1 to X20 and Y, one at a time. Then I used PROC LOGISTIC with stepwise selection to pick the important predictors and examine the effect of X1-X20 on Y in a multivariable model. However, when I ran the logistic regression, SAS reported that there were no valid observations. Because of the large amount of missing data, no observation has complete values once X1-X20 are all in the model together. The missing rates of X1-X20 range from 20% to 94%.
Maybe I should try to impute the missing values using PROC MI, but all the examples I can find are for continuous variables.
Do you think I should just stop at the one-by-one chi-square tests? Or should I check the missing-data patterns first and then try to impute the missing values? If so, does anyone know how to examine the missing patterns and impute categorical variables? Thank you!
What exactly do your X1 through X20 represent?
@SAS-questioner wrote:
Thank you for the reply. All of those variables are survey items with either yes/no or 5-category response options.
If the "missing" results come from skip patterns in the survey (for example, if question 1 is answered no (or yes), then question 2 is skipped), you have dependencies among the items that may cause problems in the regression. Those values also have a known cause of missingness, so they could be assigned a special category to handle that conditionality.
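As a rough illustration of the special-category idea, the sketch below recodes a skipped item into its own level. Everything specific here is an assumption for the example: the data set name WORK.HAVE, character-valued items, the rule that X2 is only asked when X1 is 'Yes', and the new variable name X2_cat.

```sas
/* Hypothetical: WORK.HAVE holds character items X1 and X2,      */
/* and X2 is skipped by design whenever X1 = 'No'.               */
data want;
  set have;
  length X2_cat $12;
  if X1 = 'No' and missing(X2) then X2_cat = 'Skipped'; /* known, structural missing */
  else if missing(X2) then X2_cat = ' ';                /* true nonresponse stays missing */
  else X2_cat = X2;                                     /* keep the observed answer */
run;

/* Check the recode against the gate question */
proc freq data=want;
  tables X1*X2_cat / missing;
run;
```

With the skip coded as a real level, those respondents are no longer dropped for X2, although X2_cat is then fully determined by X1 for them, which is the dependency mentioned above.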
And what kind of sample design was used in the survey? If the sample design is complex, such as a stratified sample, then you should be using the survey procedures for analysis to correctly use any weights assigned.
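If the design does turn out to be complex, a survey-weighted logistic model might look something like the sketch below. The design variable names (stratum, psu, samp_wt), the data set name, and the 0/1 coding of Y are placeholders, not taken from the actual survey.

```sas
/* Hypothetical design variables: stratum, psu, samp_wt */
proc surveylogistic data=have;
  strata  stratum;                  /* stratification variable      */
  cluster psu;                      /* primary sampling unit        */
  weight  samp_wt;                  /* sampling weight              */
  class   X1 X2 / param=ref;        /* categorical survey items     */
  model   Y(event='1') = X1 X2;     /* assumes Y is coded 0/1       */
run;
```

Omit the STRATA or CLUSTER statement if the design does not use that feature.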
Hi Reeza, thank you for the reply. So I don't need to stop at the chi-square step. I can drop the rows and columns that are more than 80% missing and run logistic regression on the rest of the data, right?
Yes, I had actually found the link you posted, but when I tried the PROC MI code from it, I had to add a CLASS statement and either an FCS or a MONOTONE statement. In that case, can I still examine the missing-data patterns?
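For reference, one way to look at the missingness without going through PROC MI at all is a plain PROC FREQ summary. This is only a sketch: the data set name WORK.HAVE, the format names, and the variable n_missing are made up for the example, and it does not impute anything.

```sas
/* Formats that collapse every value into Missing / Not missing */
proc format;
  value $missfmt ' ' = 'Missing' other = 'Not missing';
  value  missfmt  .  = 'Missing' other = 'Not missing';
run;

/* Missing rate for each item (works for character or numeric items) */
proc freq data=have;
  format _character_ $missfmt. _numeric_ missfmt.;
  tables X1-X20 / missing nocum;
run;

/* Number of missing items per respondent; n_missing = 0 are complete cases */
data miss_count;
  set have;
  n_missing = cmiss(of X1-X20);
run;

proc freq data=miss_count;
  tables n_missing;
run;
```

If no row has n_missing = 0, that matches the "no valid observations" message from PROC LOGISTIC, since it needs complete cases across all model variables.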
You might try to use a decision tree instead of logistic regression. Logistic regression drops the entire observation if any variable is missing. A decision tree doesn't. See the PROC HPSPLIT documentation at https://documentation.sas.com/doc/en/statug/15.2/statug_hpsplit_examples01.htm
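A minimal sketch of such a tree, modeled loosely on the linked documentation example, could look like this. The data set name WORK.HAVE, the 0/1 coding of Y, the 30% validation fraction, and the seed are all assumptions for illustration.

```sas
/* Hypothetical: WORK.HAVE with binary Y (coded 0/1) and items X1-X20 */
proc hpsplit data=have;
  class Y X1-X20;                               /* target and categorical inputs  */
  model Y(event='1') = X1-X20;
  prune costcomplexity;                         /* cost-complexity pruning        */
  partition fraction(validate=0.3 seed=12345);  /* hold out 30% for validation    */
run;
```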
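You could also try partial least squares regression (PROC PLS), which can handle/impute missing values and report variable importance.
/* Example on SASHELP.CLASS (data set assumed; the original post used data=class). */
/* MISSING=EM imputes missing values with the EM algorithm; the VIP plot shows variable importance. */
proc pls data=sashelp.class missing=em nfac=2 plots=(ParmProfiles VIP) details; * optional: cv=split cvtest(seed=12345);
  class sex;
  model age = weight height sex;
  * output out=x predicted=p;
run;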