SAS-questioner
Obsidian | Level 7

Currently, I have about 20 categorical variables (X1 to X20) and one binary outcome variable (Y). First, I used the chi-square test to check the association between each of X1 to X20 and Y, one at a time. Then I used PROC LOGISTIC with stepwise selection to select the important predictors and examine the effects of X1-X20 on Y jointly. However, when I ran the logistic regression, SAS reported that there were no valid observations. Because of the large number of missing values, no observation is complete once X1-X20 are all included together. The missing rates of X1-X20 range from 20% to 94%.

 

Maybe I should try to impute the missing values using PROC MI, but the examples I can find are all for continuous variables.

 

Do you think I should just stop at the one-by-one chi-square tests? Or should I check the missing patterns first and then try to impute the missing values? If I should impute, does anyone know how to examine the missing patterns and impute categorical variables? Thank you!
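
For readers following along, here is a minimal sketch of the two steps described in the question: the one-at-a-time chi-square tests, and a count of complete cases, which shows why the joint model finds no valid observations. The data set name SURVEY is hypothetical; the variable names follow the post.

```sas
/* Hypothetical data set SURVEY with X1-X20 and Y. */

/* One two-way table and chi-square test per predictor.          */
/* PROC FREQ drops missing values table by table, so each test   */
/* uses every observation that has both that X and Y.            */
proc freq data=survey;
   tables (x1-x20)*y / chisq;
run;

/* Count observations with no missing value in any of X1-X20 or Y. */
/* CMISS works for both numeric and character variables.           */
data _null_;
   set survey end=last;
   if cmiss(of x1-x20, y) = 0 then n_complete + 1;
   if last then put 'Complete cases: ' n_complete;
run;
```

PROC LOGISTIC uses only complete cases, so if the count written to the log is 0, the "no valid observations" message is expected.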

15 REPLIES
SAS_Rob
SAS Employee
I think most statisticians would question the usefulness of a predictor that is 94% missing.
That being said, if the goal is to use model selection, then multiple imputation isn’t really an option. Because it creates multiple imputed data sets, it is likely that different models will be selected for different data sets. This means you will not be able to combine the results into a single set of parameter estimates.
If you want to impute categorical variables, then you can use the DISCRIM or LOGISTIC methods in either the FCS or MONOTONE statement of PROC MI. This section of the documentation might be helpful:
https://documentation.sas.com/doc/en/statug/15.2/statug_mi_details05.htm
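
A minimal sketch of the FCS LOGISTIC approach mentioned above, assuming a hypothetical data set SURVEY in which Y and X1-X5 are categorical (only five predictors are shown to keep it short). The data set names, seed, number of imputations, and number of burn-in iterations are illustrative and not from the thread. The logistic method's default treats a classification variable as ordinal, which may or may not suit the five-category items.

```sas
/* Sketch only: SURVEY, SURVEY_MI, the seed, and NIMPUTE=5 are assumptions. */
proc mi data=survey nimpute=5 seed=12345 out=survey_mi;
   class y x1-x5;                     /* all variables are categorical          */
   fcs nbiter=20 logistic( / details); /* logistic method for every CLASS variable
                                          in the VAR list                        */
   var y x1-x5;
run;
```

The output data set SURVEY_MI stacks the imputed copies and identifies them with the variable _Imputation_. As noted in the reply, running stepwise selection separately on each copy may select different models, so this sketch covers imputation only.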
SAS-questioner
Obsidian | Level 7
Thank you for the reply! But there are many predictors. Do you think I should stop at the chi-square test step? Or should I try to put all the predictors in the model and use the method that you suggest?
ballardw
Super User

What exactly do your X1 through X20 represent?

SAS-questioner
Obsidian | Level 7
Thank you for the reply. All of those variables are survey items with yes/no or five-category response options.
ballardw
Super User

@SAS-questioner wrote:
Thank you for the reply. All of those variables are survey items with yes/no or five-category response options.

If the "missing" results come from skip patterns in the survey questions, i.e., if question 1 is answered no (or yes) then question 2 is skipped, then you have dependency issues that may cause regression problems. Also, those values are missing for a known reason, so you could assign a special category to handle that conditionality.

 

And what kind of sample design was used in the survey? If the sample design is complex, such as a stratified sample, then you should be using the survey procedures for analysis to correctly use any weights assigned.

SAS-questioner
Obsidian | Level 7
There are no skip patterns in the survey questions. Respondents were supposed to answer all questions, but I don't know why there are so many missing items. It's a simple design.
Reeza
Super User
I would combine the approaches here.
Drop rows with more than 80% missing and drop columns with more than 80% missing, and see where that leaves you.

You'll need to do this both ways I suspect.

Also, examine why there are so many missing values. Are they missing at random or systematically?

https://blogs.sas.com/content/iml/2016/04/18/patterns-of-missing-data-in-sas.html
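
One way to sketch the checks suggested above without PROC MI is to build 0/1 missing indicators in a DATA step. This assumes a hypothetical data set SURVEY in which X1-X20 are all numeric or all character; the names MISS_IND, SURVEY_ROWS, M1-M20, PATTERN, and N_MISSING are made up for illustration.

```sas
/* Sketch only: data set and new variable names are assumptions. */
data miss_ind;
   set survey;
   array xs {*} x1-x20;               /* existing items, all of one type       */
   array m {20} m1-m20;               /* missing indicators                    */
   length pattern $20;
   do i = 1 to dim(xs);
      m{i} = missing(xs{i});          /* 1 = missing, 0 = observed             */
   end;
   n_missing = sum(of m1-m20);        /* missing items per respondent          */
   pattern   = cats(of m1-m20);       /* one 0/1 character per item            */
   drop i;
run;

/* The mean of a 0/1 indicator is the missing rate of that column. */
proc means data=miss_ind mean maxdec=2;
   var m1-m20;
run;

/* Missing patterns and per-row counts, most frequent first. */
proc freq data=miss_ind order=freq;
   tables pattern n_missing;
run;

/* Keep respondents with at most 80% of the 20 items missing. */
data survey_rows;
   set miss_ind;
   if n_missing / 20 <= 0.8;
run;
```

Columns whose rate in the PROC MEANS output exceeds 0.80 can then be removed with a DROP statement. Because removing rows changes the column rates and vice versa, it is worth trying both orders.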
SAS-questioner
Obsidian | Level 7

Hi Reeza, thank you for the reply. So I don't need to stop at the chi-square step. I can drop the rows and columns with more than 80% missing and fit a logistic regression on the rest of the data, right?

 

Yes, I actually found the link you posted, but when I tried the PROC MI code from it, I had to add a CLASS statement and an FCS or MONOTONE statement. In this case, can I still examine the missing pattern?

SAS-questioner
Obsidian | Level 7
Thank you for the reply. I tried to check the missing pattern using the procedure listed in the blog, but SAS kept reporting errors. First, I needed to add a CLASS statement. Then I needed to add an FCS or MONOTONE statement. After I added FCS, SAS said there are no continuous variables in the VAR list to impute with FCS methods. Does PROC MI not work when all the variables are categorical?
SAS_Rob
SAS Employee
I assume that the ERROR message you are receiving is the following:
ERROR: The CLASS variables cannot be used as covariates in an FCS discriminant method with the default CLASSEFFECTS=EXCLUDE option.
The solution is to specify the CLASSEFFECTS=INCLUDE option for the DISCRIM method in the FCS statement.
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/statug_mi_syntax05.htm#statug.mi.fc...
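
A minimal sketch of that fix, assuming a hypothetical data set SURVEY with categorical Y and X1-X5. The data set names, seed, and iteration counts are illustrative only.

```sas
/* Sketch only: SURVEY, SURVEY_MI, the seed, and NIMPUTE=5 are assumptions. */
proc mi data=survey nimpute=5 seed=12345 out=survey_mi;
   class y x1-x5;
   /* CLASSEFFECTS=INCLUDE lets the other CLASS variables serve as    */
   /* covariates when each categorical variable is imputed.           */
   fcs nbiter=20 discrim( / classeffects=include);
   var y x1-x5;
run;
```

With the default CLASSEFFECTS=EXCLUDE, the discriminant method needs continuous covariates, which is why it fails when every variable in the VAR list is categorical.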
Rick_SAS
SAS Super FREQ

You might try to use a decision tree instead of logistic regression. Logistic regression drops the entire observation if any variable is missing. A decision tree doesn't. See the PROC HPSPLIT documentation at https://documentation.sas.com/doc/en/statug/15.2/statug_hpsplit_examples01.htm
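
A minimal sketch of the decision tree suggestion, assuming a hypothetical data set SURVEY with categorical Y and X1-X20. The growing and pruning choices are illustrative, and ASSIGNMISSING=POPULAR is set explicitly here as one of several ways the procedure can route missing values; it is not a recommendation from the thread.

```sas
/* Sketch only: SURVEY and the option choices below are assumptions. */
proc hpsplit data=survey assignmissing=popular;
   class y x1-x20;            /* outcome and predictors are categorical   */
   model y = x1-x20;
   grow entropy;              /* splitting criterion                      */
   prune costcomplexity;      /* cost-complexity pruning                  */
run;
```

The variable importance table in the output gives a rough ranking of the predictors, which can stand in for stepwise selection when complete cases are scarce.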

 

Ksharp
Super User

You could also try partial least squares regression (PROC PLS), which can handle/impute missing values and report variable importance.

 

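/* Example on SASHELP.CLASS; MISSING=EM imputes missing values with the EM algorithm */
proc pls data=sashelp.class missing=em nfac=2 plots=(ParmProfiles VIP) details; /* optional: cv=split cvtest(seed=12345) */
   class sex;
   model age = weight height sex;
   /* optional: output out=x predicted=p; */
run;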
SAS-questioner
Obsidian | Level 7
Does it handle the binary outcome?
