Hi,
I am conducting a multinomial regression analysis with the outcome variable being membership class. The membership classes include group 1 (high trajectory over time, n=170), group 2 (increasing, n=80), and group 3 (low, n=20).
My aim is to identify the risk factors associated with being in the low trajectory group. I have 10 variables to test, and because of the limited sample size in group 3, I have chosen to first run a univariate regression for each risk factor and select those with a P-value <0.05 for inclusion in the multivariate regression model. However, some race categories have a count of 0 in group 3, which leads to an extremely high odds ratio (OR > 999.999). I would like to seek advice on the following:
1. Should I exclude the race variable from the model outright? Could I justify excluding it by stating that some race categories have zero counts, making them unusable in the analysis?
2. Given the imbalance in my data (170:80:20), could I weight by group in the model? The results look much better after weighting by group. Here is my code:
proc logistic data=a;
  class group (ref="3") race (ref="White") inccat (ref=">$150,000")
        hypertension_Disease (ref="0") diabetes_Disease (ref="0");
  model group = race inccat hypertension_Disease diabetes_Disease
        / link=glogit aggregate scale=none;
  weight group;
run;
3. I also tried oversampling, but I found that it might yield inaccurate results, so I'm hesitant to use this method.
Here is my code:
data have;
  set a;
run;
proc surveyselect data=have out=want sampsize=(170 80 80) method=urs outhits;
  strata group;
run;
Any comments are appreciated, thanks!
As a general comment, a number of SAS procedures have built-in model selection capabilities. In your case, take a look at PROC HPGENSELECT. You may find a better model-selection method there than filtering on univariate p-values.
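As a rough illustration of that suggestion, here is a minimal sketch of stepwise selection for a generalized logit model in PROC HPGENSELECT. It reuses the data set and variable names from the question; the choice of stepwise selection and of AIC as the criterion is an assumption made for illustration, not a recommendation, and the reference level '3' assumes group is coded as in the original post.

```sas
/* Sketch only: stepwise selection for a multinomial (generalized logit) model. */
/* Data set A and the variable names come from the question; the selection      */
/* method and the AIC criterion are illustrative assumptions.                   */
proc hpgenselect data=a;
   class race inccat hypertension_Disease diabetes_Disease;
   model group(ref='3') = race inccat hypertension_Disease diabetes_Disease
         / dist=multinomial link=glogit;
   selection method=stepwise(choose=aic);
run;
```

Automated selection does not remove the zero-count problem in race, so the cell counts should still be checked before the selected model is interpreted.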
Regardless, for your first question, if you have reason to believe that race has a strong association with your outcome, I would try to keep it in the model. You could consider collapsing some of the categories (e.g., white vs. not). Alternatively, you could consider collapsing the three membership classes into 2, for instance, high trajectory vs. not.
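A minimal sketch of the first option (collapsing race) might look like the following. It assumes race is a character variable whose stored value for the reference category is 'White'; the names a2 and race2 are hypothetical and introduced only for this example.

```sas
/* Sketch only: collapse race into two levels, then check the cell counts. */
/* A2 and RACE2 are hypothetical names; assumes RACE is character and      */
/* stores the value 'White' for the reference category.                    */
data a2;
   set a;
   length race2 $9;
   if missing(race) then race2 = ' ';          /* keep missing as missing */
   else if race = 'White' then race2 = 'White';
   else race2 = 'Non-White';
run;

/* Confirm that no race2-by-group cell is empty before modelling */
proc freq data=a2;
   tables race2*group / norow nopercent;
run;

/* Unweighted generalized logit model using the collapsed variable */
proc logistic data=a2;
   class race2 (ref='White') / param=ref;
   model group (ref='3') = race2 / link=glogit;
run;
```

If the cross-tabulation still shows an empty cell after collapsing, the odds ratio for that contrast will remain inestimable, and collapsing the outcome classes is the other option mentioned above.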
For your second question, don't weight the analysis unless you have a specific reason to do so, for example, if your data come from a survey and there are survey weights based on the sampling approach. If group is coded 1, 2, 3, then "weight group;" simply uses that code as the weight, which has no statistical justification. Likewise, I would advise against using the 'oversampling' procedure that you are showing; duplicating observations from the 'low' category so that you artificially have a larger sample size will not generalize well.
Concurring with @Mike_N here.
Two more remarks:
Be aware that several analysis methods are available for modelling a multinomial target besides logistic regression with a glogit link. For example, among others:
Koen
I am able to run PROC HPGENSELECT on either SAS 9.4 or SAS Studio. For some example code that I can run successfully, see here: https://go.documentation.sas.com/doc/en/statcdc/14.3/stathpug/stathpug_code_hpgenex1.htm