GiaLee
Obsidian | Level 7

Hi,

 

I am fitting a multinomial logistic regression in which the outcome variable is trajectory membership class. The membership classes are group 1 (high trajectory over time, n=170), group 2 (increasing, n=80), and group 3 (low, n=20).

My aim is to identify the risk factors associated with being in the low trajectory group. I have 10 variables to test and, because of the limited sample size in group 3, I chose to first run a univariate regression for each risk factor and carry those with a P-value <0.05 into the multivariate regression model. However, some race categories have a count of 0 in group 3, which leads to extremely large odds ratio estimates (>999.999). I would like advice on the following:

 

1. Should I simply exclude the race variable from the model? Could I justify excluding it by stating that some race categories have zero counts, making them unusable in the analysis?


2. Given the imbalance in my data (170:80:20), could I weight the observations by group in the model? The results look much better after weighting by group. Here is my code:

 

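proc logistic data=a;
  /* reference levels for the categorical predictors */
  class race (ref="White") inccat (ref=">$150,000")
        hypertension_Disease (ref="0") diabetes_Disease (ref="0");
  /* the reference level of the response is set in the MODEL statement */
  model group (ref="3") = race inccat hypertension_Disease diabetes_Disease
        / link=glogit aggregate scale=none;
  /* weights each observation by its group number (the approach asked about) */
  weight group;
run;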

 

3. I also tried oversampling, but I found that it might yield inaccurate results, so I'm hesitant to use this method.

Here is my code: 

 

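/* PROC SURVEYSELECT expects the input sorted by the STRATA variable */
proc sort data=a out=have;
  by group;
run;

/* sample with replacement (URS) to 170, 80, and 80 records in groups 1-3 */
proc surveyselect data=have out=want method=urs outhits
                  sampsize=(170 80 80);
  strata group;
run;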

 

Any comments are appreciated, thanks! 

 
Accepted Solutions
Mike_N
SAS Employee

As a general comment, a number of SAS procedures have built-in model selection capabilities. In your case, take a look at PROC HPGENSELECT. You may find a better model-selection method there than filtering based on p-values.
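
To make that suggestion concrete, here is a minimal sketch of what a PROC HPGENSELECT call might look like for this problem. It reuses the data set and variable names from the PROC LOGISTIC code in the question; the stepwise method and the SBC criterion are assumptions chosen for illustration, not a recommendation for these data, so check the SELECTION statement documentation for your SAS/STAT release before relying on it.

```sas
/* Sketch only: data set and variable names are copied from the question */
proc hpgenselect data=a;
   class race inccat hypertension_Disease diabetes_Disease;
   /* generalized logit model with group 3 as the reference outcome */
   model group(ref='3') = race inccat hypertension_Disease diabetes_Disease
         / distribution=multinomial link=glogit;
   /* stepwise selection; final model chosen by SBC (an assumed criterion) */
   selection method=stepwise(choose=sbc);
run;
```

With only 20 observations in the smallest group, any automated selection should still be read with caution.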

 

Regardless, for your first question, if you have reason to believe that race has a strong association with your outcome, I would try to keep it in the model. You could consider collapsing some of the categories (e.g., White vs. non-White). Alternatively, you could consider collapsing the three membership classes into two, for instance, high trajectory vs. not.
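
As an illustration of the first option, the sketch below collapses race into two levels, checks the cross-tabulation for empty cells, and refits the model. It assumes race is a character variable whose reference value is stored as 'White' (as the ref= option in the question suggests); race2 and a2 are hypothetical names introduced here.

```sas
/* Sketch: collapse race into White vs. Non-White (race2 is a made-up name) */
data a2;
   set a;
   length race2 $9;
   if missing(race) then race2 = ' ';          /* keep missing as missing */
   else if race = 'White' then race2 = 'White';
   else race2 = 'Non-White';
run;

/* confirm that no race2-by-group cell is empty before modelling */
proc freq data=a2;
   tables race2*group / norow nopercent;
run;

proc logistic data=a2;
   class race2 (ref='White') / param=ref;
   model group (ref='3') = race2 / link=glogit;
run;
```

If a cell is still empty after collapsing, the odds ratio for that comparison will remain non-estimable, and a different grouping would be needed.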

 

For your second question, don't weight the analysis unless you have a specific reason to do so, for example, if your data come from a survey and there are survey weights based on the sampling approach. Likewise, I would advise against using the 'oversampling' procedure that you are showing; duplicating observations from the 'low' category so that you artificially have a larger sample size will not generalize well.

sbxkoenk
SAS Super FREQ

Concurring with @Mike_N here.

 

Two more remarks:

  • The smallest group size sets the limit for the number of covariates that you can include in the model without over-fitting. So, for a group size of 20, you might be able to have 1 or 2 covariates in the model (3 already seems too many to me!).
  • The univariate "screening" of covariates before entering them into the multivariable logistic model (whether binary, ordinal, or multinomial) is to be avoided. Better model selection methods exist in PROC (HP)GENSELECT.

Be aware that there are multiple analysis methods available for modelling a multinomial target besides a logistic regression with glogit link. For example, among others:

  • Multiple-group discriminant function analysis: A multivariate method for multinomial outcome variables

Koen

GiaLee
Obsidian | Level 7
Thank you very much!
I haven't heard of multiple-group discriminant function analysis before. I'll take a look. Thanks!
GiaLee
Obsidian | Level 7
Thank you!
Is PROC HPGENSELECT only available in SAS 9.4? I tried it in SAS Studio and got the error message "The data set test must utilize a CAS engine libref."
Mike_N
SAS Employee

I am able to run PROC HPGENSELECT on either SAS 9.4 or SAS Studio. For some example code that I can run successfully, see here: https://go.documentation.sas.com/doc/en/statcdc/14.3/stathpug/stathpug_code_hpgenex1.htm .

GiaLee
Obsidian | Level 7
Thank you very much! It works!
GiaLee
Obsidian | Level 7
I have a few additional questions regarding this model:
1. Odds ratio calculation: I ran the multinomial model with both PROC HPGENSELECT and PROC HPLOGISTIC, but neither reported odds ratios. Both procedures produced the same parameter estimates. Could I calculate the odds ratios and confidence intervals by exponentiating the parameter estimates?

2. Missing data: Is it possible to use multiple imputation with PROC HPGENSELECT? I attempted to combine the results using PROC MIANALYZE but failed.

3. Model selection: After using PROC HPGENSELECT to select two variables, could I choose these variables and conduct a standard logistic regression analysis? This would allow me to easily obtain odds ratios and impute missing data.

Best,
Gia
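
For reference on question 1: in a generalized logit model each coefficient is a log odds ratio relative to the reference outcome, so exponentiating the estimate gives the odds ratio, and exponentiating estimate ± z × standard error gives a Wald confidence interval. A minimal sketch follows, assuming the ODS table ParameterEstimates has columns named Estimate and StdErr (check with PROC CONTENTS on your system); pe and or_ci are hypothetical data set names.

```sas
/* Sketch: save the estimates table from the fit (names reused from the question) */
ods output ParameterEstimates=pe;
proc hpgenselect data=a;
   class race inccat hypertension_Disease diabetes_Disease;
   model group(ref='3') = race inccat hypertension_Disease diabetes_Disease
         / distribution=multinomial link=glogit;
run;

/* exponentiate each estimate and its 95% Wald limits */
data or_ci;
   set pe;
   z         = quantile('normal', 0.975);   /* about 1.96 */
   OddsRatio = exp(Estimate);
   LowerCL   = exp(Estimate - z * StdErr);
   UpperCL   = exp(Estimate + z * StdErr);
run;

proc print data=or_ci;
run;
```

Rows for the intercepts are exponentiated too, but they are not odds ratios and can be ignored.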
