Solved: Re: Unbalanced data, Using Weight in Multinomial Regression Model

GiaLee · Posted 02-19-2024 01:20 AM

Hi,

I am conducting a multinomial regression analysis with the outcome variable being membership class. The membership classes include group 1 (high trajectory over time, n=170), group 2 (increasing, n=80), and group 3 (low, n=20).

My aim is to identify the risk factors associated with being in the low trajectory group. I have 10 variables to test and, because of the small sample size in group 3, I chose to first run univariate regression analyses for each risk factor and carry those with a p-value < 0.05 into the multivariate regression model. However, some race categories have a count of 0 in group 3, which leads to an extremely high OR (>999.999). I would like advice on the following:

1. Should I directly exclude the race variable from the model? Could I justify excluding it by stating that some race categories have zero counts, making them unusable in the analysis?

2. Given the imbalance in my data (170:80:20), could I weight the groups in the model? The results look much better after weighting by group. Here is my code:

proc logistic data=a;
   class group (ref="3") race (ref="White") inccat (ref=">$150,000")
         hypertension_Disease (ref="0") diabetes_Disease (ref="0");
   model group = race inccat hypertension_Disease diabetes_Disease
         / link=glogit aggregate scale=none;
   weight group;
run;

3. I also tried oversampling, but I found that it might yield inaccurate results, so I'm hesitant to use this method.

Here is my code:

data have;
   set a;
run;

proc surveyselect data=have out=want sampsize=(170 80 80) method=urs outhits;
   strata group;
run;

Any comments are appreciated, thanks!

Mike_N · Posted 02-19-2024 10:20 AM

As a general comment, a number of SAS procedures have built-in model selection capabilities. In your case, take a look at PROC HPGENSELECT. You may find a better model-selection method there than filtering on p-values.

Regardless, for your first question, if you have reason to believe that race has a strong association with your outcome, I would try to keep it in the model. You could consider collapsing some of the categories (e.g., white vs. not). Alternatively, you could consider collapsing the three membership classes into 2, for instance, high trajectory vs. not.

For your second question, don't weight the analysis unless you have a specific reason to do so, for example, if your data come from a survey and there are survey weights based on the sampling design. Likewise, I would advise against the 'oversampling' procedure you show; duplicating observations from the 'low' category to artificially enlarge the sample will not generalize well.
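
The category-collapsing suggestion above could look roughly like the sketch below. It is only an illustration under assumptions: the data set A and the value "White" come from the original post, race is assumed to be a character variable, and the names A2 and RACE2 are hypothetical.

```sas
/* Sketch only: A and "White" are from the original post; A2 and RACE2 are
   hypothetical names, and RACE is assumed to be a character variable. */
data a2;
   set a;
   length race2 $9;
   if missing(race) then race2 = " ";          /* keep missing as missing */
   else if race = "White" then race2 = "White";
   else race2 = "Non-White";
run;

/* Check that no cell of the group-by-race2 table is empty before refitting */
proc freq data=a2;
   tables group*race2 / norow nocol nopercent;
run;

/* Refit the generalized logit model with the collapsed variable, unweighted */
proc logistic data=a2;
   class race2 (ref="White") / param=ref;
   model group(ref="3") = race2 / link=glogit;
run;
```

If the PROC FREQ table still shows a zero cell for group 3, collapsing further (or collapsing the outcome, as mentioned above) would be the next thing to try.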

sbxkoenk · Posted 02-19-2024 10:52 AM

Concurring with @Mike_N here.

Two more remarks:

1. The smallest group size limits the number of covariates that you can include in the model without over-fitting. So, for a group size of 20, you might be able to have 1 or 2 covariates in the model (3 already seems too many to me!).
2. Univariate "screening" of covariates before entering them into the multivariable logistic model (whether binary, ordinal, or multinomial) is to be avoided. Better model selection methods exist in PROC (HP)GENSELECT.
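
As a rough illustration of the second remark, a selection run in PROC HPGENSELECT might be set up as below. This is a sketch, not a recommendation of a particular method: the variable names are taken from the original post, while stepwise selection with SBC as the choose criterion is an assumption made only for the example.

```sas
/* Sketch only: variable names are from the original post; the selection
   method and criterion (stepwise, SBC) are illustrative assumptions. */
proc hpgenselect data=a;
   class race inccat hypertension_Disease diabetes_Disease;
   model group(ref="3") = race inccat hypertension_Disease diabetes_Disease
         / dist=multinomial link=glogit;
   selection method=stepwise(choose=sbc);
run;
```

With only 20 observations in the smallest group, any automated selection should still be read with caution, in line with the first remark.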

Be aware that there are multiple analysis methods available for modelling a multinomial target besides logistic regression with a glogit link. For example, among others:

Multiple-group discriminant function analysis: A multivariate method for multinomial outcome variables

Koen

GiaLee · Posted 02-19-2024 04:46 PM

Thank you very much!
I haven't heard of multiple-group discriminant function analysis before. I'll take a look. Thanks!

GiaLee · Posted 02-19-2024 04:43 PM

Thank you!
Is PROC HPGENSELECT only available in SAS 9.4? I tried it in SAS Studio and got the error message "The data set test must utilize a CAS engine libref."

Mike_N · Posted 02-20-2024 09:44 AM

I am able to run PROC HPGENSELECT on either SAS 9.4 or SAS Studio. For some example code that I can run successfully, see here: https://go.documentation.sas.com/doc/en/statcdc/14.3/stathpug/stathpug_code_hpgenex1.htm .

GiaLee · Posted 02-20-2024 10:39 PM

Thank you very much! It works!

GiaLee · Posted 02-20-2024 11:38 PM

I have a few additional questions regarding this model:
1. Odds ratio calculation: I ran a multinomial model with both PROC HPGENSELECT and PROC HPLOGISTIC, but neither provided odds ratios. Both procedures generated the same parameter estimates. Could I calculate the odds ratios and CIs by exponentiating the parameter estimates?

2. Missing data: Is it possible to use multiple imputation with PROC HPGENSELECT? I attempted to summarize the results using PROC MIANALYZE but failed.

3. Model selection: After using PROC HPGENSELECT to select two variables, could I choose these variables and conduct a standard logistic regression analysis? This would allow me to easily obtain odds ratios and impute missing data.

Best,
Gia
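
On question 1, the usual Wald calculation is OR = exp(estimate) with limits exp(estimate ± z × standard error). A minimal sketch of that arithmetic follows. The data set EST and its values are made up purely for illustration; in practice the estimates and standard errors would come from the procedure's parameter estimates table.

```sas
/* Illustrative values only: these estimates and standard errors are made up. */
data est;
   input parameter :$20. estimate stderr;
   datalines;
hypertension 0.85 0.40
diabetes -0.30 0.55
;
run;

/* Exponentiate the estimate and its Wald limits to get the OR and 95% CI */
data or_ci;
   set est;
   z          = quantile("NORMAL", 0.975);   /* about 1.96 for a 95% interval */
   odds_ratio = exp(estimate);
   lower_cl   = exp(estimate - z*stderr);
   upper_cl   = exp(estimate + z*stderr);
run;

proc print data=or_ci noobs;
   var parameter estimate stderr odds_ratio lower_cl upper_cl;
run;
```

This reading assumes a main-effects model in which each estimate compares one level with the reference level, within one response-category equation of the generalized logit model. With interactions or a different parameterization, a single exponentiated estimate is no longer the odds ratio of interest.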
