06-16-2017 07:01 AM
I have fitted a logistic regression with a dichotomous dependent variable, 2 continuous independent variables, 14 dichotomous independent variables and 1 multi-level independent variable.
All independent variables except one were significant in univariate logistic regression. I have a problem with one specific variable, in the univariate model and therefore also in the multivariate model. For this variable PROC LOGISTIC reports an odds ratio estimate of >999.999 with 95% confidence limits of >999.999 to >999.999.
However, if I calculate the odds ratio with PROC FREQ from the table of dependent variable * independent variable, I get a point estimate of 1212.8991 with 95% confidence limits of 1031.0208 to 1426.8618.
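For reference, an odds ratio from a 2x2 table is normally the cross-product ratio, with Wald confidence limits computed on the log scale. The following is a minimal Python sketch of that arithmetic, not the PROC FREQ implementation. The function name `odds_ratio_wald` and the four cell counts are hypothetical; the real cell counts are not given in this thread, so the numbers below only mimic a very unbalanced table of about 164,000 observations.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.959963984540054):
    """Cross-product odds ratio and Wald confidence limits for a 2x2 table.

    a, b = first row of the table, c, d = second row.
    z is the normal quantile (default: two-sided 95% limits).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Hypothetical counts (assumption, not the poster's data); they sum to 164000
# and the smallest cell holds 147 observations.
a, b, c, d = 4000, 147, 3500, 156353
or_hat, lower, upper = odds_ratio_wald(a, b, c, d)
print(f"OR = {or_hat:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The width of the interval is driven mostly by the smallest cell, because its reciprocal dominates the standard error of log(OR).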
This is a serious problem because the variable is too important to exclude, and I know the cause is the unbalanced data (one cell has 147 observations out of 164,000).
I have also tried Firth penalization and exact analysis in PROC LOGISTIC, both without success. I was also considering PROC GLIMMIX, but maybe I have not found the correct options to include…
What can I do to keep that variable in my model? Which approach can I try?
Please help me, thank you
06-16-2017 08:04 AM
For what it's worth, when you have many input variables and they are correlated with one another, it is my opinion (and also the opinion of many others) that you cannot really determine which variables are important, and you cannot determine the exact amount of each variable's importance independent of the other variables -- which seems to be what you are trying to do.
The best you can do in this situation is to find a model that fits the data well and gives you good predictions. That is achievable here, and maybe the model you already have fits well enough.
Also, the idea of comparing a univariate regression or PROC FREQ analysis to the results of your multiple input variable regression seems a bit strained, as they don't have to match.
06-16-2017 08:37 AM
06-16-2017 04:21 PM
I have tested a set of variables with univariate logistic regression to select the variables to test in the complete model.
I'm not sure that's a valid approach.