I am trying to run a model with logistic regression containing about 20 independent variables, both categorical and continuous.
However, I am finding that the significance varies depending on which variables I include and exclude, and I believe that there is association and collinearity among the variables.
As I am a new SAS user, is there any simple way to check for association among the variables in logistic regression?
Thank You
Not my area of expertise, but the following might help: http://support.sas.com/kb/32/471.html
Art, CEO, AnalystFinder.com
Without knowing much about your data (e.g., how many observations you have), 20 variables sounds like a lot and could be affecting things. There are some rules of thumb out there: in survival analysis I think it's called 'failures per variable' (FPV), and 10 is considered sufficient. There would be something analogous for logistic regression, I guess. Regarding associations among the variables, normally this would be based on an understanding of the data, i.e., it would be anticipated a priori rather than data-dependent. But if you want to examine correlations among the variables, that can be done even if the variables are of different types. For example, PROC CORR will give the biserial correlation, I think, or there's a macro for it: http://support.sas.com/kb/24/991.html
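A minimal sketch of such checks might look like the following. The dataset name (work.mydata) and every variable name are hypothetical placeholders, and the PROC REG step is just one common way to get collinearity diagnostics, not necessarily what the linked notes describe.

```sas
/* Hypothetical names: work.mydata, outcome (coded 0/1),
   continuous predictors age bmi income,
   categorical predictors sex smoker region. */

/* Pairwise correlations among the continuous predictors */
proc corr data=work.mydata pearson spearman;
   var age bmi income;
run;

/* Association between pairs of categorical predictors;
   the CHISQ option also reports Cramer's V */
proc freq data=work.mydata;
   tables sex*smoker sex*region smoker*region / chisq;
run;

/* Variance inflation factors and condition indices.
   Collinearity is a property of the predictors alone, so a linear
   model is enough to obtain the diagnostics. Categorical predictors
   would have to be dummy-coded first to be included here. */
proc reg data=work.mydata;
   model outcome = age bmi income / vif tol collin;
run;
quit;
```

Large VIF values, or several predictors loading on the same high condition index, would suggest that those predictors carry overlapping information.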
PROC LOGISTIC fits the model by maximum likelihood (MLE), unlike PROC REG, which uses ordinary least squares (OLS).
Usually SAS would do it for you automatically. Check PROC HPGENSELECT; it offers several variable selection methods, such as CV, LASSO, ...
I have a large number of observations, 200,000 weighted, so there should be no issue with the 20 variables from that stand point.
I am also just trying to find associations between the independent variables and the dependent variable, and am not interested in building a powerful model. However, when I add or remove some of the variables, a few of the other variables change significance drastically, sometimes becoming significant only after another variable is added to the model. I don't want to report an association that differs from what someone else would find when looking for the same associations (for example, if they use a slightly different selection of variables and get different significance results from mine, that would make my study seem inaccurate).
Thank you
In that case, the first thing I'd do (maybe you have already) is write a macro that fits the model for a single independent variable, and then run this macro for each of the 20 variables (some call these 'univariate models'), just to get a sense of things and to see which are the strongest predictors on their own. You could stop here because you are "not interested in building a powerful model". But if you want to see whether any variables are superfluous, you could then attempt a 'multivariate model' (a misnomer, but this is how some people describe it) using only those variables that looked good in the univariate models. With 200,000 obs, though, maybe every variable shows a small p-value. This approach is common in medical research, but it really depends on what you're doing. For example, see the 6 steps described in the methods section of this article: https://www.nature.com/articles/7211492
Edit: regarding whether others can reproduce your results, as long as you lay out your approach as they do in that article, I'd say it's fine.
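A rough sketch of such a macro is below. All names are hypothetical: work.mydata is the dataset, outcome is the 0/1 response, wt is the weight variable, and unilogit is just an illustrative macro name.

```sas
/* Hypothetical names throughout: work.mydata, outcome (0/1), wt.
   If wt is a survey sampling weight, PROC SURVEYLOGISTIC may be
   more appropriate than PROC LOGISTIC. */
%macro unilogit(var=, class=N);
   title "Single-predictor logistic model: &var";
   proc logistic data=work.mydata;
      /* Categorical predictors need a CLASS statement */
      %if %upcase(&class) = Y %then %do;
         class &var / param=ref;
      %end;
      model outcome(event='1') = &var;
      weight wt;
   run;
   title;
%mend unilogit;

/* One call per candidate predictor */
%unilogit(var=age)
%unilogit(var=bmi)
%unilogit(var=smoker, class=Y)
```

Each call produces the usual PROC LOGISTIC output for one predictor, so the unadjusted odds ratios can be compared side by side.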
" fits the model for a single independent variable, "
That is called perfect model. That is not right according to statistical theory.
I suggest using PROC HPGENSELECT to let SAS select variables for you.
Don't use selection=stepwise/forward/backward; try CV/LASSO/ELASTIC ... For more information, check the documentation for PROC HPGENSELECT.
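As a hedged starting point, a LASSO run could be sketched as follows. Of the methods mentioned, this uses only LASSO with its default settings; the dataset and variable names are hypothetical, and the documentation should be checked for the tuning options available in your SAS release.

```sas
/* Hypothetical names: work.mydata, outcome (coded 0/1),
   continuous age bmi income, categorical sex smoker region. */
proc hpgenselect data=work.mydata;
   class sex smoker region;
   model outcome(event='1') = age bmi income sex smoker region
         / dist=binary;
   /* LASSO selection with default settings */
   selection method=lasso;
run;
```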
A follow-up question: say an independent variable has a significant association in the "univariate" analysis and a non-significant one in the "multivariate" analysis. Can I make any use of the adjusted odds ratio for that variable if the p-value is non-significant?
I have seen studies that list the adjusted odds ratio without a p-value, so I am wondering whether it holds any importance when it is non-significant.
Thank You
Another question: if a categorical variable has a non-significant association in the multivariate analysis under "Analysis of Maximum Likelihood Estimates", but the "Type 3 Analysis of Effects" table shows that it is significant, what does that mean and how can it be interpreted?
@sasnewbie12 wrote:
I am trying to run a model with logistic regression containing about 20 independent variables, both categorical and continuous.
However, I am finding that the significance varies depending on which variables I include and exclude, and I believe that there is association and collinearity among the variables.
As I am a new SAS user, is there any simple way to check for association among the variables in logistic regression?
Thank You
You keep asking the same questions over and over, and my answers don't change, just because you ask again three weeks later. I repeat my answer given here: https://communities.sas.com/t5/SAS-Statistical-Procedures/multivariate-logistic-regression-variable-...
Variable selection is fundamentally a poor approach when you have many correlated variables. It doesn't matter if you are new to SAS or experienced in SAS or using R or Python or Minitab. It is not the software that makes it a poor approach.
At that link, I reference a method of performing Logistic Partial Least Squares regression, fundamentally a superior approach. There is R code to do this, but I am not aware of SAS code to do this. However, since you can run R code through SAS PROC IML, that seems to be the approach I would take.
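For what it's worth, the mechanics of calling R from PROC IML can be sketched like this. It assumes SAS/IML is licensed, R is installed on the same machine, and SAS was started with the RLANG system option; work.mydata and df are hypothetical names, and the R body is only a placeholder, not the logistic PLS code itself.

```sas
/* Requires SAS/IML, a local R installation, and the RLANG
   system option. work.mydata is a hypothetical dataset name. */
proc iml;
   /* Copy the SAS dataset into an R data frame named df */
   call ExportDataSetToR("work.mydata", "df");

   submit / R;
      # Plain R from here to endsubmit.
      # The logistic PLS code would go here; str() just confirms
      # that the data arrived in R.
      str(df)
   endsubmit;
quit;
```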