sasnewbie12
Obsidian | Level 7

I am assessing for outcome "eventX" with survey data.

 

One variable, "diseaseX", has an association with p=0.023 on a univariate chi-square test.

 

When placed in the multivariable regression model with several other variables, it has a lower p-value of 0.0003. SAS does not give any messages about correlation, and the model converges.

 

If this is due to some kind of association where one variable reinforces another (I forgot what that's called), how can I find which variable it is? Otherwise, how should I deal with this? Is it OK to leave the variable in the model if there is such an association?

 

Please explain. Thanks 

Accepted Solution
PaigeMiller
Diamond | Level 26

This is a major drawback to having multiple independent variables which are correlated with one another. If you add (or subtract) a variable from the model, the estimated regression coefficient can change (sometimes dramatically) and the p-value can change (sometimes dramatically). 

 

How can you deal with this? Well, fundamentally, I think that variable selection strategies are flawed when the independent variables are highly correlated, and further, there is no logical way to determine the effect of variable x1 independently of the other variables.

 

So that leads me to a method that is different conceptually. It includes all variables, so there is no issue of variable selection. It does not try to determine the unique independent effect of each variable; it tries to determine a good predictive model. That method is Partial Least Squares Regression (PROC PLS in SAS). For the logistic case, you could either model 0/1 responses, or you could use this method: https://cedric.cnam.fr/fichiers/RC906.pdf
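As a rough illustration of the "model 0/1 responses" option, a PROC PLS call might look like the sketch below. The data set name `mysurvey` and the predictors `x2` and `x3` are placeholders that do not come from this thread, and the sketch assumes eventX is coded as a numeric 0/1 variable. It ignores the survey design entirely.

```sas
/* Minimal sketch, not a definitive analysis.                        */
/* Assumptions: mysurvey, x2 and x3 are hypothetical names;          */
/* eventX and diseaseX are numeric 0/1 variables.                    */
proc pls data=mysurvey method=pls cv=split;
   /* All candidate predictors go in; no variable selection is done. */
   model eventX = diseaseX x2 x3;
run;
```

The CV= option asks PROC PLS to choose the number of extracted factors by cross validation rather than fixing it in advance.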

--
Paige Miller


sasnewbie12
Obsidian | Level 7

How do I use this for survey data?

I need to account for strata, clusters, and weights.

I am currently using PROC SURVEYLOGISTIC.
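For readers following along, the comparison described in the question (unadjusted versus adjusted p-value for diseaseX) could be set up roughly as below. This is only a sketch: `mysurvey`, `stratum_id`, `psu_id`, `survey_wt`, `x2` and `x3` are hypothetical names, and eventX and diseaseX are assumed to be numeric 0/1 variables.

```sas
/* Sketch only; all data set and design variable names are assumed.  */

/* Unadjusted model: diseaseX alone. */
proc surveylogistic data=mysurvey;
   strata  stratum_id;
   cluster psu_id;
   weight  survey_wt;
   model eventX(event='1') = diseaseX;
run;

/* Adjusted model: diseaseX plus the other covariates. */
proc surveylogistic data=mysurvey;
   strata  stratum_id;
   cluster psu_id;
   weight  survey_wt;
   model eventX(event='1') = diseaseX x2 x3;
run;
```

Adding the covariates one at a time and watching how the diseaseX estimate and its standard error move is one informal way to see which covariate is responsible for the change.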

PaigeMiller
Diamond | Level 26

As there is no WEIGHT statement in PROC PLS, it's not going to fit your problem, but SURVEYLOGISTIC doesn't really do a good job either in the case of many correlated x-variables.

--
Paige Miller
sasnewbie12
Obsidian | Level 7

I haven't seen the warning "WARNING: The information matrix is singular and thus the convergence is questionable" and I am not getting any errors in the log. However, there may still be some other association between variables.

 

I wonder if there is any way I can see which variables are associated with each other, so that I can find the problem variables and then remove them manually.

 

 

 

Reeza
Super User

PROC CORR does support the use of weights, but not strata. I think it may give you a starting point.
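A minimal sketch of that starting point, assuming a data set called `mysurvey`, a weight variable called `survey_wt`, and numeric predictors `x2` and `x3` (all hypothetical names):

```sas
/* Weighted Pearson correlations among the candidate predictors.     */
/* Strata and clusters are NOT accounted for here.                   */
/* mysurvey, survey_wt, x2 and x3 are assumed names.                 */
proc corr data=mysurvey;
   var diseaseX x2 x3;
   weight survey_wt;
run;
```

Pairs with correlations close to 1 or -1 are the first candidates to look at, although pairwise correlations will not reveal a variable that is nearly a combination of several others.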

 

 

PaigeMiller
Diamond | Level 26

@sasnewbie12 wrote:

I haven't seen the warning "WARNING: The information matrix is singular and thus the convergence is questionable"  and I am not getting any errors in the log statement. However, there is some other possible association between variables. 

 


If the correlation between variables is not 1 or –1, then you will not get such a warning. If the correlation between two independent variables is (for example) 0.99, you will not get the warning, but you will get the problem you mentioned above that the model coefficients and the significance of the coefficients can change drastically when you add or remove variables from the model.
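One common way to look for this kind of near-collinearity is to request variance inflation factors from PROC REG using the same predictors. This is a swapped-in diagnostic, not something the survey procedure produces, and the names `mysurvey`, `survey_wt`, `x2` and `x3` are placeholders. PROC REG has no CLASS statement, so categorical predictors would need to be dummy coded first.

```sas
/* Collinearity diagnostics only; the fitted linear model itself is  */
/* not of interest. VIF and tolerance depend only on the predictors, */
/* so using the 0/1 outcome as the response here is harmless.        */
/* mysurvey, survey_wt, x2 and x3 are assumed names.                 */
proc reg data=mysurvey;
   weight survey_wt;
   model eventX = diseaseX x2 x3 / vif tol collin;
run;
quit;
```

A large VIF for a predictor suggests it is close to a linear combination of the others, even when no singularity warning appears in the log.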

 

"I wonder if there is any way I can see which variables are associated with each other, so that I can find the problem variables and then remove them manually."

 

Maybe the idea from @Reeza can be modified to allow PROC CORR to show you the correlations, but then the problem remains: which of the correlated variables do you remove? How would you choose? What if you get better predictions by leaving the correlated variables in the model? This is why I like the concept of Partial Least Squares: it has none of these difficulties, and it handles correlated predictor variables better than most other methods. It does not handle the strata, clusters and weights, but after some Googling (is that a real word?) I found this article: Optimized sample-weighted partial least squares, which apparently is a version of Partial Least Squares that would work with survey data (if I am understanding the abstract properly). Of course, there is no SAS code for this, a major drawback.

--
Paige Miller
PaigeMiller
Diamond | Level 26

Reply above was updated apparently after @sasnewbie12 read it and clicked on Like. I added the results of my Google search for a PLS method that could be used with survey data.

--
Paige Miller
