☑ This topic is solved.
David_M
Calcite | Level 5

I'm new to SAS (v9.4) and statistics in general.

 

I want to do an ordinal logistic regression (N = 430 employees). My dependent and independent variables are ordinal: Job Satisfaction (scaled from 1 to 5) and Work Respect (scaled from 1 to 4). I have 35 potential covariates (confounders), and I want to arrive at a reduced set of variables that are largely de-correlated from each other before adding them to my model. These covariates are a mix of binary, ordinal, continuous, and nominal variables.

 

I'm at a loss as to which SAS procedure(s) to use to remove or drastically reduce the co-dependencies among these variables. Will de-correlation via GVIF work for all variable types, or will a CATPCA analysis be enough? The SAS procedures I've looked at either work for categorical variables only (ordinal and nominal) or for some combination of three of the four variable types.

 

What do you recommend I do to eliminate collinearity amongst mixed variables?

 

PS

Grok 3 says I need to run a separate de-correlation procedure suitable for each variable type. I'm hesitant to believe it for now. I am concerned that, for example, a reduced continuous variable set might still correlate with a reduced ordinal set if I perform separate analyses.

 

EDIT ... made a slight change for better clarity.


20 REPLIES
PaigeMiller
Diamond | Level 26

You can't "eliminate or reduce" collinearity. It is what it is.

 

You can, however, choose estimation methods that are robust to collinearity. The primary method I recommend is Partial Least Squares (PROC PLS). In this paper, Tobias fits a PLS model to data with 1,000 highly correlated x-variables and comes up with a useful model. Or you can use other methods and spend a huge amount of time (which Tobias didn't need to do) selecting a subset of your variables. To me the choice is simple.
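As a minimal sketch of what this looks like in practice (the dataset WORK.SURVEY and all variable names below are hypothetical, not from the original post), a PROC PLS call that mixes categorical and continuous predictors might be:

```sas
/* Sketch only: WORK.SURVEY and all variable names are hypothetical.
   CV=ONE requests leave-one-out cross validation to choose the
   number of extracted PLS factors. */
proc pls data=work.survey cv=one;
   class work_respect sex race;            /* categorical predictors  */
   model job_sat = work_respect sex race age tenure income;
   output out=pls_scores xscore=xscr;      /* uncorrelated X-scores   */
run;
```

The extracted factors (X-scores) are mutually orthogonal by construction, which is why PLS tolerates heavy collinearity among the raw predictors.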

--
Paige Miller
David_M
Calcite | Level 5

Thank you for the fast response. Based on the article you posted and some other reading, yes, PLS is a great de-correlator when used with highly collinear variables. But in what I've read so far, all of these variables are continuous only. Am I correct in this?

 

I corrected my original question to say that I want a reduced set of mixed-type variables that are de-correlated from each other.

Season
Barite | Level 11

@David_M wrote:

Thank you for the fast response. Based on the article you posted and some other reading, yes, PLS is a great de-correlator when used with highly collinear variables. But in what I've read so far, all of these variables are continuous only. Am I correct in this?

 


The independent variables of PLS are chiefly continuous ones. To me, that is a major limitation of PLS. To the best of my knowledge, there have been variants of PLS that can accommodate categorical independent variables as well (e.g., Treatment of Categorical Variables with Missing Values Using PLS Regression | SpringerLink).

Another way to deal with this is to use penalized regression, such as ridge regression or the LASSO.
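As a hedged sketch of the LASSO route in SAS (the dataset WORK.SURVEY and all variable names are hypothetical): PROC GLMSELECT accepts CLASS predictors, so mixed variable types are not a problem.

```sas
/* Sketch only: WORK.SURVEY and all variable names are hypothetical.
   STOP=NONE traces the full LASSO path; CHOOSE=SBC picks the model
   with the best Schwarz Bayesian criterion along that path. */
proc glmselect data=work.survey;
   class sex race work_respect;
   model job_sat = sex race work_respect age tenure income
         / selection=lasso(stop=none choose=sbc);
run;
```

The LASSO shrinks some coefficients exactly to zero, so it doubles as a variable-reduction step for correlated predictors.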

PaigeMiller
Diamond | Level 26

PROC PLS can handle categorical predictors. It is a built-in feature, using the CLASS statement. So predictors do not need to be continuous; they can be categorical.

 

If your response is binary or multinomial, there are versions of PLS that work in that case (the PLS equivalent of logistic regression). See https://cedric.cnam.fr/fichiers/RC1060.pdf. There is an R package that will perform the calculations, and I think there is a Python package as well. There is a SAS macro that I wrote to perform logistic PLS, but I am not at liberty to share it, as it is proprietary and belongs to my employer. I have used this logistic PLS many times, and I find it works well and performs exactly as I would expect.

 

Regarding doing some type of principal components analysis on your data: this is an inferior approach, because PCA finds dimensions of your x-variables that need not be predictive of your response, since PCA does not use the y-variable at all. PLS finds dimensions of your x-variables that are predictive of your response (if the data contains predictive ability), because PLS specifically uses the y-variable in the algorithm to find predictive dimensions.
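The contrast can be seen side by side in a sketch (dataset and variable names are hypothetical): PROC PRINCOMP builds components from the x-variables alone, while PROC PLS extracts factors using the response.

```sas
/* PCA: components are chosen without any reference to the response. */
proc princomp data=work.survey out=pca_scores n=5;
   var age tenure income hours commute;
run;

/* PLS: factors are chosen to be predictive of job_sat. */
proc pls data=work.survey nfac=5;
   model job_sat = age tenure income hours commute;
run;
```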

--
Paige Miller
Season
Barite | Level 11

Thanks for the feedback. But to the best of my knowledge, the "generic" PLS algorithm treats categorical independent variables the same way it treats continuous ones. In other words, the fact that categorical variables take only a finite set of values is not taken into consideration by the "generic" PLS algorithm. I am not sure to what extent adapted PLS algorithms, like the one I cited in my previous reply to @David_M, take this into account, as I have not studied PLS in recent years.

PaigeMiller
Diamond | Level 26

I don't know what the handling of categorical variables with missing values has to do with this thread.

 

Clearly, PLS has been modified to include categorical predictors, as SAS has been able to do this for at least 20 years now. I don't know whether SAS gets the credit for this invention, or whether it had been published previously elsewhere.

--
Paige Miller
David_M
Calcite | Level 5

Thanks for the excellent referrals ... I read the SAS Help you alluded to regarding the CLASS statement, which says: "Typical classification variables are Treatment, Sex, Race, Group, and Replication. If you use the CLASS statement, it must appear before the MODEL statement. Classification variables can be either character or numeric. By default, class levels are determined from the entire set of formatted values of the CLASS variables."

 

Are they saying (fingers crossed) that I can define all variable types (binary, categorical, and continuous) in this statement and all is golden? I see "Sex" as binary, and "Race" and possibly "Treatment" as nominal. I am not sure what types of variables "Group" and "Replication" might be.

 

Thanks for the PLS referral, by the way, and for the point that PCA is not responsive to the Y variable, only to the X variables. I also appreciate your patience with my newbie-ness.

PaigeMiller
Diamond | Level 26

@David_M wrote:

 

Are they saying (fingers crossed) that I can define all variable types (binary, categorical and continuous) in this statement and all is golden? I see "Sex" as binary, "Race" and possibly "Treatment" as nominal. Not sure what type of variables Group and Replication might be.


SAS has only two types of variables, character and numeric. SAS modeling PROCs treat variables as either categorical or continuous.

 

Are they saying (fingers crossed) that I can define all variable types (binary, categorical and continuous) in this statement and all is golden?

 

You can interpret the variables as binary, categorical, or continuous as you wish. At a certain point, PLS is no longer optimal (if it ever was optimal), and you may want to consider other modeling methods, as all is not golden in some situations. That is true of any modeling approach.

--
Paige Miller
Ksharp
Super User

For a logistic model, you could try the CORRB option to check the correlation between the estimated coefficients.

 

proc logistic data=sashelp.heart;
   class sex;
   model status = sex ageatstart weight height diastolic systolic / corrb;
run;

[Attached image: CORRB output showing the estimated correlation matrix of the parameter estimates]

 

David_M
Calcite | Level 5

Interesting, and thank you ... this fits with @PaigeMiller's statement that you can define categorical variables in the CLASS statement of PROC PLS. So will the PROC LOGISTIC example you've shown accomplish the same goal of eventually producing a set of de-correlated variables for my model?

PaigeMiller
Diamond | Level 26

So will the PROC LOGISTIC example you've shown accomplish the same goal of eventually producing a set of de-correlated variables for my model?

 

PROC LOGISTIC does not produce de-correlated variables. The variables remain correlated. (You can choose to remove certain variables from the model with the goal of reducing the effect of multicollinearity on the model fit, but the ones that remain will still be correlated.)

 

That's really the goal of this thread, as I have understood it. We want to find estimation methods that are less affected by collinearity, not the removal of collinearity, which isn't possible.

--
Paige Miller
David_M
Calcite | Level 5

My original goal (which I hope I stated clearly, and apologies if I didn't) was to reduce the 35 highly correlated mixed-type variables to a smaller number that are not, or are much less, correlated with each other. Won't PROC LOGISTIC with the CORRB option show correlation coefficients that could be used to target highly correlated variables for elimination? Won't the PLS procedure you suggested do something similar?

PaigeMiller
Diamond | Level 26

@David_M wrote:

My original goal (which I hope I stated clearly, and apologies if I didn't) was to reduce the 35 highly correlated mixed-type variables to a smaller number that are not, or are much less, correlated with each other. Won't PROC LOGISTIC with the CORRB option show correlation coefficients that could be used to target highly correlated variables for elimination?


Maybe, maybe not. You could, for example, remove one of two variables whose coefficients are highly correlated with each other, but you might remove the better predictor of the two, which would not be good. Moreover, multicollinearity takes other forms, such as a linear combination of three or more variables producing an almost constant result. In that case, the correlation of the coefficients won't show it, because it looks only at pairwise correlations; this multicollinearity cannot be seen pairwise, only by examining the three or more variables together. It will still impact the quality of the model fit, even though it may not be visible from pairwise correlations.
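A hedged sketch of how such multi-variable collinearity can be diagnosed in SAS (dataset and variable names are hypothetical): the COLLIN option in PROC REG reports condition indices and variance proportions, which can expose a near-constant linear combination of three or more predictors that no pairwise correlation reveals.

```sas
/* Sketch only: WORK.SURVEY and all variable names are hypothetical.
   VIF flags inflated coefficient variances; COLLIN reports the
   condition indices and variance-decomposition proportions. */
proc reg data=work.survey;
   model job_sat = age tenure income hours commute / vif collin;
run;
```

A large condition index with high variance proportions spread across several variables points to exactly the multi-variable form of collinearity described above.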

 

Won't the PLS procedure you suggested do something similar?

 

No, that's not what PLS does. PLS leaves all the variables in the model and fits the model in such a way that it is robust to the effects of multicollinearity. In the paper by Tobias, he has 1,000 predictor variables, the PLS model uses all 1,000 of them, and Tobias concludes that the model is useful.

 

 

--
Paige Miller
David_M
Calcite | Level 5

OK, thanks for the clarification. So will PLS also generate regression coefficients that map my 35 variables to the Y response in its model?


Discussion stats
  • 20 replies
  • 647 views
  • 11 likes
  • 4 in conversation