Season
Lapis Lazuli | Level 10

Hello, I am building a partial least squares (PLS) model with categorical independent variables. I understand that the CLASS statement is useful in telling SAS which of the variables are categorical. But as the theory and SAS Help tell us, all of the variables are centered. Does that rule apply to categorical variables as well? That is, are categorical variables "centered" (via the formula (x - xbar)/std(x)) in the very same way as continuous variables are, without regard to their categorical nature?

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

Yes, that is correct. All regression procedures create a design matrix and then operate on that matrix. Classification variables are converted to dummy variables by using the GLM parameterization. The columns of the design matrix are then (by default) centered and scaled without consideration of how the columns were created.

 

You can see this by modifying an example from the PROC PLS documentation to include a CLASS variable. Run PROC PLS to get the parameter estimates. Then use PROC GLMMOD or some other procedure to generate the design matrix, and use PROC STDIZE to center and scale its columns. If you then use the columns of the design matrix in PROC PLS, you will get the same parameter estimates:

 

/* example from the PROC PLS doc, but add a new CLASS variable, C */
data pentaTrain;
   input obsnam $ S1 L1 P1 S2 L2 P2
                  S3 L3 P3 S4 L4 P4
                  S5 L5 P5  log_RAI @@;
   n = _n_;
   call streaminit(123);
   C = rand("Table", 0.4, 0.3, 0.3);
   if C=2 then log_RAI = log( exp(log_RAI)+2 );
   else if C=3 then log_RAI = log( exp(log_RAI)+3 );
   datalines;
VESSK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          1.9607 -1.6324  0.5746  1.9607 -1.6324  0.5746
          2.8369  1.4092 -3.1398                    0.00
VESAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          1.9607 -1.6324  0.5746  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.28
VEASK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  1.9607 -1.6324  0.5746
          2.8369  1.4092 -3.1398                    0.20
VEAAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.51
VKAAK    -2.6931 -2.5271 -1.2871  2.8369  1.4092 -3.1398
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.11
VEWAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         -4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    2.73
VEAAP    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
         -1.2201  0.8829  2.2253                    0.18
VEHAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          2.4064  1.7438  1.1057  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    1.53
VAAAK    -2.6931 -2.5271 -1.2871  0.0744 -1.7333  0.0902
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                   -0.10
GEAAK     2.2261 -5.3648  0.3049  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                   -0.52
LEAAK    -4.1921 -1.0285 -0.9801  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.40
FEAAK    -4.9217  1.2977  0.4473  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.30
VEGGK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          2.2261 -5.3648  0.3049  2.2261 -5.3648  0.3049
          2.8369  1.4092 -3.1398                   -1.00
VEFAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         -4.9217  1.2977  0.4473  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    1.57
VELAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         -4.1921 -1.0285 -0.9801  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.59
;

/* run PROC PLS by using the CLASS statement */
proc pls data=pentaTrain;
   class C;
   model log_RAI = C S1-S5 L1-L5 P1-P5 / solution;
   ods select CenScaleParms ParameterEstimates;
run;

/* generate the design matrix */
proc glmmod data=pentaTrain outdesign=Design;
   class C;
   model log_RAI = C S1-S5 L1-L5 P1-P5 / noint;
run;

/* center and scale the design matrix */
proc stdize data=Design out=StdDesign;
run;

/* rerun PROC PLS on the columns of the design matrix */
proc pls data=StdDesign;
   model log_RAI = Col: / solution;
   ods select CenScaleParms ParameterEstimates;
run;


19 REPLIES
PaigeMiller
Diamond | Level 26

@Season wrote:

Hello, I am building a partial least squares (PLS) model with categorical independent variables. I understand that the CLASS statement is useful in telling SAS which of the variables are categorical. But as the theory and SAS Help tell us, all of the variables are centered. Does that rule apply to categorical variables as well? That is, are categorical variables "centered" (via the formula (x - xbar)/std(x)) in the very same way as continuous variables are, without regard to their categorical nature?


I don't think there is any such thing as centering or standardizing the categorical variables themselves. Behind the scenes, PROC PLS performs that kind of operation on dummy variables, but you shouldn't have to worry about this for categorical variables; all conversion to dummy variables, and all handling of them, is done for you behind the scenes, so you don't have to do it yourself. This is one of the benefits of the way the CLASS statement operates in many SAS procedures. Any results (predicted values, loadings, regression coefficients, etc.) from PROC PLS are reported in terms of the actual categorical values.

--
Paige Miller
Season
Lapis Lazuli | Level 10

Thank you, Paige, for your prompt answer! I have learnt from SAS Help about the automatic coding of dummy variables by the CLASS statement in various PLS procedures. As I stated, my question concerns the centering of the variables. It seems strange to treat categorical variables as continuous and to calculate their means and standard deviations in the very same way. But that seems to be the case in PLS.

Of course, centering is not at all forbidden from a computational perspective, as we can treat categorical predictors as "continuous predictors taking a finite set of values" in this computation. I just thought it a bit strange, so I came here to see whether my understanding of PLS was correct.

Anyway, thank you for your reply!

PaigeMiller
Diamond | Level 26

@Season wrote:

It seems strange to treat categorical variables as continuous and to calculate their means and standard deviations in the very same way. But that seems to be the case in PLS.


But this isn't correct. PROC PLS (like every SAS procedure I know of that models categorical variables) does not compute means and standard deviations of the categorical variables themselves; those quantities do not exist. SAS converts the categorical variables to 0/1 dummy variables, which are numeric, and those do have means and standard deviations and can be centered and scaled.
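For example, here is a minimal sketch (my own illustration, not a view of PROC PLS internals) showing that hand-coded 0/1 dummy columns are ordinary numeric variables whose means and standard deviations PROC MEANS computes like any others:

data have;
   input group $ y @@;
   datalines;
A 1.2 B 3.4 A 2.2 C 5.1 B 4.0 C 4.8
;

data dummies;
   set have;
   /* GLM-style dummy columns, one per level of GROUP */
   gA = (group = "A");
   gB = (group = "B");
   gC = (group = "C");
run;

/* the dummy columns have means and standard deviations like any numeric variable */
proc means data=dummies mean std;
   var gA gB gC;
run;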

 

And it's not just PLS where this is done. Any SAS modeling procedure with a CLASS statement will do this.

--
Paige Miller
Season
Lapis Lazuli | Level 10

OK, I see. In fact, the "mean" of a categorical variable would be termed its "mathematical expectation", so in that sense it does exist, though not in the same manner as for continuous variables. I agree with you that the standard deviation of a categorical variable does not exist.

Let me adjust my previous reply, then: I think it strange to center dummy variables, as they are also discrete.

Season
Lapis Lazuli | Level 10
I would like to further ask about the residuals of a linear PLS model. I noticed that SAS can produce residual plots upon request. Since PLS does not alter the linear nature of the model, does the requirement that the residuals for each dependent variable satisfy the Gauss-Markov assumptions still apply?
PaigeMiller
Diamond | Level 26

As far as I know, the Gauss-Markov assumption never applied to PLS.

--
Paige Miller
Season
Lapis Lazuli | Level 10

OK, thank you very much for your reply! So here is a rather open question: what should we care about in the residuals of a linear PLS model? If the Gauss-Markov assumptions are not required in PLS, I cannot see any reason to look at the residual plots.

Or, to ask in a more closed way: are there any requirements on the residuals of PLS? As far as I know, there seem to be none.

PaigeMiller
Diamond | Level 26

Residual plots in linear regression and in PLS (and really in all modeling that I am aware of) indicate problems in the data, such as curvature (or an otherwise misspecified model), clustering, and outliers. They are diagnostic. In some cases (e.g., linear regression) they also diagnose deviation from the assumption of iid normally distributed residuals. Residuals are also a way of estimating how good or bad the model fit is: larger residuals en masse indicate a worse fit than smaller residuals en masse.

 

There are no requirements or assumptions on the residuals for PLS, but because they are diagnostic, obviously if you find curvature or clustering or outliers, you may want to change the model and/or remove the outliers and re-fit (and possibly apply other remedies as well).
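For example, a minimal sketch of requesting such plots from PROC PLS, reusing the pentaTrain data from the accepted solution (the PLOTS= suboptions are as I recall them from the documentation, so verify them in your release):

ods graphics on;
proc pls data=pentaTrain plots=(diagnostics residuals);
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;
ods graphics off;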

--
Paige Miller
Season
Lapis Lazuli | Level 10

Thank you so much, Paige, for your prompt and patient replies! Your expertise has saved me much time that would otherwise have been spent reading the literature.

It's a pity that the SAS Community only supports accepting a single reply as the solution.

Thank you again!😀

PaigeMiller
Diamond | Level 26

I feel like I should add that PLS provides residuals in the X-direction and residuals in the Y-direction. If there are multiple Y variables, you can test (visually or otherwise) whether an observation is a multivariate outlier. It may be that some observations are univariate outliers in the Y variable(s) but not multivariate outliers, while other observations are not univariate outliers in the Y variables but are multivariate outliers in the Y variables. Similarly, when there are multiple X variables, the analogous situation holds regarding multivariate and univariate outliers. The residual plots will detect these (for X, PROC PLS calls the relevant multivariate-outlier statistic STDXSSE; for Y, STDYSSE).

I realize that for some people this is unintuitive and hard to understand, but it is another of PLS's great strengths.
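As a minimal sketch (my own, again reusing the pentaTrain example from the accepted solution), you can write these statistics to a data set with the OUTPUT statement and plot them; observations far from the rest of the point cloud are candidate multivariate outliers:

proc pls data=pentaTrain;
   model log_RAI = S1-S5 L1-L5 P1-P5;
   /* standardized X and Y residual sums of squares, one value per observation */
   output out=outlierCheck stdxsse=sxsse stdysse=sysse;
run;

proc sgplot data=outlierCheck;
   scatter x=sxsse y=sysse;
run;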

--
Paige Miller
Season
Lapis Lazuli | Level 10

@PaigeMiller wrote:

I feel like I should add that PLS provides residuals in the X-direction and residuals in the Y-direction. If there are multiple Y variables, you can test (visually or otherwise) whether an observation is a multivariate outlier. It may be that some observations are univariate outliers in the Y variable(s) but not multivariate outliers, while other observations are not univariate outliers in the Y variables but are multivariate outliers in the Y variables. Similarly, when there are multiple X variables, the analogous situation holds regarding multivariate and univariate outliers. The residual plots will detect these (for X, PROC PLS calls the relevant multivariate-outlier statistic STDXSSE; for Y, STDYSSE).

I realize that for some people this is unintuitive and hard to understand, but it is another of PLS's great strengths.


I generally agree with your opinion. But I think the concepts of residuals and outliers are all dependent on pairs of Xs and Ys. That is, I don't think there is such a thing as a univariate or multivariate Y outlier unconditional on X. So, when we say an observation is a multivariate X outlier, we have to point out on which dimension of Y it is an outlier.

I would like to further consult you on the distribution of the regression coefficient estimates in a linear PLS model. Do they follow a normal distribution?

I ask this question because I am building a PLS model with missing data. I need to pool the regression coefficients after fitting the model in each imputed sample. If the regression coefficients did not follow a normal distribution, they could not be pooled directly.

Thank you!

PaigeMiller
Diamond | Level 26

@Season wrote:

@PaigeMiller wrote:

I feel like I should add that PLS provides residuals in the X-direction and residuals in the Y-direction. If there are multiple Y variables, you can test (visually or otherwise) whether an observation is a multivariate outlier. It may be that some observations are univariate outliers in the Y variable(s) but not multivariate outliers, while other observations are not univariate outliers in the Y variables but are multivariate outliers in the Y variables. Similarly, when there are multiple X variables, the analogous situation holds regarding multivariate and univariate outliers. The residual plots will detect these (for X, PROC PLS calls the relevant multivariate-outlier statistic STDXSSE; for Y, STDYSSE).

I realize that for some people this is unintuitive and hard to understand, but it is another of PLS's great strengths.


I generally agree with your opinion. But I think the concepts of residuals and outliers are all dependent on pairs of Xs and Ys. That is, I don't think there is such a thing as a univariate or multivariate Y outlier unconditional on X. So, when we say an observation is a multivariate X outlier, we have to point out on which dimension of Y it is an outlier.


I disagree; a data point can be an outlier in the X direction regardless of which Y (if any) it is predictive of. Similarly, a data point can be an outlier in the Y direction regardless of which X might predict it (if any).

 

I would like to further consult you on the distribution of the regression coefficient estimates in a linear PLS model. Do they follow a normal distribution?

 

There are no distributional assumptions involved in Partial Least Squares regression. Any test of significance is done via bootstrapping or a similar method.

 

I ask this question because I am building a PLS model with missing data.

 

PLS with missing data can be handled in PROC PLS by using the EM algorithm (option MISSING=EM) or by replacing each missing value with the average of the non-missing values (option MISSING=AVG).
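For example (a minimal sketch reusing the pentaTrain data from the accepted solution):

proc pls data=pentaTrain missing=em;   /* or MISSING=AVG */
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;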

--
Paige Miller
Season
Lapis Lazuli | Level 10

@PaigeMiller wrote:

@Season wrote:

@PaigeMiller wrote:

I feel like I should add that PLS provides residuals in the X-direction and residuals in the Y-direction. If there are multiple Y variables, you can test (visually or otherwise) whether an observation is a multivariate outlier. It may be that some observations are univariate outliers in the Y variable(s) but not multivariate outliers, while other observations are not univariate outliers in the Y variables but are multivariate outliers in the Y variables. Similarly, when there are multiple X variables, the analogous situation holds regarding multivariate and univariate outliers. The residual plots will detect these (for X, PROC PLS calls the relevant multivariate-outlier statistic STDXSSE; for Y, STDYSSE).

I realize that for some people this is unintuitive and hard to understand, but it is another of PLS's great strengths.


I generally agree with your opinion. But I think the concepts of residuals and outliers are all dependent on pairs of Xs and Ys. That is, I don't think there is such a thing as a univariate or multivariate Y outlier unconditional on X. So, when we say an observation is a multivariate X outlier, we have to point out on which dimension of Y it is an outlier.


I disagree; a data point can be an outlier in the X direction regardless of which Y (if any) it is predictive of. Similarly, a data point can be an outlier in the Y direction regardless of which X might predict it (if any).

Yes, the phenomena you mention surely exist in reality. When I was typing my reply a few hours ago, I thought of these circumstances and concluded that they could be termed "univariate outliers of a certain X on all dimensions of Y". The dimensions of Y on which the observation's X value is an outlier are specified anyway. I thought it was merely a matter of how such phenomena are expressed in words, so I did not spell it out in my previous reply.


@PaigeMiller wrote:

I would like to further consult you on the distribution of the regression coefficient estimates in a linear PLS model. Do they follow a normal distribution?

 

There are no distributional assumptions involved in Partial Least Squares regression. Any test of significance is done via bootstrapping or a similar method.

 

I ask this question because I am building a PLS model with missing data.

 

PLS with missing data can be handled in PROC PLS by using the EM algorithm (option MISSING=EM) or by replacing each missing value with the average of the non-missing values (option MISSING=AVG).


Thank you for your information as well as the tip on PROC PLS! To the best of my knowledge, the EM and mean-imputation techniques are inferior to more advanced ones like multiple imputation with chained equations (MICE), which can be requested via the FCS statement in PROC MI. So generating the imputed samples is not a problem. Yet, as I pointed out, the final pooling step deserves a second thought about the distribution of the regression coefficient estimates. Since PLS is distribution-free, and the estimates hence may not follow a normal distribution, I think it would be safer to resort to Box-Cox transformations before I pool them.
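For concreteness, here is a minimal sketch of the workflow I have in mind (the data set myData and the variables x1-x5 and y are hypothetical placeholders, and the pooling step is deliberately left open, pending the distribution question above):

/* impute with chained equations (FCS) in PROC MI */
proc mi data=myData nimpute=20 out=miOut seed=123;
   fcs;
   var x1-x5 y;
run;

/* fit the PLS model within each imputed sample */
proc pls data=miOut;
   by _Imputation_;
   model y = x1-x5 / solution;
   ods output ParameterEstimates=plsEst;   /* coefficients awaiting pooling */
run;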

I realized that I missed two questions: (1) Are the standard errors of the regression coefficient estimates of linear PLS models also computed by nonparametric methods like the bootstrap, or is there a formula? (2) Could you recommend articles on computing the standard errors and/or confidence intervals of linear PLS models?

Many thanks!

PaigeMiller
Diamond | Level 26

There is no "formula" for bootstrap confidence intervals; rather there is an algorithm which relies on iteration (and usually requires computer programming to achieve, as at least in PROC PLS, there is no bootstrapping built in). See https://blogs.sas.com/content/iml/tag/bootstrap-and-resampling

 

I am not aware of articles that compute standard errors or confidence intervals for PLS models. In fact, almost every published use of PLS I have seen ignores this entirely; if the model cross-validation says that the entire model is statistically significant, then the rest of the questions you ask are essentially ignored. PLS was developed outside the statistical community, and now that the statistical community has begun using PLS, they ask these questions, but I don't think there are answers other than the bootstrap and similar methods.

 

In the special case of logistic PLS, where the algorithm actually fits many univariate logistic regressions, you can get the chi-squared test result (p-value and/or confidence interval) for each coefficient in the loading vectors, and also for the overall regression coefficient.

--
Paige Miller
