Solved: Re: Way of handling categorical independent variables in partial least... - Page 2

Rick_SAS · Posted 10-18-2023 11:28 AM

Yes, that is correct. All regression problems create a design matrix and then operate on that matrix. Classification variables are converted to dummy variables by using the GLM parameterization. The columns of the design matrix are then (by default) centered and scaled without consideration of how the columns were created.

You can see this by modifying an example from the doc to include a CLASS variable. Run PROC PLS to get the parameter estimates. Then use PROC GLMMOD or some other procedure to generate the design matrix, and use PROC STDIZE to center and scale the columns. If you then use the columns of the design matrix in PROC PLS, you will get the same parameter estimates:

/* example from PROC PLS doc, but add new CLASS variable, C */
data pentaTrain;
   input obsnam $ S1 L1 P1 S2 L2 P2
                  S3 L3 P3 S4 L4 P4
                  S5 L5 P5  log_RAI @@;
   n = _n_;
   call streaminit(123);
   C = rand("Table", 0.4, 0.3, 0.3);
   if C=2 then log_RAI = log( exp(log_RAI)+2 );
   else if C=3 then log_RAI = log( exp(log_RAI)+3 );
   datalines;
VESSK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          1.9607 -1.6324  0.5746  1.9607 -1.6324  0.5746
          2.8369  1.4092 -3.1398                    0.00
VESAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          1.9607 -1.6324  0.5746  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.28
VEASK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  1.9607 -1.6324  0.5746
          2.8369  1.4092 -3.1398                    0.20
VEAAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.51
VKAAK    -2.6931 -2.5271 -1.2871  2.8369  1.4092 -3.1398
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.11
VEWAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         -4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    2.73
VEAAP    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
         -1.2201  0.8829  2.2253                    0.18
VEHAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          2.4064  1.7438  1.1057  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    1.53
VAAAK    -2.6931 -2.5271 -1.2871  0.0744 -1.7333  0.0902
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                   -0.10
GEAAK     2.2261 -5.3648  0.3049  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                   -0.52
LEAAK    -4.1921 -1.0285 -0.9801  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.40
FEAAK    -4.9217  1.2977  0.4473  3.0777  0.3891 -0.0701
          0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.30
VEGGK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
          2.2261 -5.3648  0.3049  2.2261 -5.3648  0.3049
          2.8369  1.4092 -3.1398                   -1.00
VEFAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         -4.9217  1.2977  0.4473  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    1.57
VELAK    -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         -4.1921 -1.0285 -0.9801  0.0744 -1.7333  0.0902
          2.8369  1.4092 -3.1398                    0.59
;

/* run PROC PLS by using the CLASS stmt */
proc pls data=pentaTrain;
   class C;
   model log_RAI = C S1-S5 L1-L5 P1-P5 / solution;
   ods select CenScaleParms ParameterEstimates;
run;

/* generate the design matrix */
proc glmmod data=pentaTrain outdesign=Design;
   class C;
   model log_RAI = C S1-S5 L1-L5 P1-P5 / noint;
run;

/* center and scale the design matrix */
proc stdize data=Design out=StdDesign;
run;

/* rerun PROC PLS on the columns of the design matrix */
proc pls data=StdDesign;
   model log_RAI = Col: / solution;
   ods select CenScaleParms ParameterEstimates;
run;
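
As a quick sanity check on the standardization step (a sketch, not part of the original example), PROC MEANS can confirm that every column of the standardized design matrix has mean 0 and standard deviation 1. It assumes the StdDesign data set created above and relies on PROC STDIZE using its default METHOD=STD:

```sas
/* sketch: confirm each standardized design column has mean 0 and std dev 1 */
proc means data=StdDesign mean std maxdec=4;
   var Col:;
run;
```

The first three columns (Col1-Col3) are the dummy variables for the levels of C, so the output shows that they were centered and scaled just like the continuous predictors.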

Season · Posted 10-18-2023 11:45 AM

Hello, Rick. Thank you for your prompt reply as well! I am also sincerely grateful for the time and effort you spent putting together the example code.

I would like to ask a follow-up question about the residuals in linear PLS modeling. I noticed that SAS can produce residual plots upon request. Since PLS does not alter the linear nature of the model, does the requirement that the residuals for each dependent variable satisfy the Gauss-Markov assumptions still apply?

Rick_SAS · Posted 10-18-2023 01:11 PM

I think Paige has provided an excellent response to this question. The only thing I would add is that the magnitude of the residuals provides an estimate of the deviations between the data and the model. Residuals can also help to identify observations that do not fit the model (outliers). Neither of those uses requires distributional assumptions.
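
To make the two uses concrete, here is a minimal sketch that saves the residuals and plots them against the predicted values. It assumes the pentaTrain data set from the earlier example in this thread; the names PLSOut, Pred, and Resid are made up for illustration:

```sas
/* sketch: save predicted values and residuals from the PLS fit */
proc pls data=pentaTrain;
   class C;
   model log_RAI = C S1-S5 L1-L5 P1-P5;
   output out=PLSOut predicted=Pred yresidual=Resid;
run;

/* residuals vs. predicted values; label each point by observation name */
proc sgplot data=PLSOut;
   scatter x=Pred y=Resid / datalabel=obsnam;
   refline 0 / axis=y;
run;
```

Points that lie far from the reference line at 0 are candidates for observations that the model does not fit well. Reading the plot this way does not rely on any distributional assumption.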

Rick_SAS · Posted 10-18-2023 01:32 PM

Regarding the standardization of categorical variables, I wrote down some of my thoughts about standardizing CLASS variables in regression models. See the section of the article, "Interpretation of standardized coefficients for categorical variables." Spoiler: In most cases, I don't think you should do it.
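
One possible way to act on this in PROC PLS, offered only as a sketch and not necessarily what the article recommends: standardize the continuous variables yourself and then use the NOSCALE option so that the procedure centers the design columns but does not rescale the dummy variables. The data set name pentaStd is made up for illustration, and the code assumes the pentaTrain data from the earlier example:

```sas
/* sketch: standardize only the continuous predictors and the response */
proc stdize data=pentaTrain out=pentaStd method=std;
   var S1-S5 L1-L5 P1-P5 log_RAI;
run;

/* NOSCALE suppresses scaling of all columns; centering is still applied */
proc pls data=pentaStd noscale;
   class C;
   model log_RAI = C S1-S5 L1-L5 P1-P5 / solution;
run;
```

Because the dummy columns are no longer divided by their standard deviations, their relative weight in the extracted factors changes, so the estimates from this fit are not expected to match the default analysis.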

Season · Posted 10-21-2023 12:28 AM

Thank you, Rick, for your reply and for the link to your article!

I would also like to ask about the distribution of the regression coefficient estimates in a linear PLS model. Do they follow a normal distribution?

I ask this question because I am building a PLS model with missing data. I need to pool the regression coefficients after fitting the model in each imputed sample. If the regression coefficients do not follow a normal distribution, then they cannot be pooled directly.

Thank you!

Re: Way of handling categorical independent variables in partial least squares
