☑ This topic is solved.
jky
Fluorite | Level 6

Hi,

 

I hope you're well.

 

I am trying to run PROC PLS on a dataset (described below) with one factor (nfac = 1). I have read that for a single factor the results (including the xweights) should be the same regardless of the algorithm used (NIPALS and SIMPLS should produce the same results). The coefficients and R-squared values match those from open-source libraries in R/Python, but the xweights are slightly different. Does anyone know whether SAS applies any additional rules that open-source packages do not? For example, does SAS use a specific type of SVD or norm? I don't think this is due to whether the data is mean-centered or scaled, as I have carefully checked this.


About my data:

1. Only one response variable

2. 60 independent variables

3. 10 observations

(The data is high-dimensional with very few observations.)

 

Thanks,

 

 

 

Ksharp
Super User
The eigenvalues and eigenvectors generated by SAS, or the results of an SVD, can differ from those in R or other packages.
@Rick_SAS has mentioned this before. Rick might explain it in more detail for you.
jky
Fluorite | Level 6

Hi Ksharp,

 

Thanks for your reply! Interesting. I would like to learn more about how SAS computes the SVD differently from Python or R.

 

Thanks,

Rick_SAS
SAS Super FREQ

> I would like to learn more about how SAS computes the SVD differently from Python or R.

KSharp said that SAS and R compute eigenvectors differently, but what he should have said is that eigenvectors (and singular vectors) are not unique. Even after you standardize the eigenvectors to have unit norm, there is still a non-uniqueness property because if v is an eigenvector then so is -v. For the same problem, SAS might produce one (correct) eigenvector whereas R produces a different (equally correct) eigenvector that has the opposite sign.

 

Eigenvalues (and singular values) are unique, so those will agree up to some number of decimals.

 

And this problem gets worse if there are repeated (non-distinct) eigenvalues, because then the basis for the eigenspace does not have a unique representation.
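The sign ambiguity described above can be checked numerically. The following is a minimal sketch in plain Python; the 2x2 matrix is a hypothetical example chosen only for illustration and has nothing to do with the data in this thread.

```python
import math

# Hypothetical symmetric matrix, used only to illustrate the point.
A = [[2.0, 1.0],
     [1.0, 2.0]]

def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def is_eigenpair(M, lam, v, tol=1e-12):
    """Return True if M v equals lam * v up to a small tolerance."""
    Mv = matvec(M, v)
    return all(abs(a - lam * b) <= tol for a, b in zip(Mv, v))

s = 1.0 / math.sqrt(2.0)
lam = 3.0          # an eigenvalue of A
v = [s, s]         # unit-norm eigenvector for lam
neg_v = [-s, -s]   # the same vector with the opposite sign

print(is_eigenpair(A, lam, v))      # both checks succeed:
print(is_eigenpair(A, lam, neg_v))  # v and -v are equally valid
```

Both vectors have unit norm and satisfy the eigenvector equation, so two packages can legitimately return either one.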

Rick_SAS
SAS Super FREQ

> ... the xweights are slightly different. 

Please post an example. For example, post the first 5 weights from SAS and the same weights from open source. Is the difference that SAS reports the numbers to 6 digits whereas R reports 8 digits? Or do some weights in SAS have the opposite sign from the weights in R? Or something else?

 

Also, please post your SAS code.

 

I encourage you to check the number of PCA components used for each model. Perhaps SAS is basing the model on k PCA components and your R model is using a different number of components.

 

 

jky
Fluorite | Level 6

Hi Rick, 

 

Thanks so much for getting back to me.

My SAS code is shown as below:

proc pls data=regress method=simpls nfac=1;
  /* ODS OUTPUT must appear before RUN so that the XWeights table is captured */
  ods output XWeights=work.pls_xweights;
  model Y = A01-A60;
run;

And here is the comparison of the xweights obtained from SAS and from Python for the first five independent variables:

 

Xweights  T01           T02          T03         T04           T05
SAS       -0.072658     0.172263     0.132138    -0.06494      0.225583
Python    -0.069733319  0.165327533  0.12681782  -0.062325287  0.216501385

 

From a quick look, I don't think the issue is a difference in the number of decimal places between Python and SAS, or a difference in sign. And there is only one factor here, so I don't think their SVDs will differ.

 

 

Thanks,

 

 

 

 

PaigeMiller
Diamond | Level 26

The difference in the weights is just a scaling factor. The SAS value divided by the Python value is always approximately 1.042. SAS and Python obviously have scaled the weights differently, but it makes no difference at all to anything. (Why? Because these weights are multiplied by what SAS calls the "Inner Regression Coefficient" to obtain predicted values, and the scaling of that coefficient is also adjusted to give the proper predictions. Python uses a different scaling there, so the "Inner Regression Coefficients" don't match between Python and SAS, but when the multiplication happens the scaling differences cancel out.)
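The constant ratio can be confirmed directly from the five SAS and Python weights posted earlier in the thread. This is a small sketch in plain Python; the numbers are copied from that post and nothing else is assumed.

```python
# First five weights as posted in the thread (T01..T05).
sas_w = [-0.072658, 0.172263, 0.132138, -0.06494, 0.225583]
py_w = [-0.069733319, 0.165327533, 0.12681782, -0.062325287, 0.216501385]

# Element-wise ratio SAS / Python.
ratios = [s / p for s, p in zip(sas_w, py_w)]

for name, r in zip(["T01", "T02", "T03", "T04", "T05"], ratios):
    print(name, round(r, 4))

# The spread of the ratios is tiny compared with their size,
# which is what a pure scaling difference looks like.
print("spread:", max(ratios) - min(ratios))
```

Every ratio is positive and close to 1.042, so the two vectors point in the same direction and differ only in length.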

--
Paige Miller
jky
Fluorite | Level 6

Thank you so much Rick. Yeah, I agree that it might be because different software scales xweights differently. And yes, the xweight difference doesn't affect the coefficient or R-squared value, but it does make the VIP scores slightly different, which is a little annoying.

Does anyone know how SAS calculates norms? From what I understand, the xweight is obtained from the SVD of X'Y (a vector with one entry per independent variable, since there is only one response). With only one response, the easiest way to compute the SVD is to divide X'Y by its norm. However, I understand (with my limited knowledge of linear algebra) that with only one dimension there may not be a unique solution, so it would be good to know how SAS processes it, if possible (e.g., with the inner regression coefficient).

PaigeMiller
Diamond | Level 26

The VIP scores are computed via the formula in the code here https://support.sas.com/kb/25/009.html

Specifically, you should look at the code for the %GET_VIP macro, the scaling there looks pretty standard. (UPDATE: the scaling is exactly what @Rick_SAS says)


You are claiming "the xweight difference doesn't affect the coefficient or R-squared value, but it does make the VIP scores slightly different". I would like to see an example that isn't just a difference in scaling.

 

 

--
Paige Miller
Rick_SAS
SAS Super FREQ

> Does anyone know how SAS calculates norms? 

Unless the documentation states otherwise, you may assume that a vector norm is the L2 norm (aka, the Euclidean norm). So

||v|| = sqrt(v1^2 + v2^2 + ... + vn^2)

For example, see the L2 norm definition here: SAS Help Center: NORM Function

jky
Fluorite | Level 6

Hi Paige,

I hope you had a good weekend.

 

Yeah, the difference in VIP still lies in the scaling. However, we sometimes interpret a VIP score by comparing it with an absolute value. For example, a general rule of thumb is that VIP > 1 indicates a significant variable that impacts the dependent variable. This causes a problem, because T03 is considered an important variable in SAS but not in Python. Therefore, if possible, it would be helpful to understand how SAS performs the scaling. If this is not feasible, that should be fine, as the difference is small.

 

Variable  SAS_XWEIGHT  Python_XWEIGHT  PYTHON_VIP  SAS_VIP
T01       -0.0727      -0.0697         0.5402      0.5628
T02       0.1723       0.1653          1.2806      1.3343
T03       0.1321       0.1268          0.9823      1.0235
T04       -0.0649      -0.0623         0.4828      0.5030
T05       0.2256       0.2165          1.6770      1.7474
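The VIP columns above can be related to the weights with a short check. Assuming the usual VIP definition, a one-factor model gives VIP_j = sqrt(p) * |w_j| / ||w||, which reduces to sqrt(p) * |w_j| when the weight vector has unit norm. The sketch below uses p = 60 and the more precise weights posted earlier in the thread. It only shows that the posted numbers are consistent with sqrt(p) * |w_j| in both packages; it is not a statement of how either package computes VIP internally.

```python
import math

p = 60  # number of independent variables in the original dataset

# Weights and VIP scores as posted in the thread (T01..T05).
sas_w = [-0.072658, 0.172263, 0.132138, -0.06494, 0.225583]
py_w = [-0.069733319, 0.165327533, 0.12681782, -0.062325287, 0.216501385]
sas_vip_posted = [0.5628, 1.3343, 1.0235, 0.5030, 1.7474]
py_vip_posted = [0.5402, 1.2806, 0.9823, 0.4828, 1.6770]

def vip_from_weights(w, n_vars):
    """One-factor VIP without dividing by the weight norm: sqrt(p) * |w_j|."""
    return [math.sqrt(n_vars) * abs(w_j) for w_j in w]

sas_vip = vip_from_weights(sas_w, p)
py_vip = vip_from_weights(py_w, p)

for name, s, q in zip(["T01", "T02", "T03", "T04", "T05"], sas_vip, py_vip):
    print(name, round(s, 4), round(q, 4))
```

Because this form does not divide by the weight norm, the VIP scores inherit the same factor of about 1.042 as the weights, which is enough to move T03 across the VIP = 1 threshold.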

 

Thanks,

PaigeMiller
Diamond | Level 26

VIP is not unique; it can vary by a scaling constant, so comparing it to an absolute value doesn't make sense. I compare the VIP of a variable to the VIPs of the other variables, with the "biggest" VIPs being the ones I concentrate on.

 

> it would be helpful to understand how SAS performs the scaling

 

I gave you a link with code. @Rick_SAS explained how SAS calculates this.

--
Paige Miller
Rick_SAS
SAS Super FREQ

Perhaps it would be helpful to understand how Python performs the scaling?

jky
Fluorite | Level 6

Hi Rick,


Thank you. Yeah, I have explored how Python calculates the Xweights as well, and it seems that it also uses the Euclidean norm. To clarify, I have created a simple raw dataset and SAS code, and I share the Xweights (the first component of the SVD of X'Y or X'YY'X) below. You will see that the norm of the Xweights in Python is 1 (I'm not sure, but I think this shows that it normalizes with the Euclidean norm?), while the norm of the Xweights in SAS is always around 1.04XXX. (Again, I have applied mean centering and scaling to the data, so that does not explain the difference. It is also not an algorithm difference, because with one factor NIPALS gives the same result as SIMPLS, which I have double-checked. It is not a sign flip either.)

DATA regress;
    INPUT Y X1 X2 X3 X4 X5;
    DATALINES;
7 0 23 3 4 1
8 2 7 2 3 2
2 0 8 8 3 3
6 0 9 2 5 4
5 0 1 5 2 5
;
RUN;

PROC PLS DATA=regress METHOD=SIMPLS NFAC=1 DETAILS VARSS;
    /* ODS OUTPUT must appear before RUN so that the XWeights table is captured */
    ODS OUTPUT XWeights=work.pls_xweights;
    MODEL Y = X1 X2 X3 X4 X5;
RUN;

 

Xweights  X1           X2           X3            X4           X5
SAS       0.487515     0.259817     -0.783895     0.223089     -0.344725
Python    0.467325566  0.249056761  -0.751431332  0.213850196  -0.330449077
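For reference, the Python row above can be reproduced by hand from the toy data. The sketch below is plain Python and assumes the recipe described earlier in the thread: center each column, scale it by its sample standard deviation (divisor n - 1), form X'Y, and divide by its Euclidean norm. It is not a claim about the internals of any particular library.

```python
import math

# Toy data from the thread: each row is (Y, X1, X2, X3, X4, X5).
rows = [
    (7, 0, 23, 3, 4, 1),
    (8, 2, 7, 2, 3, 2),
    (2, 0, 8, 8, 3, 3),
    (6, 0, 9, 2, 5, 4),
    (5, 0, 1, 5, 2, 5),
]

def standardize(col):
    """Center a column and scale it by its sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

y0 = standardize([r[0] for r in rows])
x_cols = [standardize([r[j] for r in rows]) for j in range(1, 6)]

# X'Y has one entry per predictor because there is a single response.
xty = [sum(a * b for a, b in zip(col, y0)) for col in x_cols]

# Dividing by the Euclidean norm gives a unit-length weight vector.
xty_norm = math.sqrt(sum(v * v for v in xty))
w = [v / xty_norm for v in xty]

print([round(v, 6) for v in w])
```

The result agrees with the Python row to the posted precision, which supports the reading that the Python weights are X'Y normalized to unit Euclidean length.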

 

 

PaigeMiller
Diamond | Level 26

Euclidean norm can have a value of 1, or some other value. Any norm can have a value of 1 or some other value.

 

If you divide the weights by the norm, then they should produce a vector with norm of 1. SAS is obviously not dividing the weights by the norm. Python must be dividing the weights by the norm. It's optional whether a PLS program does this or not, because it doesn't affect the predicted values or the model fit. SAS obviously applies the scaling factor later in the algorithm than Python does. So I conclude that SAS and Python are calculating the weights the exact same way (is that what you need to know?) and then Python scales them but SAS doesn't.
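The claim that the weight scaling does not affect the predictions can be illustrated with a one-factor fit written out by hand. The sketch below is plain Python on the toy data from this thread. The function `one_factor_fit` and its `weight_norm` argument are hypothetical names introduced for illustration, and 1.0432 is simply the norm observed for the SAS weights here. It does not claim to reproduce the internals of PROC PLS or of any Python library.

```python
import math

# Toy data from the thread: each row is (Y, X1, X2, X3, X4, X5).
rows = [
    (7, 0, 23, 3, 4, 1),
    (8, 2, 7, 2, 3, 2),
    (2, 0, 8, 8, 3, 3),
    (6, 0, 9, 2, 5, 4),
    (5, 0, 1, 5, 2, 5),
]

def standardize(col):
    """Center a column and scale it by its sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

y0 = standardize([r[0] for r in rows])
x_cols = [standardize([r[j] for r in rows]) for j in range(1, 6)]

def one_factor_fit(weight_norm):
    """One-factor fit whose weight vector has the requested Euclidean norm."""
    xty = [dot(col, y0) for col in x_cols]
    length = math.sqrt(dot(xty, xty))
    w = [weight_norm * v / length for v in xty]           # x-weights
    t = [sum(col[i] * w_j for col, w_j in zip(x_cols, w))
         for i in range(len(y0))]                         # scores t = X w
    b = dot(t, y0) / dot(t, t)                            # inner regression coefficient
    y_hat = [b * t_i for t_i in t]                        # predictions (standardized scale)
    return w, b, y_hat

w_unit, b_unit, yhat_unit = one_factor_fit(1.0)
w_scaled, b_scaled, yhat_scaled = one_factor_fit(1.0432)

print("weight ratio:", w_scaled[0] / w_unit[0])
print("inner coefficient ratio:", b_scaled / b_unit)
print("max prediction difference:",
      max(abs(p - q) for p, q in zip(yhat_unit, yhat_scaled)))
```

Scaling the weights by a constant scales the scores by the same constant and the inner regression coefficient by its reciprocal, so the product that forms the predictions is unchanged.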

 

Here is simple data step code which finds the Euclidean norm of the weights, and then re-scales the weights by dividing by the norm, so that you can see the difference, and how after you do the division, the norm becomes 1.

 

DATA regress;
    INPUT Y X1 X2 X3 X4 X5;
    DATALINES;
7 0 23 3 4 1
8 2 7 2 3 2
2 0 8 8 3 3
6 0 9 2 5 4
5 0 1 5 2 5
;
RUN;

PROC PLS DATA=regress nfac = 1 details varss ;
ods output 
    XWeights = work.pls_xweights;
MODEL Y = X1 X2 X3 X4 X5;
RUN;

data ssq;
    set pls_xweights;
    norm_x=sqrt(uss(of x1-x5));
    y1=x1/norm_x;
    y2=x2/norm_x;
    y3=x3/norm_x;
    y4=x4/norm_x;
    y5=x5/norm_x;
    norm_y=sqrt(uss(of y1-y5));
run;
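The same rescaling can be cross-checked outside SAS using only the weights posted in the thread. This is a small plain-Python sketch; the two vectors are copied from the table in the earlier post and nothing else is assumed.

```python
import math

# Weights for X1..X5 as posted in the thread.
sas_w = [0.487515, 0.259817, -0.783895, 0.223089, -0.344725]
py_w = [0.467325566, 0.249056761, -0.751431332, 0.213850196, -0.330449077]

def euclidean_norm(v):
    """L2 norm: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))

norm_sas = euclidean_norm(sas_w)
norm_py = euclidean_norm(py_w)

# Divide the SAS weights by their own norm, as the DATA step does.
sas_rescaled = [x / norm_sas for x in sas_w]

print("norm of SAS weights:   ", round(norm_sas, 4))
print("norm of Python weights:", round(norm_py, 4))
print("rescaled SAS weights:  ", [round(x, 6) for x in sas_rescaled])
```

After the division, the SAS weights agree with the Python weights to roughly the six digits that SAS printed.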

 

--
Paige Miller
