Hello,
I am learning Proc Princomp for Principal Component Analysis.
I have this code:
ods graphics on;
proc princomp data=sashelp.class PLOTS=SCORE(ELLIPSE NCOMP=3) out=class_out outstat=class_stat;
run;
ods graphics off;
This produces the attached score plot.
Based on the plot, Robert is an outlier because he falls outside the 95% ellipse.
How can I get that information in a dataset without referring to the chart?
What option am I missing here?
My class_out dataset does not seem to identify observation #16 (Robert) as an outlier.
Please advise.
Thanks
@david27 wrote:
Apologies, but I need to clarify the question.
How can I get the outlier information in a dataset?
For example:
Robert (obs=16) falls outside the 95% line. I want a dataset that contains only this observation, or at least identifies this observation as falling outside the 95% line.
Also,
That brings another question:
Can we change that 95% threshold to 99% threshold or 90% threshold?
So, my apologies for my earlier answer being somewhat off target.
The ellipse that determines the "outliers" is actually a multivariate T-squared calculation, and isn't hard to get from the PCA outputs, but you have to know the steps.
It is easier to get the t-squared values for a PCA analysis using PROC PLS rather than using PROC PRINCOMP.
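proc pls data=sashelp.class;
/* the same variables appear on both sides of the model */
model age height weight = age height weight;
/* tsquare= writes each observation's t-squared value to the output data set */
output out=pls_stats tsquare=tsq;
run;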
The reason this works: a PLS analysis in which the x and y variables in the model are identical produces a PCA. PROC PLS also outputs the t-squared value, which can then be used to determine whether an observation is inside or outside the ellipse.
Then you compare t-squared to the ellipse limit, which is computed as T-squared limit = p*(n-1)*F/(n-p), where n is the number of data points (19), p is the number of dimensions (since the ellipses are drawn in two dimensions, I believe SAS used p=2), and F is the 95th percentile of the F distribution with p and n-p degrees of freedom.
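data _null_;
p=2; /* number of dimensions */
n=19; /* number of observations in sashelp.class */
f=finv(0.95,p,n-p); /* 95th percentile of F with p and n-p degrees of freedom */
tsq_lim=p*(n-1)*f/(n-p); /* t-squared limit */
put tsq_lim= f=; /* write both values to the log */
run;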
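To get this into a dataset rather than the log, the comparison can be done in a DATA step. The following is only a sketch under stated assumptions: NFAC=2 is my addition so that t-squared is based on two factors to match p=2, and the names flagged, outliers_only, and outlier are made up for illustration. It uses the same limit formula as above and has not been checked against the ellipse drawn by PROC PRINCOMP.

```sas
/* Sketch: t-squared from PROC PLS, limited to two factors (assumption: nfac=2 matches p=2) */
proc pls data=sashelp.class nfac=2;
   model age height weight = age height weight;
   output out=pls_stats tsquare=tsq;
run;

/* flagged       = every observation plus a 0/1 indicator (hypothetical name) */
/* outliers_only = only the observations above the limit (hypothetical name)  */
data flagged outliers_only;
   set pls_stats;
   p = 2;                                       /* dimensions of the ellipse   */
   n = 19;                                      /* observations in the data    */
   tsq_lim = p*(n-1)*finv(0.95, p, n-p)/(n-p);  /* same limit formula as above */
   outlier = (tsq > tsq_lim);                   /* 1 if above the limit        */
   output flagged;
   if outlier then output outliers_only;
run;
```

Changing 0.95 in FINV changes the threshold for the flag in the same way it changes the limit.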
If you want a 90% limit instead of a 95% limit, change the first argument of FINV above to 0.90. To draw the ellipses with 90% confidence, set the ALPHA= option in PROC PRINCOMP, which takes a value between 0 and 1: PLOTS=SCORE(ELLIPSE ALPHA=0.10).
For completeness, here are the calculations of t-squared from PROC PRINCOMP:
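/* standardize each component score to mean 0 and standard deviation 1 */
proc stdize data=class_out out=class_out1;
var prin1-prin3;
run;
data tsquared;
set class_out1;
tsq=uss(of prin1-prin3); /* t-squared = sum of squared standardized scores */
run;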
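The calculation above uses all three components. To mirror a single two-dimensional score plot, the same idea can be restricted to two components. This is a sketch, assuming the Prin1 versus Prin2 plot is the one of interest and reusing the limit formula given earlier; class_flag, tsq12, and outside are illustrative names, not anything PROC PRINCOMP produces.

```sas
/* Scores for all components (Age, Height, Weight give prin1-prin3) */
proc princomp data=sashelp.class out=class_out noprint;
run;

/* Standardize the scores so each has mean 0 and standard deviation 1 */
proc stdize data=class_out out=class_out1;
   var prin1-prin3;
run;

data class_flag;                                /* hypothetical data set name  */
   set class_out1;
   p = 2;                                       /* two components in the plot  */
   n = 19;                                      /* observations in the data    */
   tsq12   = uss(prin1, prin2);                 /* t-squared from prin1, prin2 */
   tsq_lim = p*(n-1)*finv(0.95, p, n-p)/(n-p);  /* limit formula from above    */
   outside = (tsq12 > tsq_lim);                 /* 1 if outside the limit      */
run;

/* List only the observations flagged as outside the limit */
proc print data=class_flag;
   where outside = 1;
   var name tsq12 tsq_lim;
run;
```

With NCOMP=3 the procedure draws a score plot for each pair of components, so the other pairs would need the same check with their own two scores.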
Anything in the PROC PRINCOMP (or any other PROC) output can be included in a SAS data set, using ODS OUTPUT.
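As a minimal sketch of that approach: ODS TRACE lists the table names in the log, and ODS OUTPUT saves the ones you name. Eigenvalues and Eigenvectors are table names I understand PROC PRINCOMP to produce; eig_table and evec_table are made-up data set names. Note that these tables hold the PCA results themselves, not an outlier flag.

```sas
ods trace on;    /* the log lists the name of every ODS table produced */
ods output Eigenvalues=eig_table Eigenvectors=evec_table;
proc princomp data=sashelp.class;
run;
ods trace off;
```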
Thank You very much @PaigeMiller
You helped me understand Proc Princomp better and also provided an alternative, proc pls.
Thank You Again...
Hello @PaigeMiller
So, coming back to this after a long time.
I had a quick question on your comment: "p is the number of dimensions (since the ellipses are drawn in two dimensions, I believe SAS used p=2".
Ellipses will always be drawn in 2-D, and that would make p=2 all the time.
Say we take 5 variables from sashelp.cars for our analysis:
HORSEPOWER MPG_HIGHWAY WEIGHT LENGTH WHEELBASE
Do we have
p=2 because ellipses are drawn in 2 dimensions?
OR
p=4 because we are using 5 variables (5-1)?
OR
p=5 because we have 5 dimensions, one per variable?
If the question is "I see the ellipse on a two-dimensional plot and I want to know whether points are outside the ellipse", then p=2.
But if the question is whether an observation is an outlier in 5-dimensional space (which is entirely reasonable to ask), then p=5. You can't draw a 5-dimensional plot, but the question is answered the same way, and people will often plot the t-squared value against the limit. This is not a scatter plot; it is more like a trend plot with an upper limit.
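For the five-variable case, a sketch of that trend plot might look like the following. It assumes none of the five variables has missing values, so that the number of rows equals n, and it reuses the limit formula from earlier in the thread with p=5. The names cars_out, cars_std, cars_tsq, obs, and outlier are illustrative.

```sas
/* Scores for all five components */
proc princomp data=sashelp.cars out=cars_out noprint;
   var horsepower mpg_highway weight length wheelbase;
run;

/* Standardize the scores so each has mean 0 and standard deviation 1 */
proc stdize data=cars_out out=cars_std;
   var prin1-prin5;
run;

data cars_tsq;
   set cars_std nobs=n;                         /* n = number of rows (assumes no missing values) */
   p = 5;                                       /* all five dimensions                            */
   obs = _n_;                                   /* observation number for the x axis              */
   tsq = uss(of prin1-prin5);                   /* t-squared over all components                  */
   tsq_lim = p*(n-1)*finv(0.95, p, n-p)/(n-p);  /* same limit formula, with p=5                   */
   outlier = (tsq > tsq_lim);
run;

/* Trend plot: t-squared by observation, with the limit drawn as a flat line */
proc sgplot data=cars_tsq;
   series x=obs y=tsq;
   series x=obs y=tsq_lim;
run;
```

Observations whose t-squared line rises above the flat limit line are the ones flagged with outlier=1 in cars_tsq.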