david27
Quartz | Level 8

Hello,

 

I am learning Proc Princomp for Principal Component Analysis.

 

I have this code:

ods graphics on;
proc princomp data=sashelp.class PLOTS=SCORE(ELLIPSE NCOMP=3) out=class_out outstat=class_stat;
run;
ods graphics off;

This produces the attached score plot:

Based on the plot, Robert is an outlier because he falls outside the 95% ellipse.

 

How can I get that information in a dataset without referring to the chart?

What option am I missing here?

My class_out dataset does not seem to identify observation #16 (Robert) as an outlier.

 

Please advise.

Thanks

Accepted Solution
PaigeMiller
Diamond | Level 26

@david27 wrote:

Apologies but need to clarify the question.

 

How can I get the outlier information in a dataset?

For example:
Robert (obs=16) falls outside the 95% line. I want a dataset that contains only this observation, or at least identifies it as falling outside the 95% line.

 

Also,

That brings another question:

Can we change that 95% threshold to 99% threshold or 90% threshold?


So, my apologies for my earlier answer being somewhat off target.

 

The ellipse that determines the "outliers" is actually a multivariate T-squared calculation, and isn't hard to get from the PCA outputs, but you have to know the steps.

 

It is easier to get the t-squared values for a PCA analysis using PROC PLS rather than using PROC PRINCOMP.

 

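/* Same variables on both sides of the model, so PLS reproduces the PCA */
proc pls data=sashelp.class;
	model age height weight = age height weight;
	output out=pls_stats tsquare=tsq;   /* tsq = t-squared per observation */
run;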

To understand why this works: a PLS analysis in which the x and y variables in the model are identical produces a PCA. It also produces the t-squared value, which can then be used to determine whether an observation is inside or outside the ellipse.

 

Then you compare t-squared to the ellipse limit, which is computed from the formula T-SQUARED_limit = p(n-1)*F/(n-p), where n is the number of data points (19), p is the number of dimensions (since the ellipses are drawn in two dimensions, I believe SAS used p=2), and F is the value from the F distribution table with p and n-p degrees of freedom.
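Written out as equations, the comparison described above looks roughly like this. The symbols $t_{ik}$ (score of observation $i$ on component $k$), $\lambda_k$ (variance of component $k$), and $\alpha$ (1 minus the confidence level) are my own notation, not something taken from the SAS output:

```latex
T_i^2 = \sum_{k=1}^{p} \frac{t_{ik}^2}{\lambda_k},
\qquad
T^2_{\text{limit}} = \frac{p\,(n-1)}{n-p}\; F_{1-\alpha;\; p,\; n-p}
```

Observation $i$ is treated as outside the ellipse when $T_i^2 > T^2_{\text{limit}}$. For the example here, $p=2$, $n=19$, and $\alpha=0.05$, so the F quantile has 2 and 17 degrees of freedom.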

 

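data _null_;
    p=2;                        /* dimensions of the score plot */
    n=19;                       /* observations in sashelp.class */
    f=finv(0.95,p,n-p);         /* 95th percentile of F(p, n-p) */
    tsq_lim=p*(n-1)*f/(n-p);    /* t-squared limit for the ellipse */
    put tsq_lim= f=;            /* write both values to the log */
run;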

Obviously, if you want a 90% limit instead of a 95% limit, you change the probability in FINV above to 0.90. To draw the ellipses with 90% confidence, you would also change the ALPHA option in PROC PRINCOMP, for example PLOTS=SCORE(ELLIPSE ALPHA=10).
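As a rough sketch of changing the threshold, the code below requests 90% ellipses and prints the matching limits for three confidence levels. It assumes the ALPHA= score-plot option accepts a proportion such as 0.10; the loop variable conf and the choice of levels are only for illustration.

```sas
ods graphics on;

/* Score plots with 90% ellipses (alpha = 0.10) */
proc princomp data=sashelp.class plots=score(ellipse alpha=0.10 ncomp=3);
run;

ods graphics off;

/* t-squared limits for several confidence levels, using the formula above */
data _null_;
    p = 2;                       /* dimensions of one score plot */
    n = 19;                      /* observations in sashelp.class */
    do conf = 0.90, 0.95, 0.99;  /* illustrative confidence levels */
        tsq_lim = p*(n-1)*finv(conf, p, n-p)/(n-p);
        put conf= tsq_lim=;
    end;
run;
```

The limits are written to the log, one line per confidence level.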

 

For completeness, here are the calculations of t-squared from PROC PRINCOMP:

 

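/* Standardize the component scores to mean 0 and variance 1 */
proc stdize data=class_out out=class_out1;
	var prin1-prin3;
run;

/* t-squared is the sum of squares of the standardized scores */
data tsquared;
	set class_out1;
	tsq=uss(of prin1-prin3);
run;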
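To get the outlier information into a dataset, the pieces above can be combined along these lines. This is a minimal sketch, not a definitive recipe: it uses the STD option of PROC PRINCOMP instead of PROC STDIZE, and it sums only Prin1 and Prin2 so that the statistic matches the p=2 used for the limit, which assumes the Prin1 versus Prin2 panel is the plot of interest. The dataset names class_std and class_flag and the variables alpha and outlier are hypothetical.

```sas
/* STD scales each component score in OUT= to unit variance */
proc princomp data=sashelp.class std out=class_std noprint;
    var age height weight;
run;

data class_flag;
    set class_std nobs=n;            /* n = number of observations */
    p = 2;                           /* assumption: Prin1 vs Prin2 panel */
    alpha = 0.05;                    /* 95% limit; use 0.10 or 0.01 as needed */
    tsq = uss(prin1, prin2);         /* t-squared from standardized scores */
    tsq_lim = p*(n-1)*finv(1-alpha, p, n-p)/(n-p);
    outlier = (tsq > tsq_lim);       /* 1 = outside the limit, 0 = inside */
run;

/* Keep only the flagged observations */
proc print data=class_flag;
    where outlier;
    var name prin1 prin2 tsq tsq_lim;
run;
```

The same flag could be computed from the tsq variable in pls_stats, provided the number of PLS factors matches the p used in the limit.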

 

 

 

--
Paige Miller


6 REPLIES
david27
Quartz | Level 8

Apologies but need to clarify the question.

 

How can I get the outlier information in a dataset?

For example:
Robert (obs=16) falls outside the 95% line. I want a dataset that contains only this observation, or at least identifies it as falling outside the 95% line.

 

Also,

That brings another question:

Can we change that 95% threshold to 99% threshold or 90% threshold?

 

Thanks

 

 

 

david27
Quartz | Level 8

Thank You very much @PaigeMiller 

 

You helped me understand PROC PRINCOMP better and also provided an alternative, PROC PLS.

 

Thank You Again...

 

david27
Quartz | Level 8

Hello @PaigeMiller 

 

Coming back to this after a long time.

I had a quick question about your comment: "p is the number of dimensions (since the ellipses are drawn in two dimensions, I believe SAS used p=2)".

Ellipses will always be drawn in 2-D, and that would make p=2 all the time.

If we take, say, 5 variables from sashelp.cars:

HORSEPOWER MPG_HIGHWAY WEIGHT LENGTH WHEELBASE

Do we have

p=2 because ellipses are drawn in 2 dimensions?

OR

p=4 because we are using 5 variables (5-1)?

OR

p=5 because we have 5 dimensions, one per variable?

PaigeMiller
Diamond | Level 26

If the question is "I see the ellipse on a two-dimensional plot and I want to know whether points are outside the ellipse", then p=2.

But if the question is "is this an outlier in 5-dimensional space?" (which is entirely reasonable to ask), then p=5. You can't draw a 5-dimensional plot, but the question is answered the same way, and people will often plot the t-squared value against the limit: not a scatter plot, but more like a trend plot with an upper limit.
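For the 5-variable case, a sketch of that trend-style plot might look like the following. It assumes the five sashelp.cars variables named earlier in the thread have no missing values (so that NOBS= equals the number of scored observations) and reuses the same limit formula with p=5; the dataset names cars_std and cars_tsq and the variables obs and outlier are hypothetical.

```sas
/* All five components, scaled to unit variance by STD */
proc princomp data=sashelp.cars std out=cars_std noprint;
    var horsepower mpg_highway weight length wheelbase;
run;

data cars_tsq;
    set cars_std nobs=n;             /* n = number of observations */
    p = 5;                           /* outlier question asked in 5 dimensions */
    obs = _n_;                       /* observation index for the x axis */
    tsq = uss(of prin1-prin5);       /* t-squared over all five components */
    tsq_lim = p*(n-1)*finv(0.95, p, n-p)/(n-p);
    outlier = (tsq > tsq_lim);       /* 1 = above the limit */
run;

/* t-squared per observation with the upper limit drawn as a dashed line */
proc sgplot data=cars_tsq;
    needle x=obs y=tsq;
    series x=obs y=tsq_lim / lineattrs=(pattern=dash);
run;
```

Observations whose needle rises above the dashed line are the ones flagged with outlier=1 in cars_tsq.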

--
Paige Miller
