david27
Quartz | Level 8

Hello,

 

I am learning Proc Princomp for Principal Component Analysis.

 

I have this code:

ods graphics on;
proc princomp data=sashelp.class PLOTS=SCORE(ELLIPSE NCOMP=3) out=class_out outstat=class_stat;
run;
ods graphics off;

Which produces the attached output in chart:

Based on the chart, Robert is an outlier because his point falls outside the 95% ellipse.

 

How can I get that information in a dataset without referring to the chart?

What option am I missing here?

My class_out dataset does not seem to identify observation #16 (Robert) as an outlier.

 

Please advise.

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

@david27 wrote:

Apologies but need to clarify the question.

 

How can I get the outlier information in a dataset?

For example:
Robert (obs=16) falls outside the 95% line. I want a dataset which has only this observation or at least identifies this observation as falling outside the 95% line.

 

Also,

That brings another question:

Can we change that 95% threshold to 99% threshold or 90% threshold?


So, my apologies for my earlier answer being somewhat off target.

 

The ellipse that determines the "outliers" is actually a multivariate T-squared calculation, and isn't hard to get from the PCA outputs, but you have to know the steps.

 

It is easier to get the t-squared values for a PCA analysis using PROC PLS rather than using PROC PRINCOMP.

 

proc pls data=sashelp.class;
	model age height weight = age height weight;
	output out=pls_stats tsquare=tsq;
run;

To understand this: a PLS analysis in which the x and y variables in the model are identical produces a PCA analysis! It also produces the t-squared value, which can then be used to determine whether the observation is inside or outside the ellipse.

 

Then you compare t-squared to the ellipse limit, which is computed from the formula T-squared limit = p*(n-1)*F/(n-p), where n is the number of data points (19), p is the number of dimensions (since the ellipses are drawn in two dimensions, I believe SAS used p=2), and F is the value from the F distribution table with p and n-p degrees of freedom.

 

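data _null_;
    p=2;                          /* dimensions of the plotted ellipse */
    n=19;                         /* observations in sashelp.class */
    f=finv(0.95,p,n-p);           /* 95th percentile of F with p and n-p df */
    tsq_lim=p*(n-1)*f/(n-p);      /* t-squared limit from the formula above */
    put tsq_lim= f=;              /* write both values to the log */
run;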

If you want a 90% limit rather than a 95% limit, change the first argument of FINV above from 0.95 to 0.90. To draw the ellipses with 90% confidence, you would change the alpha option in PROC PRINCOMP: PLOTS=(SCORE(ALPHA=10)).

 

For completeness, here are the calculations of t-squared from PROC PRINCOMP:

 

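/* Standardize the component scores (PROC STDIZE defaults to mean 0, std 1) */
proc stdize data=class_out out=class_out1;
	var prin1-prin3;
run;

/* t-squared = sum of squared standardized scores; this uses all three   */
/* components, so list only two of them to match a limit computed at p=2 */
data tsquared;
	set class_out1;
	tsq=uss(of prin1-prin3);
run;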
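
Putting the pieces above together, here is a minimal sketch of how the flag itself could be written to a dataset. It assumes class_out from the PROC PRINCOMP step in the question still exists, and it checks only the Prin1/Prin2 pair (with NCOMP=3 there are other score plots as well). The names class_std, class_flag, alpha and outlier are made up for illustration. I have not confirmed that this limit reproduces exactly the ellipse PROC PRINCOMP draws, so treat the result as an approximation of what the chart shows.

```sas
/* Standardize the two component scores being compared to the ellipse */
proc stdize data=class_out out=class_std method=std;
    var prin1 prin2;
run;

data class_flag;
    set class_std nobs=n;          /* n = number of observations (assumes no missing scores) */
    p = 2;                         /* two plotted dimensions */
    alpha = 0.05;                  /* use 0.10 for a 90% limit, 0.01 for 99% */
    tsq = uss(prin1, prin2);       /* t-squared in the Prin1/Prin2 plane */
    tsq_lim = p*(n-1)*finv(1-alpha, p, n-p)/(n-p);
    outlier = (tsq > tsq_lim);     /* 1 = outside the limit, 0 = inside */
run;

/* List only the flagged observations */
proc print data=class_flag;
    where outlier = 1;
    var name prin1 prin2 tsq tsq_lim;
run;
```

Changing alpha in one place changes the threshold, which also covers the 90% / 99% question.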

 

 

 

--
Paige Miller


6 REPLIES 6
david27
Quartz | Level 8

Apologies but need to clarify the question.

 

How can I get the outlier information in a dataset?

For example:
Robert (obs=16) falls outside the 95% line. I want a dataset which has only this observation or at least identifies this observation as falling outside the 95% line.

 

Also,

That brings another question:

Can we change that 95% threshold to 99% threshold or 90% threshold?

 

Thanks

 

 

 

david27
Quartz | Level 8

Thank You very much @PaigeMiller 

 

You helped me understand Proc Princomp more and also provided an alternative- proc pls.

 

Thank You Again...

 

david27
Quartz | Level 8

Hello @PaigeMiller 

 

Coming back to this after a long time.

I have a quick question about your comment: "p is the number of dimensions (since the ellipses are drawn in two dimensions, I believe SAS used p=2,"

 

Ellipses will always be drawn in 2-D, and that would make p=2 all the time.

 

If we take, say, 5 variables in our predictions for sashelp.cars:

HORSEPOWER MPG_HIGHWAY WEIGHT LENGTH WHEELBASE

 

Do we have

p=2 because ellipses are drawn in 2-Dimensions?

OR

p=4 because we are taking 5 variables in our prediction (5-1)?

OR

p=5 because we have 5 dimensions- variables in our prediction?

PaigeMiller
Diamond | Level 26

If the question is: I see the ellipse on a two-dimensional plot and I want to know if points are outside the ellipse, then p=2.

 

But if you want to ask (which is entirely reasonable) whether this is an outlier in 5-dimensional space, then p=5. You can't draw a 5-dimensional plot, but the question is answered the same way, and people will often plot the t-squared number against the limit: not a scatter plot, but more like a trend plot with an upper limit.
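
As a rough sketch of that p=5 case, using the five sashelp.cars variables from the question: the dataset names cars_out, cars_std and cars_tsq and the variable obs are invented for the example, and the code assumes none of the five variables has missing values (otherwise n would need to be counted differently). It applies the same limit formula as above with p=5.

```sas
/* PCA on the five variables; scores go to prin1-prin5 */
proc princomp data=sashelp.cars out=cars_out;
    var horsepower mpg_highway weight length wheelbase;
run;

/* Standardize all five component scores */
proc stdize data=cars_out out=cars_std method=std;
    var prin1-prin5;
run;

data cars_tsq;
    set cars_std nobs=n;           /* n = number of observations */
    p = 5;                         /* all five dimensions */
    obs = _n_;                     /* observation number for the x axis */
    tsq = uss(of prin1-prin5);     /* t-squared in 5-dimensional space */
    tsq_lim = p*(n-1)*finv(0.95, p, n-p)/(n-p);
    outlier = (tsq > tsq_lim);
run;

/* Trend-style plot: t-squared per observation with the upper limit */
proc sgplot data=cars_tsq;
    needle x=obs y=tsq;
    series x=obs y=tsq_lim / lineattrs=(pattern=dash);
run;
```

Observations whose needle rises above the dashed line are the ones flagged with outlier=1 in cars_tsq.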

--
Paige Miller