Solved: Re: Proc Princomp - Outlier Observations identification

david27 · Posted 03-29-2019 12:49 AM

Hello,

I am learning Proc Princomp for Principal Component Analysis.

I have this code:

ods graphics on;
proc princomp data=sashelp.class PLOTS=SCORE(ELLIPSE NCOMP=3) out=class_out outstat=class_stat;
run;
ods graphics off;

This produces the attached score plot:

Based on the plot, Robert is an outlier because he falls outside the 95% ellipse.

How can I get that information in a dataset without referring to the plot?

What option am I missing here?

My class_out dataset does not seem to identify observation #16 (Robert) as an outlier.

Please advise.

Thanks

PaigeMiller · Posted 03-29-2019 06:55 AM

Anything in the PROC PRINCOMP (or any other PROC) output can be included in a SAS data set, using ODS OUTPUT.

https://documentation.sas.com/?docsetId=odsug&docsetTarget=p0oxrbinw6fjuwn1x23qam6dntyd.htm&docsetVe...

https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_princomp_details07.htm&docsetVers...
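
As a minimal sketch of the ODS OUTPUT approach mentioned above: the example below saves two of the PROC PRINCOMP output tables to data sets. The data set names eig_tbl and evec_tbl are made up for illustration. Running the procedure with ODS TRACE ON lists the table names available in your release.

```sas
/* Save the Eigenvalues and Eigenvectors tables as data sets. */
/* eig_tbl and evec_tbl are hypothetical names chosen for this sketch. */
ods output Eigenvalues=eig_tbl Eigenvectors=evec_tbl;
proc princomp data=sashelp.class;
run;

/* Inspect one of the saved tables */
proc print data=eig_tbl;
run;
```

Note that this captures printed tables only. It does not by itself flag which observations fall outside the ellipse.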

--
Paige Miller

david27 · Posted 03-29-2019 08:48 AM

Apologies, but I need to clarify the question.

How can I get the outlier information in a dataset?

For example:
Robert (obs=16) falls outside the 95% line. I want a dataset that contains only this observation, or at least identifies this observation as falling outside the 95% line.

That also brings up another question:

Can we change that 95% threshold to a 99% or 90% threshold?

Thanks

PaigeMiller · Posted 03-29-2019 10:00 AM

@david27 wrote:

Apologies, but I need to clarify the question.

How can I get the outlier information in a dataset?

For example:
Robert (obs=16) falls outside the 95% line. I want a dataset that contains only this observation, or at least identifies this observation as falling outside the 95% line.

That also brings up another question:

Can we change that 95% threshold to a 99% or 90% threshold?

So, my apologies for my earlier answer being somewhat off target.

The ellipse that determines the "outliers" is actually a multivariate T-squared calculation, and isn't hard to get from the PCA outputs, but you have to know the steps.

It is easier to get the t-squared values for a PCA analysis using PROC PLS rather than using PROC PRINCOMP.

proc pls data=sashelp.class;
	model age height weight = age height weight;
	output out=pls_stats tsquare=tsq;
run;

To understand this: a PLS analysis in which the x and y variables in the model are identical produces a PCA! It also produces the t-squared value, which can then be used to determine whether an observation is inside or outside the ellipse.

Then you compare t-squared to the ellipse limit, which is computed from the formula T-SQUARED_limit = p(n-1)*F/(n-p), where n is the number of data points (19), p is the number of dimensions (since the ellipses are drawn in two dimensions, I believe SAS used p=2), and F is the value from the F distribution table with p and n-p degrees of freedom.

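data _null_;
    p=2;                        /* dimensions of the plotted ellipse */
    n=19;                       /* number of observations in sashelp.class */
    f=finv(0.95,p,n-p);         /* 95% quantile of F with p and n-p df */
    tsq_lim=p*(n-1)*f/(n-p);    /* T-squared limit */
    put tsq_lim= f=;            /* write both values to the log */
run;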

If you want a 90% limit instead of a 95% limit, change the first argument of FINV above from 0.95 to 0.90. To draw the ellipses with 90% confidence, you would change the alpha option PLOTS=(SCORE(ALPHA=10)) in PROC PRINCOMP.
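
To illustrate how the limit moves with the confidence level, here is a small sketch that tabulates the formula above for 90%, 95%, and 99%. It assumes the same p=2 and n=19 as the post. The data set name tsq_limits and the variable conf are made up for this example.

```sas
/* Tabulate the T-squared limit for several confidence levels. */
/* p and n follow the post; tsq_limits and conf are hypothetical names. */
data tsq_limits;
    p = 2;
    n = 19;
    do conf = 0.90, 0.95, 0.99;
        f = finv(conf, p, n-p);          /* F quantile at this level */
        tsq_lim = p*(n-1)*f/(n-p);       /* limit formula from the post */
        output;
    end;
run;

proc print data=tsq_limits noobs;
run;
```

A higher confidence level gives a larger F quantile and therefore a larger limit, so fewer points are flagged.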

For completeness, here are the calculations of t-squared from PROC PRINCOMP:

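/* Standardize each score to mean 0 and standard deviation 1 (PROC STDIZE default) */
proc stdize data=class_out out=class_out1;
	var prin1-prin3;
run;

/* T-squared is the sum of squares of the standardized scores */
data tsquared;
	set class_out1;
	tsq=uss(of prin1-prin3);
run;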
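
Putting the pieces above together, the following is a sketch of one way to get the flagged observations into a data set, which is what the original question asked for. It rests on several assumptions: it checks only the Prin1 versus Prin2 pair (so p=2), it uses the limit formula from this post without verifying that it reproduces the plotted ellipse exactly, and the names class_std, class_flagged, class_outliers, and outlier are invented for this example.

```sas
/* Principal component scores (Prin1-Prin3) for sashelp.class */
proc princomp data=sashelp.class out=class_out noprint;
run;

/* Standardize the two scores that form the plotted pair */
proc stdize data=class_out out=class_std method=std;
    var prin1 prin2;
run;

/* Compute T-squared and compare it with the limit */
data class_flagged;
    set class_std nobs=n;                /* n = number of observations */
    p = 2;                               /* two plotted dimensions (assumption) */
    alpha = 0.05;                        /* 0.10 gives a 90% limit */
    tsq = uss(prin1, prin2);             /* sum of squared standardized scores */
    tsq_lim = p*(n-1)*finv(1-alpha, p, n-p)/(n-p);
    outlier = (tsq > tsq_lim);           /* 1 if outside the limit, else 0 */
run;

/* Keep only the flagged observations */
data class_outliers;
    set class_flagged;
    where outlier = 1;
run;
```

Because PLOTS=SCORE(NCOMP=3) draws several pairwise panels, the same check would need to be repeated for the other pairs (Prin1 and Prin3, Prin2 and Prin3) to cover every panel.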

--
Paige Miller

david27 · Posted 03-29-2019 11:48 AM

Thank you very much, @PaigeMiller.

You helped me understand PROC PRINCOMP better and also provided an alternative, PROC PLS.

Thank You Again...

david27 · Posted 06-03-2020 10:13 AM

Hello @PaigeMiller

So, coming back to this after a long time.

I had a quick question on your comment: "p is the number of dimensions (since the ellipses are drawn in two dimensions, I believe SAS used p=2,"

Ellipses will always be drawn in 2-D, and that would make p=2 all the time.

Suppose we use 5 variables from sashelp.cars in our analysis:

HORSEPOWER MPG_HIGHWAY WEIGHT LENGTH WHEELBASE

Do we have

p=2 because ellipses are drawn in 2 dimensions?

OR

p=4 because we are using 5 variables (5-1)?

OR

p=5 because we have 5 dimensions (the 5 variables)?

PaigeMiller · Posted 06-03-2020 11:24 AM

If the question is "I see the ellipse on a two-dimensional plot and I want to know whether points are outside the ellipse," then p=2.

But if you want to ask whether a point is an outlier in 5-dimensional space (which is an entirely reasonable question), then p=5. You can't draw a 5-dimensional plot, but the question is answered the same way, and people will often plot the t-squared value against the limit: not as a scatter plot, but more like a trend plot with an upper limit.
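
A rough sketch of the 5-dimensional case, using the five sashelp.cars variables named in the question. It assumes those variables have no missing values (so that NOBS= matches the number of scored observations) and reuses the limit formula from earlier in the thread with p=5. The names cars_out, cars_std, cars_tsq, obs, and outlier are made up for this example.

```sas
/* Scores on all five principal components */
proc princomp data=sashelp.cars out=cars_out noprint;
    var horsepower mpg_highway weight length wheelbase;
run;

/* Standardize the five scores to mean 0, standard deviation 1 */
proc stdize data=cars_out out=cars_std method=std;
    var prin1-prin5;
run;

/* T-squared in five dimensions and its limit */
data cars_tsq;
    set cars_std nobs=n;                 /* assumes no missing values */
    p = 5;                               /* all five dimensions */
    obs = _n_;                           /* observation index for plotting */
    tsq = uss(of prin1-prin5);
    tsq_lim = p*(n-1)*finv(0.95, p, n-p)/(n-p);
    outlier = (tsq > tsq_lim);
run;

/* Trend-style plot: T-squared per observation against the upper limit */
proc sgplot data=cars_tsq;
    needle x=obs y=tsq;
    series x=obs y=tsq_lim / lineattrs=(pattern=dash);
run;
```

Observations whose needle rises above the dashed line are the ones flagged with outlier=1 in cars_tsq.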

--
Paige Miller