Posted 07-12-2015 09:35 AM
(1644 views)

How sensitive is SAS/IML to two things:

- Extreme values or outliers in the data -- does this influence the eigenvalues?

- Very large, high dimensional data -- how much slower is the routine?

Any insights into these questions would be most appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions

The SVD of X (or, similarly, a principal component analysis on X`X) is a least-squares technique and is consequently influenced by outliers, just as OLS regression is. You can use the MCD routine to create a robust principal component analysis, as shown in Wicklin (2010).
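To make the sensitivity concrete, here is a minimal sketch that is not part of the original reply. It assumes simulated bivariate normal data; the seed, the sample size, and the outlier value `{50 50}` are arbitrary choices for illustration. It compares the eigenvalues of the classical covariance matrix before and after one row is replaced by an extreme point.

```sas
proc iml;
call randseed(12345);              /* arbitrary seed, for reproducibility */
N = 100;                           /* hypothetical sample size */
x = j(N, 2);
call randgen(x, "Normal");         /* two uncorrelated standard normal columns */

evalClean = eigval(cov(x));        /* eigenvalues of the classical covariance */

y = x;
y[1, ] = {50 50};                  /* replace one observation with an extreme point */
evalOutlier = eigval(cov(y));      /* eigenvalues after contamination */

print evalClean evalOutlier;
quit;
```

With a single contaminated row, the largest eigenvalue should be much larger than in the clean data, which is the behavior a robust estimate such as MCD is meant to avoid.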

The SVD and eigenvalue routines can be computationally expensive. You can use the techniques shown in the article "The Power Method" to time the computations for inputs of various sizes. The article shows a graph for the eigenvalue computations on square NxN matrices. The time increases superlinearly. Run the program on your hardware to get estimates that are customized to your configuration.
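A timing experiment along those lines might look like the following sketch. It is not the program from the article; the matrix sizes and the use of a random symmetric matrix `A`*A` are assumptions made only for illustration.

```sas
proc iml;
call randseed(1);                  /* arbitrary seed */
sizes = do(200, 1000, 200);        /* hypothetical matrix sizes: 200, 400, ..., 1000 */
elapsed = j(ncol(sizes), 1, .);    /* elapsed seconds for each size */

do i = 1 to ncol(sizes);
   n = sizes[i];
   A = j(n, n);
   call randgen(A, "Normal");
   S = A` * A;                     /* symmetric n x n test matrix */
   t0 = time();                    /* start the clock */
   call eigen(eVal, eVec, S);      /* the computation being timed */
   elapsed[i] = time() - t0;
end;

sizeCol = sizes`;
print sizeCol elapsed;
quit;
```

The timings depend on the hardware and the SAS release, so the useful output is the growth pattern across sizes rather than any single number.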

10 REPLIES 10

Rick-

That's fantastic. Thank you!

I just have one more question: there's an approach to PCA that leverages the Lanczos algorithm, as described here (https://en.wikipedia.org/wiki/Lanczos_algorithm), with the complete method described in the arXiv paper "Cauchy Principal Component Analysis" (arXiv:1412.6506).

Any chance the Lanczos algorithm has a pre-existing routine or module in SAS/IML?

Thomas

Ok, thanks again.

Here's a blog post that directly addresses your question about timing eigenvalue computations:

http://blogs.sas.com/content/iml/2015/07/13/performance-of-algorithms.html

Great! Thank you...

Rick-

Thanks again for your help with developing a robust PCA. I'm a novice user of IML. I'm working with the SAS paper you wrote, Paper 329-2010, *Rediscovering SAS/IML® Software: Modern Data Analysis for the Practicing Statistician*. In addition, I have your book *Statistical Programming with SAS/IML*. The matrix I'm analyzing for the robust PCA is 40 columns by 900 rows. I have a question for you about some code in your paper:

On page 9 you write:

/* compute PCA of robust covariance matrix */

p = ncol(x); /* 20 variables */

RobustCov = est[3:2+p, ]; /* robust estimate of shape parameters */

call eigen(eVal, eVec, RobustCov); /* PCA = eigenvectors of RobustCov */

I can't find any documentation on the "est" function in your book. So, when you write

RobustCov = est[3:2+p, ];

What exactly is that line of IML code doing? Is the "3:2+p" based on the sample data you're using? If I want to rewrite that line of code for the 40x900 matrix I'm working with, what would it look like? Currently, I'm getting errors with a simple replication of your code:

Cut and paste of IML error messages in my SAS log:

p=ncol(x);

RobustCov=ext[3:2+p, ];

ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1231 column 14

operands : ext, *LIT1001, _TEM1001,

ext 0 row 0 col (type ?, size 0)

*LIT1001 1 row 1 col (numeric)

3

_TEM1001 1 row 1 col (numeric)

39

statement : ASSIGN at line 1231 column 1

1232 call eigen(eVal, Evec, RobustCov);

ERROR: (execution) Character argument should be numeric.

operation : EIGEN at line 1232 column 1

operands : RobustCov

RobustCov 0 row 0 col (type ?, size 0)

statement : CALL at line 1232 column 1

1233 VarExplained=cusum(eval)/sum(eval);

ERROR: (execution) Matrix has not been set to a value.

operation : CUSUM at line 1233 column 19

operands : eval

eVal 0 row 0 col (type ?, size 0)

statement : ASSIGN at line 1233 column 1

1234 NumPC=1;

1235 RobustLoc=est[1,];

ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1235 column 14

operands : est, *LIT1004,

est 0 row 0 col (type ?, size 0)

*LIT1004 1 row 1 col (numeric)

1

statement : ASSIGN at line 1235 column 1

1236 c=(x-RobustLoc);

ERROR: (execution) Matrix has not been set to a value.

operation : - at line 1236 column 5

operands : x, RobustLoc

x 897 rows 37 cols (numeric)

RobustLoc 0 row 0 col (type ?, size 0)

statement : ASSIGN at line 1236 column 1

1237 Scores=c*eVec[,1:NumPC];

ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1237 column 14

operands : eVec, , *LIT1005, NumPC

Evec 0 row 0 col (type ?, size 0)

*LIT1005 1 row 1 col (numeric)

1

NumPC 1 row 1 col (numeric)

1

statement : ASSIGN at line 1237 column 1

1238 print scores;

ERROR: Matrix Scores has not been set to a value.

statement : PRINT at line 1238 column 1

1239

1240 create RobustPCA var {pgm scores};

1241 append;

1242 close robustpca;

NOTE: The data set WORK.ROBUSTPCA has 897 observations and 2 variables.

What am I doing wrong?

Thank you,

Thomas

In SAS/IML, function calls use parentheses, so est is not a function. Subscripting of a matrix is accomplished by using square brackets, and that is what the expression est[3:2+p, ]; is doing.

The analysis on p. 9 is a continuation of the analysis on p. 8. To prevent the errors you are seeing, read the X matrix, set up the option vector, and call MCD.

The 'est' matrix was returned by the CALL MCD subroutine as the second argument. The documentation for the MCD subroutine says that this matrix has p columns, where p is the number of columns of the data matrix (in your case, 40). The rows are

- est[1, ] = location of ellipsoid center
- est[2, ] = eigenvalues of final robust scatter matrix
- est[3:2+p, ] = the final robust scatter matrix

Thus if you define p=ncol(X), you shouldn't have to change any code. I assume you know that the notation y[1, ] means the first row of y. If not, read section 2.6 of my book.
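Putting the steps above together, the full sequence might look like the sketch below. The data set name `MyData` is hypothetical, and the 8-element option vector filled with missing values (to request default options) is an assumption to check against the MCD documentation for your release.

```sas
proc iml;
use MyData;                        /* hypothetical data set name */
read all var _NUM_ into x;         /* N x p numeric data matrix */
close MyData;

N = nrow(x);
p = ncol(x);
optn = j(8, 1, .);                 /* assumption: missing values request the defaults */
call MCD(sc, est, dist, optn, x);  /* 'est' is returned as the second argument */

RobustLoc = est[1, ];              /* 1 x p robust center */
RobustCov = est[3:2+p, ];          /* p x p robust scatter matrix */
call eigen(eVal, eVec, RobustCov); /* PCA = eigenvectors of RobustCov */
VarExplained = cusum(eVal) / sum(eVal);
print VarExplained;
quit;
```

Because `est` is created by the CALL MCD statement, any line that subscripts `est` has to run after that call in the same PROC IML session.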

@xtc283x In a private communication you said that I did not answer your question. I have re-read your post. Here are my answers to your four questions:

*What exactly is that line of IML code doing?* It extracts rows 3 through (2+p), where p is the number of variables in your problem.

*Is the "3:2+p" based on the sample data you're using?* No, it will work for any data.

*If I want to rewrite that line of code for the 40x900 matrix I'm working with, what would it look like?* A matrix with only 40 rows and 900 columns is degenerate. There are not enough observations to have a nonsingular covariance matrix. When you try to call the MCD function you will get an error like "ERROR: The number of observations (40) must be larger than the number of variables (900)."

*What am I doing wrong?* You are doing two things wrong. Your statistical error is that you must have more rows than columns in order to run Rousseeuw’s minimum covariance determinant (MCD) algorithm. Your programming error is that you must run the MCD analysis on p. 8 before you can extract the results from the EST matrix.
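The row extraction described in the first answer can be tried on a small matrix. This is only an illustrative sketch: the 6 x 4 matrix `M` and the value `p = 4` are made up, and `M` merely stands in for the first 2+p rows of `est`. In SAS/IML the `+` operator binds more tightly than the index operator `:`, so `3:2+p` is evaluated as `3:(2+p)`.

```sas
proc iml;
p = 4;                             /* hypothetical number of variables */
M = shape(1:24, 6, 4);             /* made-up (2+p) x p matrix, filled row by row */

idx = 3:2+p;                       /* evaluated as 3:(2+p), i.e. rows 3, 4, 5, 6 */
sub = M[idx, ];                    /* rows 3 through 2+p, all columns: a p x p block */

print idx, sub;
quit;
```

Nothing in the subscript depends on a particular data set; only the value of `p` changes from problem to problem.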

Thank you.
