turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-12-2015 09:35 AM

How sensitive is SAS/IML to two things:

- Extreme values or outliers in the data -- does this influence the eigenvalues?

- Very large, high dimensional data -- how much slower is the routine?

Any insights into these questions would be most appreciated.sas

Accepted Solutions

Solution

07-12-2015
01:18 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-12-2015 01:18 PM

The SVD on X (or, similarly, a principal component analysis on X`X) is a least-squarse technique, and consequently influenced by outliers just as OLS regression is. You can use the MCD routine to create a ROBUST prinicipal component analysis, as shown in Wicklin 2010.

The SVD and eigenvalue routines can be computationally expensive. You can use the techniques shown in the article "The Power Method" to time the computations for various size inputs. The article shows a graph for the eigenvalue computations on square NxN matrices. The time increases superlinearly. Run the program on your hardware to get estimates that are customerized to your configuration.

All Replies

Solution

07-12-2015
01:18 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-12-2015 01:18 PM

The SVD on X (or, similarly, a principal component analysis on X`X) is a least-squarse technique, and consequently influenced by outliers just as OLS regression is. You can use the MCD routine to create a ROBUST prinicipal component analysis, as shown in Wicklin 2010.

The SVD and eigenvalue routines can be computationally expensive. You can use the techniques shown in the article "The Power Method" to time the computations for various size inputs. The article shows a graph for the eigenvalue computations on square NxN matrices. The time increases superlinearly. Run the program on your hardware to get estimates that are customerized to your configuration.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-12-2015 01:50 PM

Rick-

That's fantastic. Thank you!

I just have one more question: there's an approach to PCA that leverages the Lanczos algorithm as described here (https://en.wikipedia.org/wiki/Lanczos_algorithm ) with the complete method described here ( [1412.6506] Cauchy Principal Component Analysis ).

Any chance the Lanczos algorithm has a pre-existing routine or module in SAS/IML?

Thomas

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-12-2015 03:32 PM

No. The Lanczos algorithm is used for computing the first k eigenvalues/vectors, but it is not implemented in SAS/IML. It is possible that the SVD used by SAS Text Mining has an option to use the Lanczos method, but you'd have to ask someone in the text mining community.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-12-2015 04:02 PM

Ok, thanks again.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-13-2015 07:17 AM

Here's a blog post that directly addresses your question about timing eigenvalue computations:

http://blogs.sas.com/content/iml/2015/07/13/performance-of-algorithms.html

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-13-2015 07:34 AM

Great! Thank you...

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-14-2015 02:04 PM

Rick-

Thanks again for your help with developing a robust PCA. I'm a novice user of IML. I'm working with the SAS document you wrote --* Paper 329-2010 **Rediscovering SAS/IML® Software: Modern Data Analysis for the Practicing Statistician.* In addition, I have your book on *Statistical Programming with SAS/IML.* The matrix I'm analyzing for the robust PCA is 40 columns by 900 rows. I have a question for you about some code in your paper:

On page 9 you write:

/* compute PCA of robust covariance matrix */

p = ncol(x); /* 20 variables */

RobustCov = est[3:2+p, ]; /* robust estimate of shape parameters */

call eigen(eVal, eVec, RobustCov); /* PCA = eigenvectors of RobustCov */

I can't find any documentation on the "est" function in your book. So, when you write

RobustCov = est[3:2+p, ];

What exactly is that line of IML code doing? Is the "3:2+p" based on the sample data you're using? If I want to rewrite that line of code for the 40x900 matrix I'm working with, what would it look like? Currently, I'm getting errors with a simple replication of your code:

Cut and paste of IML error messages in my SAS log:

p=ncol(x);

RobustCov=ext[3:2+p, ];

ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1231 column 14

operands : ext, *LIT1001, _TEM1001,

ext 0 row 0 col (type ?, size 0)

*LIT1001 1 row 1 col (numeric)

3

_TEM1001 1 row 1 col (numeric)

39

statement : ASSIGN at line 1231 column 1

1232 call eigen(eVal, Evec, RobustCov);

ERROR: (execution) Character argument should be numeric.

operation : EIGEN at line 1232 column 1

operands : RobustCov

RobustCov 0 row 0 col (type ?, size 0)

statement : CALL at line 1232 column 1

1233 VarExplained=cusum(eval)/sum(eval);

ERROR: (execution) Matrix has not been set to a value.

operation : CUSUM at line 1233 column 19

operands : eval

eVal 0 row 0 col (type ?, size 0)

statement : ASSIGN at line 1233 column 1

1234 NumPC=1;

1235 RobustLoc=est[1,];

ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1235 column 14

operands : est, *LIT1004,

est 0 row 0 col (type ?, size 0)

*LIT1004 1 row 1 col (numeric)

1

statement : ASSIGN at line 1235 column 1

1236 c=(x-RobustLoc);

ERROR: (execution) Matrix has not been set to a value.

operation : - at line 1236 column 5

operands : x, RobustLoc

x 897 rows 37 cols (numeric)

RobustLoc 0 row 0 col (type ?, size 0)

statement : ASSIGN at line 1236 column 1

1237 Scores=c*eVec[,1:NumPC];

ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1237 column 14

operands : eVec, , *LIT1005, NumPC

Evec 0 row 0 col (type ?, size 0)

*LIT1005 1 row 1 col (numeric)

1

NumPC 1 row 1 col (numeric)

1

statement : ASSIGN at line 1237 column 1

1238 print scores;

ERROR: Matrix Scores has not been set to a value.

statement : PRINT at line 1238 column 1

1239

1240 create RobustPCA var {pgm scores};

1241 append;

1242 close robustpca;

NOTE: The data set WORK.ROBUSTPCA has 897 observations and 2 variables.

What am I doing wrong?

THank you,

Thomas

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-15-2015 09:58 AM

In SAS/IML, function calls use parentheses, so est is not a function. Subscripting of a matrix is accomplished by using square brackets, and that is what the expression est[3:2+p, ]; is doing.

The analysis on p. 9 is a continuation of the analysis on p. 8. To prevent the errors you are seeing, read the X matrix, set up the option vector, and call MCD.

The 'est' matrix was returned by the CALL MCD subroutine as the second argument. The documentation for the MCD subroutine says that this matrix has p columns, where p is the number of columns of the data matrix (in your case, 40). The rows are

est[1, ] = location of ellipsoid center est[2, ] = eigenvalues of final robust scatter matrix est[3:2+p, ] = the final robust scatter matrix

Thus if you define p=ncol(X), you shouldn't have to change any code. I assume you know that the notation y[1, ] means the first row of y. If not, read section 2.6 of my book.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-28-2015 03:50 PM

@xtc283x In a private communication you said that I did not answer your question. I have re-read your post. Here are my answers to your four questions:

*What exactly is that line of IML code doing?*It extracts rows 3 through (2+p), where p is the number of variables in your problem.-
*Is the "3:2+p" based on the sample data you're using?*No, it will work for any data. *If I want to rewrite that line of code for the 40x900 matrix I'm working with, what would it look like?*A matrix with only 40 rows and 900 columns is degenerate. There are not enough observations to have a nonsingular covariance matrix. When you try to call the MCD function you will get an error likeERROR: The number of observations (40) must be larger than the number of variables (900).

*What am I doing wrong?*You are doing two things wrong. Your statistical error is that you must have more rows than columns in order to run Rousseeuw’s minimum covariance determinant (MCD) algorithm. Your programming error is that you must run the MCD analysis on p. 8 before you can extract the results from the EST matrix.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-29-2015 05:38 AM

Thank you.