BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
xtc283x
Quartz | Level 8

How sensitive is SAS/IML to two things:

- Extreme values or outliers in the data -- does this influence the eigenvalues?

- Very large, high dimensional data -- how much slower is the routine?

Any insights into these questions would be most appreciated.sas

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

The SVD on X (or, similarly, a principal component analysis on X`X) is a least-squarse technique, and consequently influenced by outliers just as OLS regression is.  You can use the MCD routine to create a ROBUST prinicipal component analysis, as shown in Wicklin 2010.

The SVD and eigenvalue routines can be computationally expensive.  You can use the techniques shown in the article "The Power Method" to time the computations for various size inputs.  The article shows a graph for the eigenvalue computations on square NxN matrices.  The time increases superlinearly. Run the program on your hardware to get estimates that are customerized to your configuration.

View solution in original post

10 REPLIES 10
Rick_SAS
SAS Super FREQ

The SVD on X (or, similarly, a principal component analysis on X`X) is a least-squarse technique, and consequently influenced by outliers just as OLS regression is.  You can use the MCD routine to create a ROBUST prinicipal component analysis, as shown in Wicklin 2010.

The SVD and eigenvalue routines can be computationally expensive.  You can use the techniques shown in the article "The Power Method" to time the computations for various size inputs.  The article shows a graph for the eigenvalue computations on square NxN matrices.  The time increases superlinearly. Run the program on your hardware to get estimates that are customerized to your configuration.

xtc283x
Quartz | Level 8

Rick-

   That's fantastic. Thank you!

   I just have one more question: there's an approach to PCA that leverages the Lanczos algorithm as described here (https://en.wikipedia.org/wiki/Lanczos_algorithm ) with the complete method described here ( [1412.6506] Cauchy Principal Component Analysis ).

   Any chance the Lanczos algorithm has a pre-existing routine or module in SAS/IML?

Thomas

Rick_SAS
SAS Super FREQ

No. The Lanczos algorithm is used for computing the first k eigenvalues/vectors, but it is not implemented in SAS/IML.  It is possible that the SVD used by SAS Text Mining has an option to use the Lanczos method, but you'd have to ask someone in the text mining community.

xtc283x
Quartz | Level 8

Ok, thanks again.

Rick_SAS
SAS Super FREQ

Here's a blog post that directly addresses your question about timing eigenvalue computations:

http://blogs.sas.com/content/iml/2015/07/13/performance-of-algorithms.html

xtc283x
Quartz | Level 8

Great! Thank you...

xtc283x
Quartz | Level 8

Rick-

   Thanks again for your help with developing a robust PCA. I'm a novice user of IML. I'm working with the SAS document you wrote -- Paper 329-2010  Rediscovering SAS/IML® Software: Modern Data Analysis for the Practicing Statistician. In addition, I have your book on Statistical Programming with SAS/IML. The matrix I'm analyzing for the robust PCA is 40 columns by 900 rows. I have a question for you about some code in your paper:

 

On page 9 you write: 

 

   /* compute PCA of robust covariance matrix */
   p = ncol(x); /* 20 variables */
   RobustCov = est[3:2+p, ]; /* robust estimate of shape parameters */
   call eigen(eVal, eVec, RobustCov); /* PCA = eigenvectors of RobustCov */

 

I can't find any documentation on the "est" function in your book. So, when you write

 

                   RobustCov = est[3:2+p, ];

 

What exactly is that line of IML code doing? Is the "3:2+p" based on the sample data you're using? If I want to rewrite that line of code for the 40x900 matrix I'm working with, what would it look like? Currently, I'm getting errors with a simple replication of your code:

 

Cut and paste of IML error messages in my SAS log:

 

p=ncol(x);
RobustCov=ext[3:2+p, ];
ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1231 column 14
operands : ext, *LIT1001, _TEM1001,

ext 0 row 0 col (type ?, size 0)


*LIT1001 1 row 1 col (numeric)

3

_TEM1001 1 row 1 col (numeric)

39

statement : ASSIGN at line 1231 column 1
1232 call eigen(eVal, Evec, RobustCov);
ERROR: (execution) Character argument should be numeric.

operation : EIGEN at line 1232 column 1
operands : RobustCov

RobustCov 0 row 0 col (type ?, size 0)


statement : CALL at line 1232 column 1
1233 VarExplained=cusum(eval)/sum(eval);
ERROR: (execution) Matrix has not been set to a value.

operation : CUSUM at line 1233 column 19
operands : eval

eVal 0 row 0 col (type ?, size 0)


statement : ASSIGN at line 1233 column 1
1234 NumPC=1;
1235 RobustLoc=est[1,];
ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1235 column 14
operands : est, *LIT1004,

est 0 row 0 col (type ?, size 0)


*LIT1004 1 row 1 col (numeric)

1

statement : ASSIGN at line 1235 column 1
1236 c=(x-RobustLoc);
ERROR: (execution) Matrix has not been set to a value.

operation : - at line 1236 column 5
operands : x, RobustLoc
x 897 rows 37 cols (numeric)

RobustLoc 0 row 0 col (type ?, size 0)


statement : ASSIGN at line 1236 column 1
1237 Scores=c*eVec[,1:NumPC];
ERROR: (execution) Matrix has not been set to a value.

operation : [ at line 1237 column 14
operands : eVec, , *LIT1005, NumPC

Evec 0 row 0 col (type ?, size 0)


*LIT1005 1 row 1 col (numeric)

1

NumPC 1 row 1 col (numeric)

1

statement : ASSIGN at line 1237 column 1
1238 print scores;
ERROR: Matrix Scores has not been set to a value.

statement : PRINT at line 1238 column 1
1239
1240 create RobustPCA var {pgm scores};
1241 append;
1242 close robustpca;
NOTE: The data set WORK.ROBUSTPCA has 897 observations and 2 variables.

 

 

What am I doing wrong?

 

THank you,

Thomas

 

 

Rick_SAS
SAS Super FREQ

In SAS/IML, function calls use parentheses, so est is not a function.  Subscripting of a matrix is accomplished by using square brackets, and that is what the expression est[3:2+p, ]; is doing.

 

The analysis on p. 9 is a continuation of the analysis on p. 8. To prevent the errors you are seeing, read the X matrix, set up the option vector, and call MCD.

 

The 'est' matrix was returned by the CALL MCD subroutine as the second argument. The documentation for the MCD subroutine says that this matrix has p columns, where p is the number of columns of the data matrix (in your case, 40). The rows are

est[1, ] = location of ellipsoid center est[2, ] = eigenvalues of final robust scatter matrix est[3:2+p, ] = the final robust scatter matrix

Thus if you define p=ncol(X), you shouldn't have to change any code.   I assume you know that the notation y[1, ] means the first row of y. If not, read section 2.6 of my book.

 

Rick_SAS
SAS Super FREQ

@xtc283x In a private communication you said that I did not answer your question. I have re-read your post. Here are my answers to your four questions:

  1. What exactly is that line of IML code doing? It extracts rows 3 through (2+p), where p is the number of variables in your problem.
  2.  Is the "3:2+p" based on the sample data you're using? No, it will work for any data.
  3. If I want to rewrite that line of code for the 40x900 matrix I'm working with, what would it look like? A matrix with only 40 rows and 900 columns is degenerate. There are not enough observations to have a nonsingular covariance matrix.  When you try to call the MCD function you will get an error like

    ERROR: The number of observations (40) must be larger than the number of variables (900).

  4. What am I doing wrong? You are doing two things wrong. Your statistical error is that you must have more rows than columns in order to run Rousseeuw’s minimum covariance determinant (MCD) algorithm. Your programming error is that you must run the MCD analysis on p. 8 before you can extract the results from the EST matrix.

 

xtc283x
Quartz | Level 8

Thank you.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 10 replies
  • 2319 views
  • 3 likes
  • 2 in conversation