How sensitive is SAS/IML to two things:
- Extreme values or outliers in the data -- does this influence the eigenvalues?
- Very large, high dimensional data -- how much slower is the routine?
Any insights into these questions would be most appreciated.sas
The SVD on X (or, similarly, a principal component analysis on X`X) is a least-squarse technique, and consequently influenced by outliers just as OLS regression is. You can use the MCD routine to create a ROBUST prinicipal component analysis, as shown in Wicklin 2010.
The SVD and eigenvalue routines can be computationally expensive. You can use the techniques shown in the article "The Power Method" to time the computations for various size inputs. The article shows a graph for the eigenvalue computations on square NxN matrices. The time increases superlinearly. Run the program on your hardware to get estimates that are customerized to your configuration.
The SVD on X (or, similarly, a principal component analysis on X`X) is a least-squarse technique, and consequently influenced by outliers just as OLS regression is. You can use the MCD routine to create a ROBUST prinicipal component analysis, as shown in Wicklin 2010.
The SVD and eigenvalue routines can be computationally expensive. You can use the techniques shown in the article "The Power Method" to time the computations for various size inputs. The article shows a graph for the eigenvalue computations on square NxN matrices. The time increases superlinearly. Run the program on your hardware to get estimates that are customerized to your configuration.
Rick-
That's fantastic. Thank you!
I just have one more question: there's an approach to PCA that leverages the Lanczos algorithm as described here (https://en.wikipedia.org/wiki/Lanczos_algorithm ) with the complete method described here ( [1412.6506] Cauchy Principal Component Analysis ).
Any chance the Lanczos algorithm has a pre-existing routine or module in SAS/IML?
Thomas
No. The Lanczos algorithm is used for computing the first k eigenvalues/vectors, but it is not implemented in SAS/IML. It is possible that the SVD used by SAS Text Mining has an option to use the Lanczos method, but you'd have to ask someone in the text mining community.
Ok, thanks again.
Here's a blog post that directly addresses your question about timing eigenvalue computations:
http://blogs.sas.com/content/iml/2015/07/13/performance-of-algorithms.html
Great! Thank you...
Rick-
Thanks again for your help with developing a robust PCA. I'm a novice user of IML. I'm working with the SAS document you wrote -- Paper 329-2010 Rediscovering SAS/IML® Software: Modern Data Analysis for the Practicing Statistician. In addition, I have your book on Statistical Programming with SAS/IML. The matrix I'm analyzing for the robust PCA is 40 columns by 900 rows. I have a question for you about some code in your paper:
On page 9 you write:
/* compute PCA of robust covariance matrix */
p = ncol(x); /* 20 variables */
RobustCov = est[3:2+p, ]; /* robust estimate of shape parameters */
call eigen(eVal, eVec, RobustCov); /* PCA = eigenvectors of RobustCov */
I can't find any documentation on the "est" function in your book. So, when you write
RobustCov = est[3:2+p, ];
What exactly is that line of IML code doing? Is the "3:2+p" based on the sample data you're using? If I want to rewrite that line of code for the 40x900 matrix I'm working with, what would it look like? Currently, I'm getting errors with a simple replication of your code:
Cut and paste of IML error messages in my SAS log:
p=ncol(x);
RobustCov=ext[3:2+p, ];
ERROR: (execution) Matrix has not been set to a value.
operation : [ at line 1231 column 14
operands : ext, *LIT1001, _TEM1001,
ext 0 row 0 col (type ?, size 0)
*LIT1001 1 row 1 col (numeric)
3
_TEM1001 1 row 1 col (numeric)
39
statement : ASSIGN at line 1231 column 1
1232 call eigen(eVal, Evec, RobustCov);
ERROR: (execution) Character argument should be numeric.
operation : EIGEN at line 1232 column 1
operands : RobustCov
RobustCov 0 row 0 col (type ?, size 0)
statement : CALL at line 1232 column 1
1233 VarExplained=cusum(eval)/sum(eval);
ERROR: (execution) Matrix has not been set to a value.
operation : CUSUM at line 1233 column 19
operands : eval
eVal 0 row 0 col (type ?, size 0)
statement : ASSIGN at line 1233 column 1
1234 NumPC=1;
1235 RobustLoc=est[1,];
ERROR: (execution) Matrix has not been set to a value.
operation : [ at line 1235 column 14
operands : est, *LIT1004,
est 0 row 0 col (type ?, size 0)
*LIT1004 1 row 1 col (numeric)
1
statement : ASSIGN at line 1235 column 1
1236 c=(x-RobustLoc);
ERROR: (execution) Matrix has not been set to a value.
operation : - at line 1236 column 5
operands : x, RobustLoc
x 897 rows 37 cols (numeric)
RobustLoc 0 row 0 col (type ?, size 0)
statement : ASSIGN at line 1236 column 1
1237 Scores=c*eVec[,1:NumPC];
ERROR: (execution) Matrix has not been set to a value.
operation : [ at line 1237 column 14
operands : eVec, , *LIT1005, NumPC
Evec 0 row 0 col (type ?, size 0)
*LIT1005 1 row 1 col (numeric)
1
NumPC 1 row 1 col (numeric)
1
statement : ASSIGN at line 1237 column 1
1238 print scores;
ERROR: Matrix Scores has not been set to a value.
statement : PRINT at line 1238 column 1
1239
1240 create RobustPCA var {pgm scores};
1241 append;
1242 close robustpca;
NOTE: The data set WORK.ROBUSTPCA has 897 observations and 2 variables.
What am I doing wrong?
THank you,
Thomas
In SAS/IML, function calls use parentheses, so est is not a function. Subscripting of a matrix is accomplished by using square brackets, and that is what the expression est[3:2+p, ]; is doing.
The analysis on p. 9 is a continuation of the analysis on p. 8. To prevent the errors you are seeing, read the X matrix, set up the option vector, and call MCD.
The 'est' matrix was returned by the CALL MCD subroutine as the second argument. The documentation for the MCD subroutine says that this matrix has p columns, where p is the number of columns of the data matrix (in your case, 40). The rows are
Thus if you define p=ncol(X), you shouldn't have to change any code. I assume you know that the notation y[1, ] means the first row of y. If not, read section 2.6 of my book.
@xtc283x In a private communication you said that I did not answer your question. I have re-read your post. Here are my answers to your four questions:
ERROR: The number of observations (40) must be larger than the number of variables (900).
Thank you.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.