Are there any differences between how proc corr computes a correlation matrix and how other procs that rely on a correlation matrix compute it?
For example, proc princomp is a statistical procedure that relies on the correlation matrix of a set of data. When I run proc princomp on my data, it returns the correlation matrix and eigenvectors relatively quickly. When I run proc corr on the same set of data, it takes over a day to finish.
PRINCOMP uses the Pearson correlation, and that should be computed the same way in both procs; it is very easy (and quick) to compute (only the SAS staff can tell you if it uses the exact same routines). CORR with the defaults also computes only Pearson correlations, but if you request the SPEARMAN option you also get a Spearman correlation, which is rank based. If your dataset is large, this can take a long time to compute (if it fails to build the rank matrix in memory, it uses disk).
Look at the section on computational details for CORR. It says, "... If M bytes are not available, PROC CORR must process the data multiple times to compute all the statistics."
I am working with a large dataset, but I am feeding the same dataset to both procs. Proc princomp can return with results in minutes. Proc corr can take several hours.
I understand that both procedures are computing the Pearson correlation; I'm just confused as to why one is so much faster than the other, all else being equal (i.e., same dataset, same statistic being computed).
If anything, princomp should take longer to run, since it requires more computation afterwards, including diagonalizing the matrix, which isn't trivial. I notice the same thing with other procs as well, such as proc varclus, which in theory is also based on the correlation matrix yet has much shorter run times than proc corr.
It isn't a big issue; any time I want a correlation matrix, I just use princomp to get it for me. Just curious.
Message was edited by: jwu1234
By default, PROC CORR uses pairwise deletion when observations contain missing values. PROC CORR includes all nonmissing pairs of values for each pair of variables in the statistical computations. Therefore, the correlation statistics might be based on different numbers of observations, and with p variables the PROC needs to examine each of the p(p-1)/2 pairs of variables separately.
If you specify the NOMISS option, PROC CORR uses listwise deletion. Listwise deletion is what PRINCOMP and other STAT procs use. It is faster because you can delete any observation that contains a missing value and you never have to deal with that observation again.
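To see the difference between the two deletion rules, here is a small sketch. The dataset name demo and the values in it are made up purely for illustration; only the PROC CORR statements and the NOMISS option come from the discussion above.

```sas
/* Hypothetical toy data: 5 observations, 3 variables, 2 missing values */
data demo;
  input x y z;
  datalines;
1 2 3
2 . 5
3 6 .
4 8 9
5 9 11
;
run;

/* Default: pairwise deletion.
   x-y uses 4 observations, x-z uses 4, y-z uses 3. */
proc corr data=demo;
  var x y z;
run;

/* NOMISS: listwise deletion.
   Only the 3 complete observations are used for every pair. */
proc corr data=demo nomiss;
  var x y z;
run;
```

With the default, each pair of variables has to keep its own counts and sums; with NOMISS, an incomplete observation is dropped once and a single set of sums serves every pair.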
So if you like the PRINCOMP way, you can use PROC CORR with the NOMISS option.
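A minimal sketch of that, assuming a dataset called work.mydata with variables x1-x50 (both names are placeholders, not from the original question). NOPRINT and OUTP= are optional; they suppress the printed output and save the Pearson matrix to a dataset instead.

```sas
/* work.mydata, x1-x50 and work.corr_out are placeholder names */
proc corr data=work.mydata nomiss noprint outp=work.corr_out;
  var x1-x50;
run;

/* The OUTP= dataset also holds MEAN, STD and N rows;
   keep only the correlation rows when printing. */
proc print data=work.corr_out;
  where _type_ = 'CORR';
run;
```

Whether this matches the PRINCOMP run time on your data is something you would have to check; the sketch only shows how to ask CORR for the same listwise treatment of missing values.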