05-30-2014 12:54 AM
Given 3 continuous variables, X, Y, and Z, the partial correlation between X and Y while controlling for Z can be calculated in the following steps:
1) Perform linear regression with X as the response and Z as the predictor. Denote the residuals from this regression as Rx.
2) Perform linear regression with Y as the response and Z as the predictor. Denote the residuals from this regression as Ry.
3) Calculate the correlation between Rx and Ry. This is the partial correlation between X and Y while controlling for Z.
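For concreteness, the three steps above can be sketched in a few lines. This is a Python/NumPy sketch rather than SAS, and the function name `partial_corr` is mine:

```python
import numpy as np

def partial_corr(x, y, z):
    """Partial correlation of x and y controlling for z,
    via the three-step residual method described above."""
    design = np.column_stack([np.ones_like(z), z])  # intercept + z
    # Step 1: residuals Rx from regressing x on z
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    # Step 2: residuals Ry from regressing y on z
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    # Step 3: Pearson correlation of the two residual vectors
    return np.corrcoef(rx, ry)[0, 1]
```

This agrees exactly with the closed-form expression (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)).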
The usual way of doing Step #3 is to use the Pearson correlation coefficient. My question DOES NOT concern this usual way, because I am interested in calculating partial correlation for data with outliers or for non-normal Rx/Ry.
There are 2 other ways to calculate partial correlation that can overcome outliers or non-normal residuals, and I'm trying to determine which of these is better.
Method A:
- Perform Steps #1-2 (i.e. the regression) with the ranks of the data rather than the data themselves.
- Then, perform Step #3 using the Pearson correlation coefficient.

Method B:
- Perform Steps #1-2 (i.e. the regression) in the usual way with the data.
- Then, perform Step #3 using the Spearman correlation coefficient.
My question to you: Which is better - Method A or Method B? PROC CORR uses Method A.
Perhaps a more specific way to phrase my question is: Which achieves a lower mean-squared error (MSE) - Method A or Method B? Recall that the MSE of a point estimator, theta-hat, is
MSE(theta-hat) = [Bias(theta-hat)]^2 + Variance(theta-hat)
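One way to attack the MSE question empirically is a small simulation: implement both methods, generate data from a model where the true partial correlation is known, and average the squared error over many replicates. Below is a Python/SciPy sketch (not SAS; the names `_residuals`, `method_a`, `method_b`, and `mc_mse` are mine), using the simplest case where x and y are conditionally independent given z, so the true partial correlation is 0:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def _residuals(resp, pred):
    """Residuals from an intercept-plus-slope least-squares fit."""
    design = np.column_stack([np.ones_like(pred), pred])
    return resp - design @ np.linalg.lstsq(design, resp, rcond=None)[0]

def method_a(x, y, z):
    """Rank everything first, then regress; Pearson on the residuals."""
    rx = _residuals(rankdata(x), rankdata(z))
    ry = _residuals(rankdata(y), rankdata(z))
    return np.corrcoef(rx, ry)[0, 1]

def method_b(x, y, z):
    """Regress the raw data; Spearman on the residuals."""
    rho, _ = spearmanr(_residuals(x, z), _residuals(y, z))
    return rho

def mc_mse(estimator, n=30, reps=500, seed=2014):
    """Monte Carlo MSE when the true partial correlation is 0
    (x and y conditionally independent given z)."""
    rng = np.random.default_rng(seed)
    sq_err = 0.0
    for _ in range(reps):
        z = rng.normal(size=n)
        x = z + rng.normal(size=n)
        y = -z + rng.normal(size=n)
        sq_err += estimator(x, y, z) ** 2  # (estimate - 0)^2
    return sq_err / reps

print(mc_mse(method_a), mc_mse(method_b))
```

Swapping the normal errors for heavy-tailed or contaminated ones in `mc_mse` would show how the ranking between the two methods changes with outliers.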
05-30-2014 10:03 AM
I think I am missing something. Doesn't calculating the Pearson correlation on ranks give the same result as the Spearman correlation? If that is the case, then Method A is certainly more robust to outliers and possibly to distributional assumptions. However, I hesitate to say which will result in a lower MSE.
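As a quick check of that identity (a Python/SciPy sketch; for untied data, Spearman's rho is by definition the Pearson correlation of the ranks):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(0)
a = rng.normal(size=20)          # continuous draws, so no ties
b = a + rng.normal(size=20)

r_on_ranks, _ = pearsonr(rankdata(a), rankdata(b))
rho, _ = spearmanr(a, b)
print(abs(r_on_ranks - rho))     # agrees to floating-point precision
```

Note, though, that Method A is not literally Spearman: its Step #3 Pearson is taken on the residuals of the rank regressions, not on the ranks themselves.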
05-30-2014 01:19 PM
Thanks, Ksharp and Steve.
Just to add my thoughts, I don't like Method A because it reduces the data to ranks BEFORE the regression is done, discarding information. Method B uses the full data to perform the regression, so more information is retained.
However, I'm still stuck on my original question: Which method is better?
05-30-2014 01:55 PM
I'll agree that B retains more information. However, it is much more sensitive to outliers and, in smaller datasets especially, can lead to completely spurious results. Consider the following:
data have;
input x y z;
datalines;
1 4 3
2 3 4
3 2.2 5.6
4 1 7
;
run;
Note that the fitted relationship between y and z is negative. Now suppose a data entry error was made, and that last line became 4 1000 7000 (somebody dropped a decimal point). Now the y-z relationship is positive, and a very strong correlation is found. However, if you transform to ranks first, the correlation is still positive, but everything moves closer to zero, which you have to admit is closer to the true situation than what was found with the outlier values included. The regression coefficient is remarkably dependent on extreme values, whether influential or high-leverage points. If your data are moderately contaminated, or come from a highly skewed distribution, these points can easily produce counterintuitive results.
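Putting numbers on this example (a Python/SciPy sketch; the variable names are mine):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

z_clean = np.array([3.0, 4.0, 5.6, 7.0])
y_clean = np.array([4.0, 3.0, 2.2, 1.0])
z_bad = np.array([3.0, 4.0, 5.6, 7000.0])  # decimal point dropped
y_bad = np.array([4.0, 3.0, 2.2, 1000.0])  # decimal point dropped

r_clean, _ = pearsonr(y_clean, z_clean)  # strongly negative (about -0.99)
r_bad, _ = pearsonr(y_bad, z_bad)        # one bad row flips it to about +1.0
rho_bad, _ = spearmanr(y_bad, z_bad)     # on ranks: positive but only 0.2
print(r_clean, r_bad, rho_bad)
```

The rank-based value is still positive, but far closer to zero than the raw-data correlation with the outlier included.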