PROC CORR: Calculating the Spearman Partial Correl...
05-30-2014 12:54 AM

Dear Community,

Given 3 continuous variables, X, Y, and Z, the partial correlation between X and Y while controlling for Z can be calculated in the following steps:

1) Perform linear regression with X as the response and Z as the predictor. Denote the residuals from this regression as Rx.

2) Perform linear regression with Y as the response and Z as the predictor. Denote the residuals from this regression as Ry.

3) Calculate the correlation between Rx and Ry. This is the partial correlation between X and Y while controlling for Z.
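The three steps above can be sketched in plain Python and checked against the closed-form identity r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)*(1 - r_yz^2)). The data and helper names below are made up purely for illustration:

```python
# Sketch of the three-step residual method for the partial correlation of
# X and Y controlling for Z, verified against the closed-form identity.
# Pure Python; the data are illustrative, not from any real study.
from math import sqrt

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

def residuals(resp, pred):
    """Residuals from the simple linear regression of resp on pred."""
    n = len(resp)
    mr, mp = sum(resp) / n, sum(pred) / n
    slope = (sum((p - mp) * (r - mr) for p, r in zip(pred, resp))
             / sum((p - mp) ** 2 for p in pred))
    return [r - (mr + slope * (p - mp)) for p, r in zip(pred, resp)]

x = [2.1, 3.4, 1.8, 4.0, 2.9, 3.7]
y = [1.0, 2.2, 0.7, 2.9, 1.6, 2.5]
z = [0.5, 1.1, 0.4, 1.6, 0.9, 1.2]

rx = residuals(x, z)                 # Step 1: residuals of x on z
ry = residuals(y, z)                 # Step 2: residuals of y on z
partial_resid = pearson(rx, ry)      # Step 3: correlate the residuals

rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
partial_formula = (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

print(partial_resid, partial_formula)  # the two routes agree
```

The residual route and the closed-form formula are algebraically identical for the Pearson case, which is a useful sanity check before swapping in rank-based variants.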

The usual way of doing Step #3 is to use the Pearson correlation coefficient. My question DOES NOT concern this usual way, because I am interested in calculating partial correlation for data with outliers or for non-normal Rx/Ry.

There are 2 other ways to calculate partial correlation that can overcome outliers or non-normal residuals, and I'm trying to determine which of these is better.

Method A:

- Perform Steps #1-2 (i.e., the regressions) on the ranks of the data rather than the data themselves.

- Then, perform Step #3 using the Pearson correlation coefficient.

Method B:

- Perform Steps #1-2 (i.e., the regressions) in the usual way, on the data themselves.

- Then, perform Step #3 using the Spearman correlation coefficient.
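For concreteness, here is a hypothetical pure-Python sketch of both methods; the midrank helper mimics the usual average-rank treatment of ties, and the data are invented for illustration only:

```python
# Method A: rank first, then regress and take the Pearson correlation of
# the residuals. Method B: regress the raw data, then take the Spearman
# correlation (= Pearson on ranks) of the residuals. Illustrative data.
from math import sqrt

def rank(v):
    """Midranks: tied values receive the average of the ranks they span."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

def residuals(resp, pred):
    n = len(resp)
    mr, mp = sum(resp) / n, sum(pred) / n
    slope = (sum((p - mp) * (r - mr) for p, r in zip(pred, resp))
             / sum((p - mp) ** 2 for p in pred))
    return [r - (mr + slope * (p - mp)) for p, r in zip(pred, resp)]

x = [1.2, 2.5, 3.1, 4.8, 2.2, 5.2, 3.9, 1.7]
y = [0.9, 2.1, 2.8, 4.6, 1.9, 3.5, 4.1, 1.2]
z = [0.3, 1.0, 1.4, 2.2, 0.8, 1.8, 2.5, 0.5]

# Method A: everything is ranked before the regressions are run.
method_a = pearson(residuals(rank(x), rank(z)),
                   residuals(rank(y), rank(z)))

# Method B: regress the raw data, then rank only the residuals.
method_b = pearson(rank(residuals(x, z)), rank(residuals(y, z)))

print(method_a, method_b)
```

The two estimates generally differ, which is exactly why the question of which has lower MSE is worth asking.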

My question to you: Which is better - Method A or Method B? PROC CORR uses Method A.

Perhaps a more specific way to phrase my question is: Which achieves a lower mean-squared error (MSE) - Method A or Method B? Recall that the MSE of a point estimator, theta-hat, is

MSE(theta-hat) = [Bias(theta-hat)]^2 + Variance(theta-hat)
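The decomposition is easy to verify by simulation. Below is a small, self-contained check using a deliberately biased estimator (a shrunken sample mean) of a known population mean; it is purely illustrative and not tied to either Method A or Method B:

```python
# Numerical check of MSE(theta_hat) = Bias(theta_hat)^2 + Var(theta_hat),
# using a shrunken sample mean so the bias term is nonzero.
import random

random.seed(1)
theta = 2.0                                    # true parameter value

def theta_hat(sample):
    return 0.9 * (sum(sample) / len(sample))   # shrinkage introduces bias

reps = 20000
estimates = []
for _ in range(reps):
    sample = [random.gauss(theta, 1.0) for _ in range(10)]
    estimates.append(theta_hat(sample))

mean_est = sum(estimates) / reps
bias = mean_est - theta
variance = sum((e - mean_est) ** 2 for e in estimates) / reps
mse = sum((e - theta) ** 2 for e in estimates) / reps

print(bias ** 2 + variance, mse)   # agree up to floating-point rounding
```

The same Monte Carlo scaffolding, with the two partial-correlation estimators plugged in and a known true partial correlation, would be one way to answer the Method A vs. Method B question empirically.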

Thanks,

Eric


05-30-2014 10:03 AM

Hi Eric,

I think I am missing something. Doesn't calculating the Pearson correlation on ranks give the same result as the Spearman correlation? If that is the case, then Method A is certainly more robust to outliers and possibly to distributional assumptions. However, I hesitate to say which will result in a lower MSE.
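That equivalence is easy to confirm numerically. A quick pure-Python check with made-up, tie-free data (the simple rank helper below assumes no ties):

```python
# With no ties, Pearson computed on the ranks equals Spearman's rho,
# here obtained via the classic 1 - 6*sum(d^2) / (n*(n^2 - 1)) formula.
from math import sqrt

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

def rank(v):
    s = sorted(v)
    return [s.index(a) + 1 for a in v]   # adequate here: no tied values

x = [10.0, 3.5, 7.2, 100.0, 1.1]
y = [2.0, 0.5, 1.4, 0.2, 9.0]

rx, ry = rank(x), rank(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
n = len(x)
spearman = 1 - 6 * d2 / (n * (n * n - 1))

print(pearson(rx, ry), spearman)   # both are -0.6 here
```

With ties the shortcut formula needs a correction, but Pearson on midranks still gives Spearman's rho, so the identity Steve points to holds in general.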

Steve Denham


05-30-2014 10:31 AM

I bet on Method A, since its residuals come from ranked data and so keep the robustness of the Spearman rank correlation.


05-30-2014 01:19 PM

Thanks, Ksharp and Steve.

Just to add my thoughts, I don't like Method A because it reduces the data to ranks BEFORE the regression is done. Method B performs the regression on the full data, so more information is retained.

However, I'm still stuck on my original question: Which method is better?


05-30-2014 01:55 PM

I'll agree that B retains more information. However, it is much more sensitive to outliers and, especially in smaller datasets, can lead to completely spurious results. Consider the following:

data whass;
input x y z;
datalines;
1 4 3
2 3 4
3 2.2 5.6
4 1 7
;
run;

Note that the correlation between Rx and Ry is negative. Now suppose a data entry error was made, and that last line was 4 1000 7000 (somebody dropped a decimal point). Now that correlation is positive and very strong. However, if you transform to ranks before calculating Ry, it is still positive, but everything moves closer to zero, which you have to admit is closer to the true situation than what was found with the outlier values included.

The regression coefficients are amazingly dependent on extreme values, whether influential or high-leverage points. If your data are moderately contaminated, or come from a highly skewed distribution, these points can easily produce counterintuitive results.
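A hypothetical pure-Python rendering of the four-row example above makes the clean-data case easy to reproduce (the helpers are illustrative, not PROC CORR internals):

```python
# Regress x on z and y on z for the four-row example, then correlate the
# residuals. With the clean data the partial correlation is negative.
from math import sqrt

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

def residuals(resp, pred):
    n = len(resp)
    mr, mp = sum(resp) / n, sum(pred) / n
    slope = (sum((p - mp) * (r - mr) for p, r in zip(pred, resp))
             / sum((p - mp) ** 2 for p in pred))
    return [r - (mr + slope * (p - mp)) for p, r in zip(pred, resp)]

x = [1, 2, 3, 4]
y = [4, 3, 2.2, 1]
z = [3, 4, 5.6, 7]

partial_clean = pearson(residuals(x, z), residuals(y, z))
print(partial_clean)   # negative for the clean data
# Replace the last row with 4, 1000, 7000 and re-run to see how strongly a
# single bad row moves the estimate.
```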

Steve Denham


05-31-2014 01:19 AM

Agree with Doc Steve. If there are no outliers, I would definitely choose B.

Xia Keshan
