EricCai
Calcite | Level 5

Dear Community,

Given 3 continuous variables, X, Y, and Z, the partial correlation between X and Y while controlling for Z can be calculated in the following steps:

1) Perform linear regression with X as the response and Z as the predictor. Denote the residuals from this regression as Rx.

2) Perform linear regression with Y as the response and Z as the predictor. Denote the residuals from this regression as Ry.

3) Calculate the correlation between Rx and Ry. This is the partial correlation between X and Y while controlling for Z.
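In SAS terms, a minimal sketch of these three steps might look like the following (I'm assuming a data set called HAVE with variables x, y, and z):

/* Steps 1-2: regress x on z and y on z, keeping the residuals Rx and Ry */
proc reg data=have noprint;
  model x y = z;
  output out=resid r=rx ry;
run;

/* Step 3: correlate the residuals (Pearson by default) */
proc corr data=resid;
  var rx ry;
run;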

The usual way of doing Step #3 is to use the Pearson correlation coefficient. My question DOES NOT concern this usual way, because I am interested in calculating partial correlation for data with outliers or for non-normal Rx/Ry.

There are 2 other ways to calculate partial correlation that can handle outliers or non-normal residuals, and I'm trying to determine which of them is better.

Method A:

- Perform Steps #1-2 (i.e. the regression) with the ranks of the data rather than the data themselves.

- Then, perform Step #3 using the Pearson correlation coefficient.
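As a rough sketch (again assuming a data set HAVE with variables x, y, and z), Method A just adds a PROC RANK step in front of the same pipeline:

/* Rank the data first */
proc rank data=have out=have_ranked;
  var x y z;
  ranks rkx rky rkz;
run;

/* Steps 1-2 on the ranks */
proc reg data=have_ranked noprint;
  model rkx rky = rkz;
  output out=resid_a r=rx ry;
run;

/* Step 3: Pearson correlation of the rank residuals */
proc corr data=resid_a;
  var rx ry;
run;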

Method B:

- Perform Steps #1-2 (i.e. the regression) in the usual way with the data.

- Then, perform Step #3 using the Spearman correlation coefficient.
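The corresponding sketch for Method B (same assumed data set HAVE) regresses the raw data and then asks PROC CORR for the Spearman coefficient of the residuals:

/* Steps 1-2 on the raw data */
proc reg data=have noprint;
  model x y = z;
  output out=resid_b r=rx ry;
run;

/* Step 3: Spearman correlation of the residuals */
proc corr data=resid_b spearman;
  var rx ry;
run;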

My question to you: Which is better - Method A or Method B?  PROC CORR uses Method A.
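For reference, if I understand PROC CORR correctly, the single-call version of Method A is the SPEARMAN option combined with a PARTIAL statement:

proc corr data=have spearman;
  var x y;
  partial z;
run;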

Perhaps a more specific way to phrase my question is: Which achieves a lower mean-squared error (MSE) - Method A or Method B? Recall that the MSE of a point estimator, theta-hat, is

MSE(theta-hat) = [Bias(theta-hat)]^2 + Variance(theta-hat)
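One way I could try to answer this myself is by simulation. Below is a rough sketch; the data-generating model, coefficients, sample size, number of replicates, and seed are all arbitrary choices of mine, and I'm taking the Pearson partial correlation implied by the simulated normal model as the "true" theta (which is itself a judgment call, since the Spearman-type estimators target a slightly different population quantity):

%let nrep    = 500;       /* number of simulated data sets        */
%let n       = 30;        /* observations per simulated data set  */
%let truerho = 0.44721;   /* 0.5/sqrt(1.25) under the model below */

data sim;
  call streaminit(20150401);
  do rep = 1 to &nrep;
    do i = 1 to &n;
      z = rand('normal');
      x = z + rand('normal');
      y = z + 0.5*x + rand('normal');
      output;
    end;
  end;
run;

/* Ranks within each replicate (Method A works on these) */
proc rank data=sim out=simr;
  by rep;
  var x y z;
  ranks rkx rky rkz;
run;

/* Method B residuals: raw x and y regressed on raw z */
proc reg data=simr noprint;
  by rep;
  model x y = z;
  output out=resb r=resx_b resy_b;
run;

/* Method A residuals: ranked x and y regressed on ranked z */
proc reg data=simr noprint;
  by rep;
  model rkx rky = rkz;
  output out=resa r=resx_a resy_a;
run;

/* Method A estimate: Pearson correlation of the rank residuals */
proc corr data=resa outp=corra noprint;
  by rep;
  var resx_a resy_a;
run;

/* Method B estimate: Spearman correlation of the raw residuals */
proc corr data=resb spearman outs=corrb noprint;
  by rep;
  var resx_b resy_b;
run;

/* One estimate per replicate per method, then bias, variance, and MSE */
data est;
  merge corra(where=(_type_='CORR' and upcase(_name_)='RESX_A') rename=(resy_a=rhoA))
        corrb(where=(_type_='CORR' and upcase(_name_)='RESX_B') rename=(resy_b=rhoB));
  by rep;
  errA = rhoA - &truerho;   sqA = errA**2;
  errB = rhoB - &truerho;   sqB = errB**2;
  keep rep rhoA rhoB errA errB sqA sqB;
run;

proc means data=est mean var;
  var errA errB sqA sqB;   /* mean(err) = bias, var(err) = variance, mean(sq) = MSE */
run;

Swapping the normal error for something heavy-tailed, or injecting a few outliers into each replicate, would let me compare the two methods under the kind of contamination I actually care about.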

Thanks,

Eric

5 REPLIES
SteveDenham
Jade | Level 19

Hi Eric,

I think I am missing something.  Doesn't calculating the Pearson correlation on ranks give the same result as the Spearman correlation?  If that is the case, then Method A is certainly more robust to outliers and possibly to distributional assumptions.  However, I hesitate to say which will result in a lower MSE.
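A quick way to check that equivalence on any handy data set (SASHELP.CLASS here, purely as an example) is to compare the Spearman coefficient with the Pearson coefficient computed on the ranks; the two should agree:

proc corr data=sashelp.class spearman;
  var height weight;
run;

proc rank data=sashelp.class out=classrank ties=mean;  /* ties=mean is the default */
  var height weight;
  ranks rheight rweight;
run;

proc corr data=classrank pearson;
  var rheight rweight;
run;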

Steve Denham

Ksharp
Super User

I bet on Method A, since its residuals come from the ranks and therefore keep the power of the Spearman rank correlation.

EricCai
Calcite | Level 5

Thanks, Ksharp and Steve.

Just to add my thoughts: I don't like Method A because it reduces the data to ranks BEFORE the regression is done, which discards information. Method B uses the full data to perform the regression, so more information is retained.

However, I'm still stuck on my original question: Which method is better?

SteveDenham
Jade | Level 19

I'll agree that B retains more information.  However, it is much more sensitive to outliers and, especially in smaller data sets, can lead to completely spurious results.  Consider the following:

data whass;
  input x y z;
datalines;
1 4 3
2 3 4
3 2.2 5.6
4 1 7
;
run;

Note that the relationship between Y and Z in the Step 2 regression (the one that produces Ry) is negative.  Now suppose a data entry error was made, and that last line became 4 1000 7000 (somebody dropped a decimal point).  Now that relationship is positive, and a very strong correlation between Y and Z is found.  However, if you transform to ranks before running that regression, the relationship is still positive, but everything moves closer to zero, which you have to admit is closer to the true situation than what you get with the outlier values included.  The regression coefficient is amazingly dependent on extreme values, whether they act as influential points or high-leverage points.  If your data are moderately contaminated, or come from a highly skewed distribution, those points can easily produce counterintuitive results.
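To put some numbers on that, here is a quick ad hoc check of my own: type in the contaminated version of the data and compare the Pearson correlation of y and z with the Spearman correlation (i.e., the Pearson correlation of their ranks).  The Pearson coefficient is driven almost entirely by that one bad row, while the Spearman coefficient stays modest.

data whass_bad;
  input x y z;
datalines;
1 4 3
2 3 4
3 2.2 5.6
4 1000 7000
;
run;

proc corr data=whass_bad pearson spearman;
  var y z;
run;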

Steve Denham

Ksharp
Super User

I agree with Doc Steve. If there are no outliers, I would definitely choose B.

Xia Keshan

