Wrong adjustment for bias in Pearson correlation with proc corr

Posted 08-14-2017 07:55 PM

I may be making some stupid error here, but I always thought the Pearson correlation was biased low, by an amount that depends on sample size, and that the correction factor therefore increases the sample correlation to give an unbiased estimate of the population correlation. But when you use the FISHER option in proc corr, the bias correction (expressed as a tweak for the Fisher transformation) is a small positive value that is subtracted from the sample correlation when it should be added. See attached. I checked that simply invoking proc corr without FISHER gave the sample correlation, and I checked with CORREL() in Excel and got the same sample correlation. However, I am one grey hair short of Alzheimer's, so maybe it's just me. Help! I have to teach this next week.

Oh god, why can't we attach a screen shot as a jpg? Fix this restriction, please.

13 REPLIES

You can attach JPGs using the Photos button. The image is then embedded in your post rather than added as an attachment, which is preferable.

Can you post sample data so we can replicate the issue? Did you confirm the correlations in another package? Personally, I wouldn't use Excel even to calculate averages.

I included the image for you. As Reeza said, use the Photos button in the editor to add images. We discourage image types as attachments because they aren't as easy to read. See more info here about this recent change.

Let's reserve the solution for the answer to your original question 🙂

Here's a timely article from @Rick_SAS about rank correlation -- it might not be your answer, but it provides some insight into the correlation methods.

I assume you are talking about the bias in bivariate normal samples. The documentation for PROC CORR defines the bias adjustment function bias(r) = r / (2*(n-1)). That is the value that you are seeing in the table under the "Bias Adjustment" header. For example,

```
proc corr data=sashelp.class nosimple fisher;
var height weight;
run;
```

shows a "Bias Adjustment" of r/(2*(n-1)) = 0.87779 / (2*(19-1)) = 0.02438. (Note that this is NOT an estimate of the quantity E[r] - rho, which might be the source of your confusion.)

The section "Confidence Limits for the Correlation" describes how the bias(r) adjustment is subtracted from the estimate in the Fisher-transformed coordinates and then the estimate and CI are back-transformed into [-1, 1].
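
To make the arithmetic concrete, here is a minimal sketch of that recipe in Python (not SAS, so that it runs standalone). It takes r = 0.87779 and n = 19 from the example above. The function name `fisher_adjusted` is hypothetical, and the interval assumes the usual normal approximation with standard error 1/sqrt(n-3), which is my reading of the documented method rather than a quotation from it.

```python
import math
from statistics import NormalDist


def fisher_adjusted(r, n, alpha=0.05):
    """Bias-adjusted Fisher z estimate and CI (a sketch, not PROC CORR itself)."""
    z = math.atanh(r)                      # Fisher transformation of r
    bias_adj = r / (2 * (n - 1))           # the "Bias Adjustment" column
    z_adj = z - bias_adj                   # adjustment is SUBTRACTED in z-coordinates
    # Assumption: normal approximation with standard error 1/sqrt(n-3).
    half_width = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(n - 3)
    return {
        "z": z,
        "bias_adj": bias_adj,
        "corr_est": math.tanh(z_adj),              # back-transformed point
        "lower": math.tanh(z_adj - half_width),    # back-transformed CI limits
        "upper": math.tanh(z_adj + half_width),
    }


out = fisher_adjusted(r=0.87779, n=19)
for name, value in out.items():
    print(f"{name:>9}: {value:.5f}")
```

The `bias_adj` value reproduces the 0.02438 quoted above, and `corr_est` comes out slightly below r, which is exactly the behavior the original question is about.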

Thanks for the suggestion, but it doesn't solve the problem. I went back to my own files and website. In my own files I found I had done simulations years ago to show that the Pearson is biased low. It was the Pearson produced by proc corr, which is the one shown as the Sample Correlation in the listing. The correlation shown as the Correlation Estimate should be higher, not lower. My simulations also showed that the corresponding intraclass correlation coefficient, when the two variables are repeated measurements, has even worse bias, which surprised me, as I expected a variance divided by a variance to be unbiased. In the reliability spreadsheet at my website http://sportsci.org I have this comment in one of the cells: "The Pearson and intraclass correlation are biased low. The factor to correct the Pearson is 1 + (1-r^2)/(2(n-3)), where n is the sample size. Olkin, I., & Pratt, J.W. (1958). Unbiased estimation of certain correlation coefficients. Annals of Mathematical Statistics, 29, 201-211."

So I'm afraid there is an outright error in the Fisher option for proc corr. The correction goes the wrong way. Can someone from SAS please deal with this? Thanks.
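
For reference, the correction factor quoted from the spreadsheet can be applied in a couple of lines. This is a minimal Python sketch of that approximate formula only, not the exact Olkin and Pratt estimator. The function name `olkin_pratt_approx` is hypothetical, and the inputs are the r and n from the sashelp.class example earlier in the thread.

```python
def olkin_pratt_approx(r, n):
    """Multiply r by the approximate factor 1 + (1 - r^2) / (2 (n - 3))."""
    if n <= 3:
        raise ValueError("the factor is only defined for n > 3")
    factor = 1 + (1 - r * r) / (2 * (n - 3))
    return r * factor


r, n = 0.87779, 19
adjusted = olkin_pratt_approx(r, n)
print(f"sample r = {r:.5f}, adjusted = {adjusted:.5f}")
```

Because the factor is always at least 1 for |r| <= 1, this adjustment moves the estimate away from zero, the opposite direction from the Correlation Estimate reported by the FISHER option.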

I looked at this problem more closely over the weekend and I am no closer to understanding it. I wrote a simulation in SAS that reproduces what you report, which is that the bias is WORSE when you subtract the bias adjustment factor (as PROC CORR currently does) and is BETTER if you add the bias adjustment factor (as you suggest).

On the other hand, I also found two textbooks that say that z = artanh(r) should be DECREASED (towards zero) by subtracting the bias adjustment. One is Anderson (cited by the PROC CORR doc) and the other is Snedecor and Cochran (*Statistical Methods*, 6th Ed.). You can see a PDF version of S&C at

http://krishikosh.egranth.ac.in/bitstream/1/2061674/2/IISR-127.pdf

If you go to p. 187-188, it says

“Fisher pointed out that there is a small bias in z, each being too large by rho/(2(n-1)). …The average r may be substituted for rho; then the approximate bias for each z may be deducted…. This will decrease the estimated r.”

I have discussed this issue with the PROC CORR developer and with Technical Support. If I get further information I will post it here.

Okay, I believe I have the resolution of this problem. I have discussed it with several colleagues in R&D, testing, and technical support, and I think we figured it out. I hope I can explain it correctly.

First, there is no bug. The bias adjustment that PROC CORR computes when you use the FISHER option is the correct adjustment that Fisher proposed. See pp. 208-209 of Fisher's book, and I also found the same information in Anderson (p. 123) and Snedecor & Cochran (6th ed., pp. 187-188). The primary use of the Fisher transform followed by the bias adjustment is to do hypothesis tests and to form confidence intervals for the population correlation. Both theory and simulation indicate that the results of PROC CORR for these two analyses are correct at approximately the nominal 1-alpha level.

The secondary use of the Fisher transformation and bias adjustment is to estimate rho, the population correlation for bivariate normal data. This is the use-case that the OP asked about because his (and my original) simulation seems to indicate that the bias-adjusted correlation coefficient is incorrect.

Fisher, Snedecor & Cochran, and the PROC CORR documentation (second to last paragraph) each have a sentence or two about "using the average correlation across groups" to estimate rho, but I didn't appreciate what those sources were saying because it is a subtle fact. Briefly, in a Monte Carlo simulation we should be averaging values in z-coordinates, not r-coordinates.

Let z = f(r) = arctanh(r) be Fisher's transformation, which maps the sampling distribution of r to an approximately normal distribution. Let z0 = f(rho) be the image of the population correlation. Fisher noticed that the distribution of z is approximately normal, but mean(z) is not z0. Instead, there is a bias: mean(z) is larger than z0 by a (first-order) amount b = rho/(2(n-1)). Therefore we should set z_adj = z - b so that mean(z_adj) = mean(z) - b, which is approximately z0. If you back-transform mean(z_adj), you obtain the bias-adjusted estimate of rho. Because tanh is the inverse transformation, an estimate for rho is tanh(mean(z_adj)).

So how might we estimate rho by using a Monte Carlo simulation? What the OP and I originally did was to compute the mean of the individual bias-adjusted estimates in the "r coordinates." That is, we estimated rho by using mean( tanh(z_adj) ) for many values of z_adj. THIS IS NOT CORRECT. In terms of expectation, we estimated **E**[tanh(z_adj)], whereas the correct computation is tanh( **E**[z_adj] ). Because tanh is a nonlinear function, the expectations are different.
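
One way to see the size of the gap is a second-order (delta-method) expansion of tanh around the mean of z_adj. This is a standard approximation that I am adding for illustration, not something taken from the PROC CORR documentation. Here sigma^2 denotes Var(z_adj), assumed to be roughly 1/(n-3) for bivariate normal data.

```latex
\begin{aligned}
\mathrm{E}\bigl[\tanh(z_{\mathrm{adj}})\bigr]
  &\approx \tanh(\mu) + \tfrac{1}{2}\,\tanh''(\mu)\,\sigma^{2} \\
  &= \tanh(\mu) - \tanh(\mu)\,\bigl(1 - \tanh^{2}(\mu)\bigr)\,\sigma^{2},
  \qquad \mu = \mathrm{E}[z_{\mathrm{adj}}], \quad \sigma^{2} \approx \frac{1}{n-3}.
\end{aligned}
```

With tanh(mu) close to rho, the gap is about -rho(1-rho^2)/(n-3). For rho = 0.5 and n = 20 that is roughly -0.022, so averaging in r-coordinates pulls the estimate toward zero by more than the bias in r itself.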

I have attached a simulation that shows that if you compute tanh(mean(z - b)), then you get a reasonable estimate of rho. The image below shows the result of the simulation. The blue line is the simulated values of rho-**E**[r], which shows the known bias in r. The red line shows the simulated values of rho - tanh( **E**[z_BAdj] ), where z_BAdj is z-BiasAdj.
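
For readers without the attachment, the comparison can be sketched in a few lines of Python. This is not the attached SAS simulation, only a minimal standalone illustration under assumed settings (rho = 0.5, n = 20, 20,000 samples, fixed seed). The function name `simulate` and the dictionary keys are hypothetical. The bias adjustment uses each sample's own r, as PROC CORR does.

```python
import math
import random


def simulate(rho, n, reps, seed=1):
    """Compare averaging in r-coordinates with averaging in z-coordinates."""
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho * rho)
    sum_r = sum_tanh = sum_zadj = 0.0
    for _ in range(reps):
        sx = sy = sxx = syy = sxy = 0.0
        for _ in range(n):
            # Bivariate normal pair with population correlation rho.
            x = rng.gauss(0.0, 1.0)
            y = rho * x + c * rng.gauss(0.0, 1.0)
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        # Pearson r from corrected sums of squares and cross-products.
        r = (sxy - sx * sy / n) / math.sqrt(
            (sxx - sx * sx / n) * (syy - sy * sy / n)
        )
        z_adj = math.atanh(r) - r / (2 * (n - 1))  # z minus the bias adjustment
        sum_r += r
        sum_tanh += math.tanh(z_adj)  # back-transform each sample, then average
        sum_zadj += z_adj             # average in z-coordinates
    return {
        "mean_r": sum_r / reps,
        "mean_of_tanh": sum_tanh / reps,
        "tanh_of_mean": math.tanh(sum_zadj / reps),
    }


if __name__ == "__main__":
    result = simulate(rho=0.5, n=20, reps=20000)
    for name, value in result.items():
        print(f"{name:>13}: {value:.4f}")
```

In a run like this, `mean_of_tanh` (the incorrect r-coordinate average) should fall furthest below rho, `mean_r` should show the known downward bias of r, and `tanh_of_mean` should land closest to rho, up to Monte Carlo error.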

Incidentally, we are not the first people who explored the best way to estimate rho. See Corey, Dunlap, and Burke (1998) "Averaging Correlations: Expected Values and Bias in Combined Pearso... and the many references therein. They did not use the bias adjustment but still conclude (p. 160) "there is a consistent advantage associated with using [the average z, back-transformed]."

Bravo! An above-and-beyond forum response. Thanks, Rick and SAS colleagues.

I am not an expert on this topic and I do not speak for SAS.

I think the source of the confusion in this thread centers on the purpose of the “Correlation Estimate” column in the “FisherPearsonCorr” table. I think the entry in the table is merely a computational check: it is the inverse-transform of the point in z-coordinates that was used to form the confidence interval: tanh(z – biasAdj). That's all it is. The FISHER option does not produce an unbiased estimate of rho, only a CI and a hypothesis test.

I didn't see anywhere in the doc that claims that the number is a better estimate of rho than r, but if I'm wrong then please point it out (screen capture?) so that I can tell the developer that the doc should be corrected. The only place where I saw a mention of an improved estimate of rho is the section that says that if you have multiple independent samples, you can obtain a better estimate for rho by using tanh(E[z]). (You could, of course, also use the DATA step to implement the Olkin and Pratt adjustment.)

I think that the title of that column ("Correlation Estimate") is confusing and should be changed. A reasonable person is likely to assume that the number is a bias-adjusted estimate of rho. I will mention this to the developer and see if (1) the documentation can be updated to clarify this issue, and (2) the title of that column can be changed to something less confusing. The upcoming release of SAS 9.4M5 is already frozen, but hopefully this issue can be clarified in SAS 9.5 (or whatever the 2018-2019 release is called.)

Best wishes, and thanks for raising this issue. It has been interesting.
