WillTheKiwi
Pyrite | Level 9

I may be making some stupid error here, but I always thought the Pearson was biased low by sample size, and that the correction factor therefore increases the sample correlation to give an unbiased estimate of the population correlation. But when you use the Fisher option in proc corr, the bias correction (expressed as a tweak to the Fisher transformation) is a small positive value that is subtracted from the sample correlation when it should be added. See attached. I checked that simply invoking proc corr without the Fisher option gave the sample correlation, and I checked with CORREL() in Excel and got the same sample correlation. However, I am one grey hair short of Alzheimer's, so maybe it's just me. Help! I have to teach this next week.

 

Oh god, why can't we attach a screen shot as a jpg?  Fix this restriction, please.

 

results.png

13 REPLIES
Reeza
Super User

You can attach JPGs using the Photos button; the image is then embedded in your post rather than added as an attachment, which is preferable.

 

Can you post sample data so we can replicate the issue? Did you confirm the correlations in another package? Personally, I wouldn't use Excel to calculate averages.

 

 

ChrisHemedinger
Community Manager

I included the image for you. As Reeza said, use the Photos button in the editor to add images. We discourage image attachments because they aren't as easy to read. See more info here about this recent change.

WillTheKiwi
Pyrite | Level 9

I accepted the solution about including images, thank you, but the correction for bias in the Pearson is still a problem I hope someone will confirm or solve.

ChrisHemedinger
Community Manager

Let's reserve the solution for the answer to your original question 🙂

 

Here's a timely article from @Rick_SAS about rank correlation -- it might not be your answer, but it provides some insight into the correlation methods.

Rick_SAS
SAS Super FREQ

I assume you are talking about the bias in bivariate normal samples. The documentation for PROC CORR defines the bias adjustment function bias(r) = r / (2*(n-1)).  That is the value that you are seeing in the table under the "Bias Adjustment" header. For example, 

proc corr data=sashelp.class nosimple fisher;
var height weight;
run;

shows a "Bias Adjustment" of r/(2*(n-1)) = 0.87779 / (2*(19-1)) = 0.02438.  (Note that this is NOT an estimate of the quantity E[r] - rho, which might be the source of your confusion.)

 

The section "Confidence Limits for the Correlation" describes how the bias(r) adjustment is subtracted from the estimate in the Fisher-transformed coordinates and then the estimate and CI is back- transformed back into [-1, 1].

 

WillTheKiwi
Pyrite | Level 9

Thanks for the suggestion, but it doesn't solve the problem. I went back to my own files and website. In my own files I found I had done simulations years ago to show that the Pearson is biased low. It was the Pearson produced by proc corr, which is the one shown as the Sample Correlation in the listing. The correlation shown as the Correlation Estimate should be higher, not lower. My simulations also showed that the corresponding intraclass correlation coefficient, when the two variables are repeated measurements, has even worse bias, which surprised me, as I expected a variance divided by a variance to be unbiased.

In the reliability spreadsheet at my website http://sportsci.org I have this comment in one of the cells: "The Pearson and intraclass correlation are biased low. The factor to correct the Pearson is 1 + (1-r^2)/(2(n-3)), where n is the sample size. Olkin, I., & Pratt, J.W. (1958). Unbiased estimation of certain correlation coefficients. Annals of Mathematical Statistics, 29, 201-211."
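For reference, here is a minimal DATA step sketch of the quoted correction factor, applied to the sashelp.class numbers from earlier in the thread (r = 0.87779, n = 19). It just evaluates the quoted formula and applies it multiplicatively (which is how I read "factor"); it is not anything PROC CORR computes.

/* Olkin & Pratt-style correction factor, as quoted above          */
data olkin_pratt;
   r = 0.87779;  n = 19;
   factor = 1 + (1 - r**2) / (2*(n-3));
   r_adj  = r * factor;          /* corrected (larger) correlation */
   put r= factor= r_adj=;
run;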

 

So I'm afraid there is an outright error in the Fisher option for proc corr: the correction goes the wrong way. Can someone from SAS please deal with this? Thanks.

Rick_SAS
SAS Super FREQ

I understand your argument and it sounds correct. I will discuss the issue with a colleague who knows more about this than I do. 

Rick_SAS
SAS Super FREQ

I looked at this problem more closely over the weekend and I am no closer to understanding it. I wrote a simulation in SAS that reproduces what you report: the bias is WORSE when you subtract the bias adjustment factor (as PROC CORR currently does) and BETTER when you add the bias adjustment factor (as you suggest).
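Here is a rough sketch of that kind of simulation (not the exact program), with assumed values rho = 0.5, n = 20, and 10,000 samples. It computes r for each sample, forms tanh(atanh(r) - b) and tanh(atanh(r) + b) with b = r/(2(n-1)), and compares the averages with rho.

/* Simulate bivariate normal samples with known rho                */
%let rho  = 0.5;
%let n    = 20;
%let nsim = 10000;

data sim;
   call streaminit(123);
   do sample = 1 to &nsim;
      do i = 1 to &n;
         x = rand('NORMAL');
         y = &rho*x + sqrt(1 - &rho**2)*rand('NORMAL');
         output;
      end;
   end;
run;

/* Correlation of (x,y) within each sample                         */
proc corr data=sim noprint
          outp=corrs(where=(_TYPE_='CORR' and _NAME_='x'));
   by sample;
   var x y;
run;

data est;
   set corrs;
   r = y;                            /* corr(x,y) for this sample   */
   b = r / (2*(&n - 1));             /* bias adjustment             */
   r_minus = tanh(atanh(r) - b);     /* subtract, as PROC CORR does */
   r_plus  = tanh(atanh(r) + b);     /* add, as suggested           */
run;

/* Compare each mean with rho = 0.5                                 */
proc means data=est mean maxdec=4;
   var r r_minus r_plus;
run;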

 

On the other hand, I also found two textbooks that say that z = artanh(r) should be DECREASED (towards zero) by subtracting the bias adjustment. One is Anderson (cited by the PROC CORR doc) and the other is Snedecor and Cochran (Statistical Methods, 6th Ed). You can see a PDF version of S&C at 

http://krishikosh.egranth.ac.in/bitstream/1/2061674/2/IISR-127.pdf

If you go to p. 187-188, it says

“Fisher pointed out that there is a small bias in z, each being too large by rho/(2(n-1)). …The average r may be substituted for rho; then the approximate bias for each z may be deducted…. This will decrease the estimated r.”

 

I  have discussed this issue with the PROC CORR developer and with Technical Support. If I get further information I will post it here.

Rick_SAS
SAS Super FREQ

Okay, I believe I have the resolution of this problem. I have discussed it with several colleagues in R&D, testing, and technical support, and I think we figured it out.  I hope I can explain it correctly.

 

First, there is no bug. The bias adjustment that PROC CORR computes when you use the FISHER option is the correct adjustment that Fisher proposed. See p. 208-209 of Fisher's book; I also found the same information in Anderson (p. 123) and Snedecor & Cochran (6th ed, p. 187-188). The primary use of the Fisher transform followed by the bias adjustment is to do hypothesis tests and to form confidence intervals for the population correlation. Both theory and simulation indicate that the results of PROC CORR for these two analyses are correct at the approximate 1-alpha level.

 

The secondary use of the Fisher transformation and bias adjustment is to estimate rho, the population correlation for bivariate normal data. This is the use-case that the OP asked about because his (and my original) simulation seems to indicate that the bias-adjusted correlation coefficient is incorrect.

 

Fisher, Snedecor & Cochran, and the PROC CORR documentation (second-to-last paragraph) each have a sentence or two about "using the average correlation across groups" to estimate rho, but I didn't appreciate what those sources were saying because it is a subtle fact. Briefly, in a Monte Carlo simulation we should be averaging values in z-coordinates, not r-coordinates.

 

Let z = f(r) = arctanh(r) be Fisher's transformation, which maps the sampling distribution of r into the normal distribution. Let z0 = f(rho) be the image of the population correlation. Fisher noticed that the distribution of z is approximately normal, but mean(z) is not z0. Instead, there is a bias: mean(z) is larger than z0 by a (first-order) amount b = rho/(2(n-1)). Therefore we should set z_adj = z - b so that mean(z_adj) = mean(z) - b, which is approximately z0. If you back-transform mean(z_adj), you obtain the bias-adjusted estimate of rho. If tanh(z) is the inverse transformation, then an estimate for rho is tanh(mean(z_adj)).

 

So how might we estimate rho by using a Monte Carlo simulation? What the OP and I originally did was to compute the mean of the individual bias-adjusted estimates in the "r coordinates." That is, we estimated rho by using mean( tanh(z_adj) ) for many values of z_adj. THIS IS NOT CORRECT.  In terms of expectation, we estimated E[tanh(z_adj)], whereas the correct computation is tanh( E[z_adj] ). Because tanh is a nonlinear function, the expectations are different. 
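A toy check with made-up numbers shows why the order matters for a nonlinear transform like tanh:

/* Averaging before vs after the nonlinear tanh transform          */
data _null_;
   z1 = 0.2;  z2 = 1.4;                      /* two arbitrary z-values */
   tanh_of_mean = tanh((z1 + z2)/2);         /* transform the average  */
   mean_of_tanh = (tanh(z1) + tanh(z2))/2;   /* average the transforms */
   put tanh_of_mean= mean_of_tanh=;          /* about 0.664 vs 0.541   */
run;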

 

I have attached a simulation that shows that if you compute tanh(mean(z - b)), then you get a reasonable estimate of rho. The image below shows the result of the simulation. The blue line is the simulated values of rho-E[r], which shows the known bias in r. The red line shows the simulated values of rho - tanh( E[z_BAdj] ), where z_BAdj is z-BiasAdj.

 

SGPlot6.png

 

 

 

Incidentally, we are not the first people who explored the best way to estimate rho. See Corey, Dunlap, and Burke (1998) "Averaging Correlations: Expected Values and Bias in Combined Pearso... and the many references therein. They did not use the bias adjustment but still conclude (p. 160) "there is a consistent advantage associated with using [the average z, back-transformed]."

 

sld
Rhodochrosite | Level 12

Bravo! An above-and-beyond forum response. Thanks, Rick and SAS colleagues.

WillTheKiwi
Pyrite | Level 9

I think the bravo might be premature. In my understanding, sample-size bias in a statistic is simply this: the mean of the statistic in samples of a given size is not equal to the population (very-large-sample) value of the statistic. That is, the sense of magnitude you would get by looking at a lot of samples would be different from the true, population, or very-large-sample magnitude. Hence, for example, the sample standard deviation is biased low, because the mean of the SDs of samples of a given size is less than the population SD: with really small sample sizes, you can actually get the impression that the SD is a bit too small. Surely exactly the same thing applies to the Pearson correlation coefficient?

It's not a question of what transformation you apply to it before you consider whether the transformed value is biased. I can transform the SD by squaring it. The resulting variance is unbiased: the mean of a lot of small-sample variances is unbiased. When I back-transform the mean of the variances, I am back to a biased statistic, but there is much less bias, because the sample size is much bigger.

The bottom line is that the Pearson correlation coefficient, as observed in samples of a finite size, is biased low. Isn't that the end of the story? When SAS shows a "correlation estimate" that is less than the sample correlation, it is quite simply wrong. I submit that the authors of the papers cited here have actually misunderstood what small-sample bias is all about. The original authors, Olkin & Pratt (1958), got it right.
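(If anyone wants to check the SD part of this argument numerically, here is a small sketch, assuming normal data with population SD = 1 and a sample size of 5: the mean of the sample SDs comes out below 1, while the mean of the sample variances is about 1.)

/* Small-sample bias check: SD vs variance                          */
%let n    = 5;
%let nsim = 50000;

data samples;
   call streaminit(789);
   do sample = 1 to &nsim;
      do i = 1 to &n;
         x = rand('NORMAL', 0, 1);   /* population SD = 1            */
         output;
      end;
   end;
run;

proc means data=samples noprint;
   by sample;
   var x;
   output out=stats std=sd var=variance;
run;

/* Mean SD is noticeably below 1; mean variance is close to 1       */
proc means data=stats mean maxdec=3;
   var sd variance;
run;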

Rick_SAS
SAS Super FREQ

I am not an expert on this topic and I do not speak for SAS.

 

 

I think the source of the confusion in this thread centers on the purpose of the "Correlation Estimate" column in the "FisherPearsonCorr" table. I think the entry in the table is merely a computational check: it is the inverse-transform of the point in z-coordinates that was used to form the confidence interval: tanh(z - biasAdj). That's all it is. The FISHER option does not produce an unbiased estimate of rho, only a CI and a hypothesis test.

 

I didn't see anything in the doc that claims the number is a better estimate of rho than r, but if I'm wrong then please point it out (screen capture?) so that I can tell the developer that the doc should be corrected. The only place where I saw a mention of an improved estimate of rho is the section that says that if you have multiple independent samples, you can obtain a better estimate for rho by using tanh(E[z]). (You could, of course, also use the DATA step to implement the Olkin and Pratt adjustment.)

 

I think that the title of that column ("Correlation Estimate") is confusing and should be changed. A reasonable person is likely to assume that the number is a bias-adjusted estimate of rho. I will mention this to the developer and see if (1) the documentation can be updated to clarify this issue, and (2) the title of that column can be changed to something less confusing. The upcoming release of SAS 9.4M5 is already frozen, but hopefully this issue can be clarified in SAS 9.5 (or whatever the 2018-2019 release is called).

 

Best wishes, and thanks for raising this issue. It has been interesting.

Juggler_IN
Calcite | Level 5

With r = 0.923470, the bias adjustment as per the Keeping method is 0.024302. If one subtracts this adjustment from r, one gets r = 0.899168 (0.923470 - 0.024302). But as per the screenshot, the correlation estimate is r = 0.91981. I couldn't understand the correlation estimate value reported. Am I missing something in the calculation?
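A quick check, assuming the estimate is formed in z-coordinates as Rick describes above (tanh(atanh(r) - biasAdj)) rather than by subtracting the adjustment from r directly:

/* Apply the adjustment in z-coordinates, then back-transform       */
data _null_;
   r = 0.923470;  biasAdj = 0.024302;
   est = tanh(atanh(r) - biasAdj);   /* about 0.91981                */
   put est=;
run;

That reproduces the 0.91981 in the screenshot; the 0.899168 comes from subtracting the adjustment in r-coordinates, which is not what PROC CORR does.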

