What would be the Pearson correlation coefficient between x and x² if X can take only positive values (>0)?
Depends on the actual values of x and how many repetitions the individual x values might have.
You can play with the upper and lower bounds on X in the do loop below and examine the output to get a feel for it.
data work.junk;
  do x = 1 to 1000;
    x2 = x*x;
    output;
  end;
run;

proc corr data=work.junk pearson;
run;
A Spearman correlation would be 1.
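To see both statements side by side, here is a minimal sketch that rebuilds the same data set and requests the Pearson and Spearman coefficients in one PROC CORR call. The explicit VAR statement is my addition. Since x² is strictly increasing in x for x>0, the ranks of x and x2 coincide, so the Spearman coefficient should come out as 1, while the Pearson coefficient stays below 1 (roughly 0.97 for this range, if I am not mistaken).

```sas
/* Same data as in the previous step: x = 1, ..., 1000 and its square */
data work.junk;
  do x = 1 to 1000;
    x2 = x*x;
    output;
  end;
run;

/* Request both coefficients for the pair (x, x2) */
proc corr data=work.junk pearson spearman;
  var x x2;
run;
```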
Here X is an independent random variable and can take any positive value. In that case, what would be the Pearson correlation coefficient?
@CSDoot1 wrote:
Here X is an independent random variable and can take any positive value. In that case, what would be the Pearson correlation coefficient?
Without knowing the actual values of X there is no "closed form" answer to this question.
If you have not looked at the formula for the Pearson correlation coefficient, now is the time to do so.
If you don't understand the equation, then it is time to brush up on several of its elements.
Hi @CSDoot1,
@CSDoot1 wrote:
Here X is an independent random variable and can take any positive value. In that case, what would be the Pearson correlation coefficient?
As others have pointed out, the answer to your question depends on the data (if it's about the empirical correlation) or on the distribution of X (if it's about the random variables X and X²). For the latter case @Reeza has shown the relevant formula. It contains various expected values, so you see that the Pearson correlation coefficient rX,X² may not even exist for certain distributions of X (e.g. Cauchy distribution).
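For reference, substituting Y = X² into the usual definition Cov(X,Y)/(σ_X σ_Y) gives the population coefficient in terms of the first four moments about the origin. The notation r_{X,X²} is just my rendering of rX,X²; this is a sketch that assumes all four moments exist and that both variances are positive.

```latex
r_{X,X^2}
  = \frac{\operatorname{Cov}(X, X^2)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(X^2)}}
  = \frac{E[X^3] - E[X]\,E[X^2]}
         {\sqrt{\bigl(E[X^2] - E[X]^2\bigr)\bigl(E[X^4] - E[X^2]^2\bigr)}}
```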
Just for demonstration let's assume that X has a continuous distribution with a probability density f(x)>0 for all x>0 and f(x)=0 for all x<0 such that the first four moments about the origin (and hence the Pearson correlation coefficient) exist. Under these assumptions the set of possible values of rX,X² equals the open interval (0, 1) (i.e., for any value 0<q<1 there's a distribution such that rX,X²=q).
Here's why:
Our assumptions imply that the first four moments about the origin (i.e. E[Xⁿ] for n=1, 2, 3, 4) and also Var(X) are positive.
First, we see that rX,X²>0: The numerator in the formula, in our case: E[X³]−E[X]E[X²], is positive. This follows from the fact that E[X]²<E[X²] (because Var(X)>0), hence E[X]<√E[X²] and E[X]E[X²]<E[X²]^(3/2). The latter term is ≤E[X³] by Lyapounov's inequality (a special case of which states: E[X²]^(1/2) ≤ E[|X|³]^(1/3); see, e.g., Billingsley, Probability and Measure, 3rd ed., p. 81).
Second, rX,X²<1: Indeed, rX,X²=1 would imply P(X²=aX+b)=1 for some numbers a, b by a well known result found in textbooks on mathematical statistics (see, e.g., Casella, Berger: Statistical Inference, 2nd ed., Theorem 4.5.7b). But x²>ax+b for all x>k and a suitable k=k(a,b)>0 and we have P(X>k)>0 by our assumption on f.
It's easy to find distributions for which rX,X² takes large values (<1).
Example: The gamma distribution with parameters a, l > 0. A straight-forward calculation shows that in this case
rX,X²=sqrt((a+1)/(a+3/2))
(irrespective of scale parameter l) -- an expression whose values cover the range (√(2/3), 1), i.e. (0.816..., 1).
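The calculation can be sketched as follows, assuming a is the shape parameter and setting the scale to 1 (rescaling X multiplies X² by a constant as well, so the correlation does not change):

```latex
\begin{align*}
E[X^n] &= a(a+1)\cdots(a+n-1), \qquad n = 1, 2, 3, 4,\\
\operatorname{Cov}(X, X^2) &= a(a+1)(a+2) - a \cdot a(a+1) = 2a(a+1),\\
\operatorname{Var}(X) &= a(a+1) - a^2 = a,\\
\operatorname{Var}(X^2) &= a(a+1)(a+2)(a+3) - a^2(a+1)^2 = a(a+1)(4a+6),\\
r_{X,X^2} &= \frac{2a(a+1)}{\sqrt{a \cdot a(a+1)(4a+6)}}
           = \sqrt{\frac{4(a+1)}{4a+6}}
           = \sqrt{\frac{a+1}{a+3/2}}.
\end{align*}
```

Letting a→0 gives √(2/3) and a→∞ gives 1, which matches the range stated above.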
I found it more difficult to construct a family of distributions for which rX,X² takes small values (>0). This mixture of a uniform distribution and an exponential distribution seems to have this feature (for certain parameter values specified later): With parameters 0<c<1 and r≥1 define f(x)=1/c−c^(r−1) for 0≤x≤c and f(x)=c^r exp(c−x) for x>c.
After calculating the terms in the correlation coefficient formula (see SAS code below), it turns out that for 3<r<4 the lowest-order terms (with respect to c) in the numerator and denominator are c³/12 and (under the square root) 2c^(r+2), respectively. Since 3>(r+2)/2, the limit of rX,X² for c→0 is 0 (which is also true for some other values of r). On the other hand, we have rX,X² → 3/sqrt(10)=0.94868... for c→1 (irrespective of r). For large values of r, such as r=6, we get rX,X² → sqrt(15)/4=0.96824... for c→0, thus overlapping the range found for the gamma distributions.
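The moments m1 to m4 in the code can be recovered by integrating xⁿ against the two pieces of the density. This is my own sketch of that step, using the standard integral of xⁿ·exp(c−x) over (c, ∞):

```latex
\begin{align*}
E[X^n] &= \Bigl(\frac{1}{c} - c^{\,r-1}\Bigr)\int_0^c x^n \,dx
          + c^{\,r}\int_c^\infty x^n e^{\,c-x}\,dx \\
       &= \frac{c^{\,n}}{n+1} - \frac{c^{\,r+n}}{n+1}
          + c^{\,r}\sum_{k=0}^{n}\frac{n!}{k!}\,c^{\,k}.
\end{align*}
```

For n=1 this gives c/2 − c^(r+1)/2 + c^r(1+c) = c^r(c+2)/2 + c/2, which agrees with the expression for m1.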
Here is SAS code to compute and plot rX,X² against c for selected values of r:
data pearson(drop=i j);
  /* selected values of the exponent r */
  array _r[6] _temporary_ (1, 2.7, 3.5, 5, 6, 8);
  do j=1 to dim(_r);
    r=_r[j];
    do i=1 to 500;
      c=i/500;
      /* first four moments about the origin, E[X**n] for n=1,...,4 */
      m1=c**r/2*(c+2)+c/2;
      m2=2*c**r/3*(c**2+3*c+3)+c**2/3;
      m3=3*c**r/4*(c**3+4*c**2+8*c+8)+c**3/4;
      m4=4*c**r/5*(c**4+5*c**3+15*c**2+30*c+30)+c**4/5;
      /* Pearson correlation coefficient of X and X**2 */
      rho=(m3-m1*m2)/sqrt((m2-m1**2)*(m4-m2**2));
      output;
    end;
  end;
  /* one extra observation carrying the reference-line values and labels */
  lim1=sqrt(15)/4;
  ref1='sqrt(15)/4';
  lim2=3/sqrt(10);
  ref2='3/sqrt(10)';
  output;
run;

proc sgplot data=pearson;
  series x=c y=rho / group=r;
  refline lim1 / label=ref1;
  refline lim2 / label=ref2;
run;
Code for a simulation of the above family of distributions (caveat: numerical instability for extreme values of the parameters, e.g. small values of c):
%let c=0.1;
%let r=5;

data sim(drop=i);
  call streaminit(27182818);
  do i=1 to 1e7;
    /* with probability 1-c**r draw from U(0,c), */
    /* otherwise from a standard exponential shifted by c */
    if rand('bern',1-&c**&r) then x=rand('uniform',0,&c);
    else x=rand('expo',1)+&c;
    x2=x**2;
    output;
  end;
run;

proc corr data=sim;
  var x x2;
run;
[Edit: Fixed the hyperlink to the documentation on the uniform distribution.]
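To compare the simulated coefficient with the theoretical one, the moment formulas from the plotting code can be evaluated for the same parameter values. This is only a sketch: it reuses the expressions for m1 to m4 unchanged and writes the result to the log; the data _null_ step and the PUT statement are my additions.

```sas
%let c=0.1;
%let r=5;

data _null_;
  c=&c;
  r=&r;
  /* moments about the origin, same expressions as in the plotting code */
  m1=c**r/2*(c+2)+c/2;
  m2=2*c**r/3*(c**2+3*c+3)+c**2/3;
  m3=3*c**r/4*(c**3+4*c**2+8*c+8)+c**3/4;
  m4=4*c**r/5*(c**4+5*c**3+15*c**2+30*c+30)+c**4/5;
  /* theoretical Pearson correlation of X and X**2 */
  rho=(m3-m1*m2)/sqrt((m2-m1**2)*(m4-m2**2));
  put rho= 8.5;
run;
```

The value written to the log should be close to the coefficient reported by PROC CORR for the simulated data, up to sampling error.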
Take the formula, replace its components with x and x² where necessary, and see if you can simplify it. That would likely be your answer.
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
But correlation is intended to measure the strength of a linear relationship, and x vs. x² is not a linear relationship; it's a quadratic one. As you may recall from calculus, the range of X affects whether the relationship looks quadratic or linear: over a small range it can appear nearly linear.
HTH.
@CSDoot1 wrote:
What would be the Pearson correlation coefficient between x and x² if X can take only positive values (>0)?
Pearson correlation measures linear association, NOT non-linear association.
But your relationship is non-linear. I remember that @Rick_SAS wrote a blog post about measuring this kind of non-linear association.
It is called distance correlation.
https://blogs.sas.com/content/iml/2018/04/04/distance-correlation.html