Correlation between x and x² (square)

CSDoot1 · Posted 07-19-2019 04:20 PM

What would be the Pearson correlation coefficient between x and x² if X can take only positive values (>0)?

ballardw · Posted 07-19-2019 04:32 PM

It depends on the actual values of x and on how many repetitions the individual x values might have.

You can play with the upper and lower bounds on x in the do loop below and examine the output to get a feeling for it.

data work.junk;
   do x = 1 to 1000;
      x2 = x*x;
      output;
   end;
run;

proc corr data=work.junk pearson;
run;

A Spearman correlation would be 1.
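To see both coefficients side by side, the same experiment can request them in a single PROC CORR step. This is only a sketch: the macro variables lo and hi are names introduced here for illustration (they are not part of the code above), and it assumes every integer between the bounds occurs exactly once.

```sas
/* Hypothetical helper macro variables: lower and upper bound of x */
%let lo=1;
%let hi=1000;

data work.junk;
   do x = &lo to &hi;
      x2 = x*x;   /* x2 is a strictly increasing function of x for x > 0 */
      output;
   end;
run;

/* PEARSON and SPEARMAN request both correlation tables in one run */
proc corr data=work.junk pearson spearman;
   var x x2;
run;
```

Changing lo and hi changes the Pearson value, while the Spearman value stays at 1 because the ranks of x and x2 coincide for positive x.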

CSDoot1 · Posted 07-19-2019 05:06 PM

Here X is an independent random variable and can take any positive value. In that case, what would be the Pearson correlation coefficient?

PaigeMiller · Posted 07-19-2019 05:33 PM

@CSDoot1 wrote:

Here X is an independent random variable and can take any positive value. In that case, what would be the Pearson correlation coefficient?

@ballardw has given you code to estimate the correlation between x and x².

--
Paige Miller

ballardw · Posted 07-19-2019 05:56 PM

@CSDoot1 wrote:

Here X is an independent random variable and can take any positive value. In that case, what would be the Pearson correlation coefficient?

Without knowing the actual values of X, there is no "closed form" answer to this question.

If you have not looked at the formula for the Pearson correlation coefficient, then it is time to do so.

If you don't understand the equation, then it is time to brush up on several elements.

FreelanceReinh · Posted 07-22-2019 05:20 PM

Hi @CSDoot1,

@CSDoot1 wrote:

Here X is an independent random variable and can take any positive value. In that case, what would be the Pearson correlation coefficient?

As others have pointed out, the answer to your question depends on the data (if it's about the empirical correlation) or on the distribution of X (if it's about the random variables X and X²). For the latter case @Reeza has shown the relevant formula. It contains various expected values, so you see that the Pearson correlation coefficient r_{X,X²} may not even exist for certain distributions of X (e.g. Cauchy distribution).

Just for demonstration, let's assume that X has a continuous distribution with a probability density f(x)>0 for all x>0 and f(x)=0 for all x<0 such that the first four moments about the origin (and hence the Pearson correlation coefficient) exist. Under these assumptions the set of possible values of r_{X,X²} equals the open interval (0, 1) (i.e., for any value 0<q<1 there's a distribution such that r_{X,X²}=q).

Here's why:

Our assumptions imply that the first four moments about the origin (i.e. E[Xⁿ] for n=1, 2, 3, 4) and also Var(X) are positive.

First, we see that r_{X,X²}>0: The numerator in the formula, in our case E[X³]−E[X]E[X²], is positive. This follows from the fact that E[X]²<E[X²] (because Var(X)>0), hence E[X]<√E[X²] and E[X]E[X²]<E[X²]^(3/2). The latter term is ≤E[X³] by Lyapounov's inequality (a special case of which states: E[X²]^(1/2) ≤ E[|X|³]^(1/3); see, e.g., Billingsley, Probability and Measure, 3rd ed., p. 81).

Second, r_{X,X²}<1: Indeed, r_{X,X²}=1 would imply P(X²=aX+b)=1 for some numbers a, b by a well-known result found in textbooks on mathematical statistics (see, e.g., Casella, Berger: Statistical Inference, 2nd ed., Theorem 4.5.7b). But x²>ax+b for all x>k and a suitable k=k(a,b)>0, and we have P(X>k)>0 by our assumption on f.

It's easy to find distributions for which r_{X,X²} takes large values (<1).

Example: The gamma distribution with shape parameter a > 0 and scale parameter l > 0. A straightforward calculation shows that in this case

r_{X,X²} = sqrt((a+1)/(a+3/2))

(irrespective of the scale parameter l), an expression whose values cover the range (√(2/3), 1), i.e. (0.816..., 1).
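The calculation can be sketched as follows. It assumes the shape/scale parametrisation, in which E[Xⁿ] = lⁿ·a(a+1)⋯(a+n−1), and sets l = 1, which loses nothing because multiplying X by a positive constant leaves the correlation between X and X² unchanged.

```latex
\begin{align*}
\operatorname{E}[X^n] &= a(a+1)\cdots(a+n-1) \qquad (l = 1),\\
\operatorname{Cov}(X, X^2) &= \operatorname{E}[X^3] - \operatorname{E}[X]\operatorname{E}[X^2]
  = a(a+1)(a+2) - a^2(a+1) = 2a(a+1),\\
\operatorname{Var}(X) &= a(a+1) - a^2 = a,\\
\operatorname{Var}(X^2) &= a(a+1)(a+2)(a+3) - a^2(a+1)^2 = 2a(a+1)(2a+3),\\
r_{X,X^2} &= \frac{2a(a+1)}{\sqrt{a \cdot 2a(a+1)(2a+3)}}
  = \sqrt{\frac{2(a+1)}{2a+3}} = \sqrt{\frac{a+1}{a+3/2}}.
\end{align*}
```

Letting a→0 gives √(2/3) and letting a→∞ gives 1, which are the two ends of the range quoted above.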

I found it more difficult to construct a family of distributions for which r_{X,X²} takes small values (>0). This mixture of a uniform distribution and an exponential distribution seems to have this feature (for certain parameter values specified later): With parameters 0<c<1 and r≥1 define f(x)=1/c−c^(r−1) for 0≤x≤c and f(x)=c^r exp(c−x) for x>c.

After calculating the terms in the correlation coefficient formula (see SAS code below), it turns out that for 3<r<4 the lowest-order terms (with respect to c) in the numerator and denominator are c³/12 and (under the square root) 2c^(r+2), respectively. Since 3>(r+2)/2, the limit of r_{X,X²} for c→0 is 0 (which is also true for some other values of r). On the other hand, we have r_{X,X²} → 3/sqrt(10)=0.94868... for c→1 (irrespective of r), and for large values of r such as r=6 we have r_{X,X²} → sqrt(15)/4=0.96824... for c→0, thus overlapping the range found for the gamma distributions.

Here is SAS code to compute and plot r_{X,X²} against c for selected values of r:
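For readers who want to check the leading-order terms, here is a sketch of the moments about the origin of this mixture. The symbols m_n = E[Xⁿ] and Y (a standard exponential random variable) are shorthand introduced here; m_1 to m_4 correspond to m1 to m4 in the SAS code.

```latex
\begin{align*}
m_n &= \Bigl(\tfrac{1}{c} - c^{r-1}\Bigr)\frac{c^{n+1}}{n+1}
       + c^r \operatorname{E}\bigl[(Y+c)^n\bigr], \qquad Y \sim \operatorname{Exp}(1),\\
m_1 &= \tfrac{c}{2} + \tfrac{c^r}{2}\,(c+2),\\
m_2 &= \tfrac{c^2}{3} + \tfrac{2c^r}{3}\,(c^2+3c+3),\\
m_3 &= \tfrac{c^3}{4} + \tfrac{3c^r}{4}\,(c^3+4c^2+8c+8),\\
m_4 &= \tfrac{c^4}{5} + \tfrac{4c^r}{5}\,(c^4+5c^3+15c^2+30c+30).
\end{align*}
% Leading-order behaviour for 3 < r < 4 as c -> 0:
\begin{align*}
m_3 - m_1 m_2 &= \tfrac{c^3}{12} + O(c^r), &
m_2 - m_1^2 &= \tfrac{c^2}{12} + O(c^r), &
m_4 - m_2^2 &= 24\,c^r + O(c^4),\\
r_{X,X^2} &\approx \frac{c^3/12}{\sqrt{2\,c^{r+2}}}
  = \frac{c^{(4-r)/2}}{12\sqrt{2}} \longrightarrow 0 .
\end{align*}
```

The exponent (4−r)/2 is positive exactly when r<4, which is the condition 3>(r+2)/2 stated above.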

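/* Pearson correlation rho between X and X**2 for the mixture distribution,
   computed on a grid of c values for selected values of r */
data pearson(drop=i j);
array _r[6] _temporary_ (1, 2.7, 3.5, 5, 6, 8);
do j=1 to dim(_r);
  r=_r[j];
  do i=1 to 500;
    c=i/500;
    /* m1-m4: moments about the origin, E[X**n] for n=1, 2, 3, 4 */
    m1=c**r/2*(c+2)+c/2;
    m2=2*c**r/3*(c**2+3*c+3)+c**2/3;
    m3=3*c**r/4*(c**3+4*c**2+8*c+8)+c**3/4;
    m4=4*c**r/5*(c**4+5*c**3+15*c**2+30*c+30)+c**4/5;
    /* Cov(X,X**2) divided by the product of the standard deviations */
    rho=(m3-m1*m2)/sqrt((m2-m1**2)*(m4-m2**2));
    output;
  end;
end;
/* One extra observation holding the two limits for the reference lines */
lim1=sqrt(15)/4;
ref1='sqrt(15)/4';
lim2=3/sqrt(10);
ref2='3/sqrt(10)';
output;
run;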

proc sgplot data=pearson;
series x=c y=rho / group=r;
refline lim1 / label=ref1;
refline lim2 / label=ref2;
run;

Code for a simulation of the above family of distributions (caveat: numerical instability for extreme values of the parameters, e.g. small values of c):

%let c=0.1;
%let r=5;

data sim(drop=i);
call streaminit(27182818);
do i=1 to 1e7;
  if rand('bern',1-&c**&r) then x=rand('uniform',0,&c);
  else x=rand('expo',1)+&c;
  x2=x**2;
  output;
end;
run;

proc corr data=sim;
var x x2;
run;
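To compare the simulated estimate with the theoretical value, the moment formulas from the plotting code can be evaluated for the same c and r. This is a minimal sketch that only reuses those expressions; it writes the result to the log.

```sas
/* Same parameter values as in the simulation */
%let c=0.1;
%let r=5;

data _null_;
   c=&c;
   r=&r;
   /* Moments about the origin, E[X**n] for n=1, 2, 3, 4 */
   m1=c**r/2*(c+2)+c/2;
   m2=2*c**r/3*(c**2+3*c+3)+c**2/3;
   m3=3*c**r/4*(c**3+4*c**2+8*c+8)+c**3/4;
   m4=4*c**r/5*(c**4+5*c**3+15*c**2+30*c+30)+c**4/5;
   /* Theoretical Pearson correlation between X and X**2 */
   rho=(m3-m1*m2)/sqrt((m2-m1**2)*(m4-m2**2));
   put 'Theoretical value: ' c= r= rho=;
run;
```

The value of rho in the log should be close to the correlation reported by PROC CORR, up to simulation error.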

[Edit: Fixed the hyperlink to the documentation on the uniform distribution.]

Reeza · Posted 07-19-2019 09:25 PM

Take the formula and replace the components by x and x² where necessary and see if you can simplify it. That would likely be your answer.

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

$\rho _{X,Y}={\frac {\operatorname {E} [XY]-\operatorname {E} [X]\operatorname {E} [Y]}{{\sqrt {\operatorname {E} [X^{2}]-[\operatorname {E} [X]]^{2}}}~{\sqrt {\operatorname {E} [Y^{2}]-[\operatorname {E} [Y]]^{2}}}}}.$
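As a sketch of that substitution, putting Y = X² into the formula (and assuming E[X⁴] is finite, so that every term exists) gives:

```latex
\[
\rho_{X,X^2}
  = \frac{\operatorname{E}[X^3] - \operatorname{E}[X]\operatorname{E}[X^2]}
         {\sqrt{\operatorname{E}[X^2] - [\operatorname{E}[X]]^2}\;
          \sqrt{\operatorname{E}[X^4] - [\operatorname{E}[X^2]]^2}} .
\]
```

So the answer is determined by the first four moments of X, which is why it cannot be stated without knowing the distribution.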

But correlation is intended to measure the strength of a linear relationship, and x vs. x² is not a linear relationship; it's a quadratic one. If you recall from calculus, the range of X affects whether the relationship appears quadratic or linear: over a smaller range it can appear nearly linear.

HTH.

@CSDoot1 wrote:

What would be the Pearson correlation coefficient between x and x² if X can take only positive values (>0)?

Ksharp · Posted 07-20-2019 08:00 AM

Pearson correlation measures linear correlation, NOT non-linear association.

But your relationship is non-linear. I remember that @Rick_SAS wrote a blog post about testing such non-linear association.

It is called distance correlation.

https://blogs.sas.com/content/iml/2018/04/04/distance-correlation.html
