## correlation of data in normal form: is there an easy way?

Solved
Frequent Contributor
Posts: 78

# correlation of data in normal form: is there an easy way?

I have data in normal form: key1, key2, value.  I'd like to compute the correlation of key2 pairs across key1s.  For example if key1 are dates and key2 are people and the values are their test scores, I want to correlate pairwise test scores of people over time.  I also want to capture the stat output (p-values, nobs) for each correlation coefficient.  The end result should be in normal form: key21, key22, corr, p-value, nobs.  For example, each row would contain the correlation of test scores of 2 people (identified by key21 and key22) etc.

Since proc corr wants the entities to be correlated in columns, I need to pivot the input data, compute the corr, capture the output, and un-pivot the output.  Given that data is often stored in normal form, I was hoping there was an easy way to do all this.  I can't find it.  Here's what I came up with.  Let me know of a better way.

* pivot normalized data into a table;

proc transpose data=Activity out=ActivityPivot prefix=H;

by Key1;

var Value2Correlate;

id Key2;

* Save output, including p-values;

* Output contains a var named Variable which holds Key2 values;

proc corr data=ActivityPivot ;

ods output PearsonCorr = IndividualCorr;

* un-pivot correlation output to normal form;

proc sort data=IndividualCorr;

by Variable;

proc transpose data=IndividualCorr out=blah;

by Variable;

* construct a nice output format: Key21 Key22 (identify the pairwise correlation)  Corr, p-value, nobs;

data CorrCorr;

set blah( rename=(Col1=Value) where=(_NAME_ like 'H%'));

Key21 = substr( Variable, 2);

Key22 = substr( _NAME_, 2);

drop Variable _NAME_;

run;

data CorrN;

set blah( rename=(Col1=Value) where=(_NAME_ like 'NH%'));

Key21 = substr( Variable, 2);

Key22 = substr( _NAME_, 3);

drop Variable _NAME_;

run;

data CorrP;

set blah( rename=(Col1=Value) where=(_NAME_ like 'PH%'));

Key21 = substr( Variable, 2);

Key22 = substr( _NAME_, 3);

drop Variable _NAME_;

run;

proc sql;

create table Here.HolderCorr as

select a.Key21, a.Key22, a.Value as Correlation, b.Value as PValue, c.Value as N

from CorrCorr a

join CorrP b

on a.Key21 = b.Key21

and a.Key22 = b.Key22

join CorrN c

on a.Key21 = c.Key21

and a.Key22 = c.Key22

order by 1,2;

PS: why can't I paste code from a text editor?  Ctrl-v didn't work.  I had to paste it into the html editor, causing me to lose all line breaks.

Accepted Solutions
Solution
‎09-27-2012 09:30 PM
Posts: 5,056

## Re: correlation of data in normal form: is there an easy way?

Well, you could always do the math yourself and avoid the pivot-unpivot, like this :

proc sql;

select key21, key22,

(sxy-n*mx*my)/sqrt(sx2*sy2) as corr,

n,

coalesce((1-probt(sqrt((n-2)*(calculated corr)**2 / (1-(calculated corr)**2)), n-2))*2, 0)

as probt format=pvalue.

from

(select t1.key2 as key21, t2.key2 as key22,

sum(t1.x*t2.x) as sxy,

mean(t1.x) as mx, mean(t2.x) as my,

css(t1.x) as sx2, css(t2.x) as sy2,

count(*) as n

from test as t1 inner join test as t2 on t1.key1=t2.key1

where t1.x is not missing and t2.x is not missing

group by t1.key2, t2.key2);

quit;

I think I got that right... :smileygrin:

PG

PG

All Replies
Solution
‎09-27-2012 09:30 PM
Posts: 5,056

## Re: correlation of data in normal form: is there an easy way?

Well, you could always do the math yourself and avoid the pivot-unpivot, like this :

proc sql;

select key21, key22,

(sxy-n*mx*my)/sqrt(sx2*sy2) as corr,

n,

coalesce((1-probt(sqrt((n-2)*(calculated corr)**2 / (1-(calculated corr)**2)), n-2))*2, 0)

as probt format=pvalue.

from

(select t1.key2 as key21, t2.key2 as key22,

sum(t1.x*t2.x) as sxy,

mean(t1.x) as mx, mean(t2.x) as my,

css(t1.x) as sx2, css(t2.x) as sy2,

count(*) as n

from test as t1 inner join test as t2 on t1.key1=t2.key1

where t1.x is not missing and t2.x is not missing

group by t1.key2, t2.key2);

quit;

I think I got that right... :smileygrin:

PG

PG
Frequent Contributor
Posts: 78

## Re: correlation of data in normal form: is there an easy way?

I'll give it a try.  I usually hesitate to roll my own; the gurus at SAS set sensible defaults for their procs.  (I've seen analyses go off the rails in MATLAB by "kids" who do regression using matrix math.)  Having said that, if it works (matches proc corr) and can handle more data (proc corr barfs on more than a few thousand columns), I'll come back and give you credit.

Even if this does not work out, I've learned a lot: I never knew about the calculated keyword for sql, and I've never used any but the most basic aggregate functions.  CSS is a new one to me.  Thanks for that.

Posts: 5,056

## Re: correlation of data in normal form: is there an easy way?

In any case, if you are doing thousands of tests, you should worry A LOT about multiple testing. Those p-values should be corrected with proc multtest. Be ready for a somewhat sobering experience. - PG

PG
Frequent Contributor
Posts: 78

## Re: correlation of data in normal form: is there an easy way?

If I understand you, my problem is that I will get a lot of false positives.  For example is I use a 99% confidence (p-value of 0.01 or smaller), I should expect to be fooled into thinking the correlation is non-zero when it really is zero, 1 in 100 correlations.  Is that right?

I'm not building a model, so the correlations are just a hint.  They give me a rock to turn over, to look for something bigger.  Hopefully that will save me.  I'll check out proc multtest anyway.  Learned a lot from your posts.  Thanks.

🔒 This topic is solved and locked.