correlation of data in normal form: is there an ea...


09-27-2012 11:37 AM

I have data in normal form: key1, key2, value. I'd like to compute the correlation of key2 pairs across key1s. For example, if the key1 values are dates, the key2 values are people, and the values are their test scores, I want to correlate the pairwise test scores of people over time. I also want to capture the stat output (p-values, nobs) for each correlation coefficient. The end result should be in normal form: key21, key22, corr, p-value, nobs. For example, each output row would contain the correlation of the test scores of two people (identified by key21 and key22), along with its p-value and number of observations.

Since proc corr wants the entities to be correlated in columns, I need to pivot the input data, compute the corr, capture the output, and un-pivot the output. Given that data is often stored in normal form, I was hoping there was an easy way to do all this, but I can't find one. Here's what I came up with. **Let me know of a better way.**
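For readers outside SAS, the same pivot → correlate → un-pivot idea can be sketched in plain Python (toy data and names below are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Toy normalized rows: (key1=date, key2=person, value=score), all made up.
rows = [
    ("d1", "a", 1.0), ("d1", "b", 2.0),
    ("d2", "a", 2.0), ("d2", "b", 4.0),
    ("d3", "a", 3.0), ("d3", "b", 6.0),
]

# "Pivot": one column per key2 value, indexed by key1.
cols = defaultdict(dict)
for k1, k2, v in rows:
    cols[k2][k1] = v

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx2 = sum((x - mx) ** 2 for x in xs)
    sy2 = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sx2 * sy2)

# Correlate each pair of columns over their shared key1s; emit rows
# that are already in normal form: key21, key22, corr, nobs.
out = []
for k21, k22 in combinations(sorted(cols), 2):
    shared = sorted(cols[k21].keys() & cols[k22].keys())
    xs = [cols[k21][k] for k in shared]
    ys = [cols[k22][k] for k in shared]
    out.append((k21, k22, pearson(xs, ys), len(shared)))

print(out)  # [('a', 'b', 1.0, 3)]
```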

    * pivot normalized data into a table;
    proc transpose data=Activity out=ActivityPivot prefix=H;
        by Key1;
        var Value2Correlate;
        id Key2;
    run;

    * save output, including p-values;
    * output contains a var named Variable which holds Key2 values;
    proc corr data=ActivityPivot;
        ods output PearsonCorr = IndividualCorr;
    run;

    * un-pivot correlation output to normal form;
    proc sort data=IndividualCorr;
        by Variable;
    run;

    proc transpose data=IndividualCorr out=blah;
        by Variable;
    run;

    * construct a nice output format: Key21 Key22 (identifying the pair), Corr, p-value, nobs;
    data CorrCorr;
        set blah( rename=(Col1=Value) where=(_NAME_ like 'H%'));
        Key21 = substr( Variable, 2);
        Key22 = substr( _NAME_, 2);
        drop Variable _NAME_;
    run;

    data CorrN;
        set blah( rename=(Col1=Value) where=(_NAME_ like 'NH%'));
        Key21 = substr( Variable, 2);
        Key22 = substr( _NAME_, 3);
        drop Variable _NAME_;
    run;

    data CorrP;
        set blah( rename=(Col1=Value) where=(_NAME_ like 'PH%'));
        Key21 = substr( Variable, 2);
        Key22 = substr( _NAME_, 3);
        drop Variable _NAME_;
    run;

    proc sql;
        create table Here.HolderCorr as
        select a.Key21, a.Key22, a.Value as Correlation,
               b.Value as PValue, c.Value as N
        from CorrCorr a
        join CorrP b
          on a.Key21 = b.Key21 and a.Key22 = b.Key22
        join CorrN c
          on a.Key21 = c.Key21 and a.Key22 = c.Key22
        order by 1, 2;
    quit;
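For intuition, the final join can be sketched in plain Python, with hypothetical dicts standing in for CorrCorr, CorrP and CorrN (all numbers made up):

```python
# Hypothetical unpivoted pieces, each keyed by the (Key21, Key22) pair.
corr = {("a", "b"): 0.9, ("a", "c"): -0.2}
pval = {("a", "b"): 0.01, ("a", "c"): 0.55}
nobs = {("a", "b"): 12, ("a", "c"): 9}

# Inner join on the pair key, then order by key21, key22:
holder = [
    (k21, k22, corr[k21, k22], pval[k21, k22], nobs[k21, k22])
    for (k21, k22) in sorted(corr)
    if (k21, k22) in pval and (k21, k22) in nobs
]
print(holder)  # [('a', 'b', 0.9, 0.01, 12), ('a', 'c', -0.2, 0.55, 9)]
```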

PS: why can't I paste code from a text editor? Ctrl-V didn't work; I had to paste into the HTML editor, which lost all my line breaks.

Accepted Solutions

Solution

Posted in reply to jdmarino
09-27-2012 09:30 PM

Well, you could always do the math yourself and avoid the pivot-unpivot, like this:

    proc sql;
    select key21, key22,
           (sxy - n*mx*my) / sqrt(sx2*sy2) as corr,
           n,
           coalesce((1 - probt(sqrt((n-2)*(calculated corr)**2 /
                    (1 - (calculated corr)**2)), n-2)) * 2, 0)
               as probt format=pvalue.
    from
        (select t1.key2 as key21, t2.key2 as key22,
                sum(t1.x*t2.x) as sxy,
                mean(t1.x) as mx, mean(t2.x) as my,
                css(t1.x) as sx2, css(t2.x) as sy2,
                count(*) as n
         from test as t1 inner join test as t2 on t1.key1=t2.key1
         where t1.x is not missing and t2.x is not missing
         group by t1.key2, t2.key2);
    quit;

I think I got that right... :smileygrin:

PG

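As a sanity check on the algebra: css() is the corrected sum of squares, so (sxy - n*mx*my)/sqrt(sx2*sy2) should equal the textbook Pearson r. A plain-Python check with made-up scores:

```python
from math import sqrt

# Made-up scores for one pair of people observed on the same 4 dates:
x = [1.0, 2.0, 4.0, 8.0]
y = [2.0, 3.0, 5.0, 7.0]
n = len(x)

# Raw aggregates, mirroring the inner query:
sxy = sum(a * b for a, b in zip(x, y))   # sum(t1.x*t2.x)
mx, my = sum(x) / n, sum(y) / n          # mean(t1.x), mean(t2.x)
sx2 = sum((a - mx) ** 2 for a in x)      # css(t1.x)
sy2 = sum((b - my) ** 2 for b in y)      # css(t2.x)

# The expression above: corrected cross-product over the css terms.
corr = (sxy - n * mx * my) / sqrt(sx2 * sy2)

# Textbook Pearson r for comparison:
ref = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sqrt(sx2 * sy2)
assert abs(corr - ref) < 1e-12

# The t statistic fed to probt(); the two-sided p-value itself needs
# a Student-t CDF (probt in SAS), which is omitted here.
t = sqrt((n - 2) * corr**2 / (1 - corr**2))
```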

All Replies



Posted in reply to PGStats

09-28-2012 07:58 AM

I'll give it a try. I usually hesitate to roll my own; the gurus at SAS set sensible defaults for their procs. (I've seen analyses go off the rails in MATLAB by "kids" who do regression using matrix math.) Having said that, if it works (matches proc corr) and can handle more data (proc corr barfs on more than a few thousand columns), I'll come back and give you credit.

Even if this does not work out, I've learned a lot: I never knew about the calculated keyword for sql, and I've never used any but the most basic aggregate functions. CSS is a new one to me. Thanks for that.


Posted in reply to jdmarino

09-28-2012 07:08 PM

In any case, if you are doing thousands of tests, you should worry A LOT about multiple testing. Those p-values should be corrected with proc multtest. Be ready for a somewhat sobering experience. - PG

PG
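For intuition, here is a minimal pure-Python sketch of one correction PROC MULTTEST offers, the Benjamini-Hochberg false-discovery-rate step-up (the p-values below are made up):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return indices of tests rejected at FDR level q (BH step-up)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank r with p_(r) <= (r/m) * q
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    # Reject the k smallest p-values.
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(pvals, q=0.05))  # [0, 1]
```

Note that the raw 0.039 and 0.041 would pass a naive 0.05 cutoff but do not survive the correction, which is the "sobering experience" being warned about.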


Posted in reply to PGStats

10-01-2012 08:33 AM

If I understand you, my problem is that I will get a lot of false positives. For example, if I use 99% confidence (a p-value of 0.01 or smaller), I should expect to be fooled into thinking a correlation is non-zero when it really is zero about 1 in 100 times. Is that right?

I'm not building a model, so the correlations are just a hint. They give me a rock to turn over, to look for something bigger. Hopefully that will save me. I'll check out proc multtest anyway. Learned a lot from your posts. Thanks.