BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jdmarino
Fluorite | Level 6

I have data in normal form: key1, key2, value.  I'd like to compute the correlation of key2 pairs across key1s.  For example if key1 are dates and key2 are people and the values are their test scores, I want to correlate pairwise test scores of people over time.  I also want to capture the stat output (p-values, nobs) for each correlation coefficient.  The end result should be in normal form: key21, key22, corr, p-value, nobs.  For example, each row would contain the correlation of test scores of 2 people (identified by key21 and key22) etc.

Since proc corr wants the entities to be correlated in columns, I need to pivot the input data, compute the corr, capture the output, and un-pivot the output.  Given that data is often stored in normal form, I was hoping there was an easy way to do all this.  I can't find it.  Here's what I came up with.  Let me know of a better way.

* pivot normalized data into a table;

proc transpose data=Activity out=ActivityPivot prefix=H;

     by Key1;

     var Value2Correlate;

     id Key2;

* Save output, including p-values;

* Output contains a var named Variable which holds Key2 values;

proc corr data=ActivityPivot ;

     ods output PearsonCorr = IndividualCorr;

* un-pivot correlation output to normal form;

proc sort data=IndividualCorr;

     by Variable;

proc transpose data=IndividualCorr out=blah;

     by Variable;

* construct a nice output format: Key21 Key22 (identify the pairwise correlation)  Corr, p-value, nobs;

data CorrCorr;

     set blah( rename=(Col1=Value) where=(_NAME_ like 'H%'));

     Key21 = substr( Variable, 2);

     Key22 = substr( _NAME_, 2);

     drop Variable _NAME_;

     run;

data CorrN;

     set blah( rename=(Col1=Value) where=(_NAME_ like 'NH%'));

     Key21 = substr( Variable, 2);

     Key22 = substr( _NAME_, 3);

     drop Variable _NAME_;

     run;

data CorrP;

     set blah( rename=(Col1=Value) where=(_NAME_ like 'PH%'));

     Key21 = substr( Variable, 2);

     Key22 = substr( _NAME_, 3);

     drop Variable _NAME_;

     run;

proc sql;

     create table Here.HolderCorr as

     select a.Key21, a.Key22, a.Value as Correlation, b.Value as PValue, c.Value as N

     from CorrCorr a

     join CorrP b

       on a.Key21 = b.Key21

      and a.Key22 = b.Key22

     join CorrN c

       on a.Key21 = c.Key21

      and a.Key22 = c.Key22

     order by 1,2;

PS: why can't I paste code from a text editor?  Ctrl-v didn't work.  I had to paste it into the html editor, causing me to lose all line breaks.

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Well, you could always do the math yourself and avoid the pivot-unpivot, like this :

proc sql;

select key21, key22,

     (sxy-n*mx*my)/sqrt(sx2*sy2) as corr,

     n,

     coalesce((1-probt(sqrt((n-2)*(calculated corr)**2 / (1-(calculated corr)**2)), n-2))*2, 0)

          as probt format=pvalue.

from    

     (select t1.key2 as key21, t2.key2 as key22,

          sum(t1.x*t2.x) as sxy,

          mean(t1.x) as mx, mean(t2.x) as my,

          css(t1.x) as sx2, css(t2.x) as sy2,

          count(*) as n

       from test as t1 inner join test as t2 on t1.key1=t2.key1

       where t1.x is not missing and t2.x is not missing

       group by t1.key2, t2.key2);

quit;

I think I got that right... :smileygrin:

PG

PG

View solution in original post

4 REPLIES 4
PGStats
Opal | Level 21

Well, you could always do the math yourself and avoid the pivot-unpivot, like this :

proc sql;

select key21, key22,

     (sxy-n*mx*my)/sqrt(sx2*sy2) as corr,

     n,

     coalesce((1-probt(sqrt((n-2)*(calculated corr)**2 / (1-(calculated corr)**2)), n-2))*2, 0)

          as probt format=pvalue.

from    

     (select t1.key2 as key21, t2.key2 as key22,

          sum(t1.x*t2.x) as sxy,

          mean(t1.x) as mx, mean(t2.x) as my,

          css(t1.x) as sx2, css(t2.x) as sy2,

          count(*) as n

       from test as t1 inner join test as t2 on t1.key1=t2.key1

       where t1.x is not missing and t2.x is not missing

       group by t1.key2, t2.key2);

quit;

I think I got that right... :smileygrin:

PG

PG
jdmarino
Fluorite | Level 6

I'll give it a try.  I usually hesitate to roll my own; the gurus at SAS set sensible defaults for their procs.  (I've seen analyses go off the rails in MATLAB by "kids" who do regression using matrix math.)  Having said that, if it works (matches proc corr) and can handle more data (proc corr barfs on more than a few thousand columns), I'll come back and give you credit.

Even if this does not work out, I've learned a lot: I never knew about the calculated keyword for sql, and I've never used any but the most basic aggregate functions.  CSS is a new one to me.  Thanks for that.

PGStats
Opal | Level 21

In any case, if you are doing thousands of tests, you should worry A LOT about multiple testing. Those p-values should be corrected with proc multtest. Be ready for a somewhat sobering experience. - PG

PG
jdmarino
Fluorite | Level 6

If I understand you, my problem is that I will get a lot of false positives.  For example is I use a 99% confidence (p-value of 0.01 or smaller), I should expect to be fooled into thinking the correlation is non-zero when it really is zero, 1 in 100 correlations.  Is that right?

I'm not building a model, so the correlations are just a hint.  They give me a rock to turn over, to look for something bigger.  Hopefully that will save me.  I'll check out proc multtest anyway.  Learned a lot from your posts.  Thanks.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 1472 views
  • 4 likes
  • 2 in conversation