Calcite | Level 5

## how to compute the average correlation across a very large correlation matrix

I’m trying to compute the average cross-sectional correlation across a large dataset. I may also want to compute a weighted average. Inaddition, my raw data is in stacked form. Assume, for example, I have a stacked list of investors and their portfolio weights (that sum to one for each investor) across all stocks (assume only four stocks).

Investor               Stock     Portfolio_weight

A                             IBM        0.50

A                             MSFT     0.40

A                             GOOG   0.10

A                             GRPN    0.00

B                             IBM        0.40

B                             MFST     0.60

B                             GOOG   0.00

B                             GRPM   0.00

C                             IBM        0.50

C                             MSFT     0

C                             GOGG   0

C                             GRPN    0.50

I wrote alittle test code (see below) to compute the average correlation between weights across investors. First, I compute a weights matrix where each row is a company and each column is an investor (data outb):

dataoutb

company

Obs      ID      _NAME_     A      B     C

1     GOOG        wt      0.1   0.0    0.0

2     GRPN        wt      0.0   0.0    0.5

3     IBM         wt      0.5   0.4    0.5

4      MSFT        wt     0.4    0.6    0.0

I then run a proc corr and keep the correlations (data d):

Obs    _TYPE_   _NAME_       A           B           C

1     CORR       A       1.00000     0.88684    0.00000

2     CORR       B       0.88684     1.00000   -0.19245

3      CORR      C       0.00000    -0.19245    1.00000

Because I only want the average of the off diagonal elements,I take the average times N, subtract 1 (the diagonal element) and divide by N-1 to get the average of the off diagonal elements: 0.23 = (0.886+0.00+-0.192)/3 (=ave_corr in data g)

I have four issues:

1.  The number of investors in my simple example is three. In my real data, the number is approximately 3,000, which means the correlation matrix will have 9M elements. Is that an issue? (Moreover the number of stocks is approximately 5,000.)

2.  My code seems a little clumsy. I welcome any suggestion for alternative approaches.

3.  In addition to average correlation, I’d like to compute the sum of the products of the weights for each pair of investors and compute the average of that (e.g., for A and B, 0.1*0 + 0*0 + 0.5*0.4 + 0.4*0.6). Again, there would be approximately 9M of these terms. If I define outb as a 4X3 matrix (in my simple example), I can compute this as outb(transposed)*outb.

After reading about this a bit, this seems like something I should be able to do with proc IML. However, I’m not familiar with proc IML. Any suggestions on how to define data outb as a matrix so I can use proc IML? And if I can do that, how would I’d go about getting the average of the elements in the resulting matrix?

4.  If I standardize the weights for each investor (i.e., subtract the mean and divided by the standard deviation), I could use the same method to compute the correlations (i.e., standardized weights matrix transposed *standardized weights matrix). So that might be another approach. Any thoughtson using IML versus my ‘traditional’ sas code?

Rick

My code

data a; input investor \$ companyID \$ wt;

cards;

A IBM 0.50

A MSFT 0.40

A GOOG 0.10

A GRPN 0.00

B IBM 0.40

B MSFT 0.60

B GOOG 0.00

B GRPN 0.00

C IBM 0.50

C MSFT 0.00

C GOOG 0.00

C GRPN 0.50

;

run;

data b; set a;

proc sort; by companyID;

proc transpose out=outb;by companyID;id investor; var wt;   *outb gives the weightsmatrix, columns are investors and rows are companies;

title 'data outb';

proc print data=outb;run;

data c; set outb;

proc corr out=outc;*computes correlation matrix;

data d; set outc;

if _TYPE_='CORR';

proc means noprint;

output out=outd;   *averagecorrelation for each investor;

data e; set outd;

if _STAT_='MEAN';

proc transpose out=oute;

data f; set oute;

if _NAME_='_TYPE_' then delete;

if _NAME_='_FREQ_' then delete;

ave_corr=col1;

proc means noprint; var ave_corr;

output out=outf mean=ave_corr_cross_investors;*average correlation across investors;

data g; set outf;

ave_corr=(ave_corr_cross_investors*_FREQ_-1)/(_FREQ_-1);*average correlation dropping the diagonal;

proc print; run;

1 ACCEPTED SOLUTION

Accepted Solutions
Super User

## how to compute the average correlation across a very large correlation matrix

Hi. Rick. I am not familar with IML.

If you need IML code, you can post it at IML forum. There is a specialist about IML who also is a SAS employee -- Rich M.

But I think you also can do it under Data Step.

```data a; input investor \$ companyID \$ wt;
cards;
A IBM 0.50
A MSFT 0.40
A GOOG 0.10
A GRPN 0.00
B IBM 0.40
B MSFT 0.60
B GOOG 0.00
B GRPN 0.00
C IBM 0.50
C MSFT 0.00
C GOOG 0.00
C GRPN 0.50
;
run;

proc sort data=a; by companyID;run;
proc transpose data=a out=temp(drop=_name_);
by companyid;
id investor;
var wt;
run;
proc corr data=temp outp=corr(drop=_name_ where=(_type_='CORR')) noprint;
var _numeric_;
run;
data _null_;
set corr end=last;
retain sum .;
array _a{*} _numeric_;
i=1;
do while(_a{i} ne 1);
sum=sum(sum,_a{i});i+1;n+1;
end;
if last then do;
mean=sum/n;
put 'ave_corr= ' mean;
end;
run;

options nomprint nomlogic nosymbolgen;
/*compute the sum of the products of the weights for
each pair of investors */
%macro across;
proc sql noprint;
select distinct investor into : list separated by ' ' from a;

create table want as
select
%do i=1 %to %sysfunc(countw(&list));
%do j=%eval(&i+1) %to %sysfunc(countw(&list));
sum(%scan(&list,&i)*%scan(&list,&j)) as %scan(&list,&i)%scan(&list,&j)
%if &i ne %eval(%sysfunc(countw(&list))-1) or  &j ne %sysfunc(countw(&list)) %then %do;,%end;
%end;
%end;
from temp;
quit;
%mend across;

%across

```

Ksharp

2 REPLIES 2
Super User

## how to compute the average correlation across a very large correlation matrix

Hi. Rick. I am not familar with IML.

If you need IML code, you can post it at IML forum. There is a specialist about IML who also is a SAS employee -- Rich M.

But I think you also can do it under Data Step.

```data a; input investor \$ companyID \$ wt;
cards;
A IBM 0.50
A MSFT 0.40
A GOOG 0.10
A GRPN 0.00
B IBM 0.40
B MSFT 0.60
B GOOG 0.00
B GRPN 0.00
C IBM 0.50
C MSFT 0.00
C GOOG 0.00
C GRPN 0.50
;
run;

proc sort data=a; by companyID;run;
proc transpose data=a out=temp(drop=_name_);
by companyid;
id investor;
var wt;
run;
proc corr data=temp outp=corr(drop=_name_ where=(_type_='CORR')) noprint;
var _numeric_;
run;
data _null_;
set corr end=last;
retain sum .;
array _a{*} _numeric_;
i=1;
do while(_a{i} ne 1);
sum=sum(sum,_a{i});i+1;n+1;
end;
if last then do;
mean=sum/n;
put 'ave_corr= ' mean;
end;
run;

options nomprint nomlogic nosymbolgen;
/*compute the sum of the products of the weights for
each pair of investors */
%macro across;
proc sql noprint;
select distinct investor into : list separated by ' ' from a;

create table want as
select
%do i=1 %to %sysfunc(countw(&list));
%do j=%eval(&i+1) %to %sysfunc(countw(&list));
sum(%scan(&list,&i)*%scan(&list,&j)) as %scan(&list,&i)%scan(&list,&j)
%if &i ne %eval(%sysfunc(countw(&list))-1) or  &j ne %sysfunc(countw(&list)) %then %do;,%end;
%end;
%end;
from temp;
quit;
%mend across;

%across

```

Ksharp

Calcite | Level 5

## how to compute the average correlation across a very large correlation matrix

Thanks once more Ksharp...you've saved me a lot of time...

Discussion stats
• 2 replies
• 2663 views
• 0 likes
• 2 in conversation