I’m trying to compute the average cross-sectional correlation across a large dataset. I may also want to compute a weighted average. Inaddition, my raw data is in stacked form. Assume, for example, I have a stacked list of investors and their portfolio weights (that sum to one for each investor) across all stocks (assume only four stocks).
Investor Stock Portfolio_weight
A IBM 0.50
A MSFT 0.40
A GOOG 0.10
A GRPN 0.00
B IBM 0.40
B MFST 0.60
B GOOG 0.00
B GRPM 0.00
C IBM 0.50
C MSFT 0
C GOGG 0
C GRPN 0.50
I wrote alittle test code (see below) to compute the average correlation between weights across investors. First, I compute a weights matrix where each row is a company and each column is an investor (data outb):
dataoutb
company
Obs ID _NAME_ A B C
1 GOOG wt 0.1 0.0 0.0
2 GRPN wt 0.0 0.0 0.5
3 IBM wt 0.5 0.4 0.5
4 MSFT wt 0.4 0.6 0.0
I then run a proc corr and keep the correlations (data d):
Obs _TYPE_ _NAME_ A B C
1 CORR A 1.00000 0.88684 0.00000
2 CORR B 0.88684 1.00000 -0.19245
3 CORR C 0.00000 -0.19245 1.00000
Because I only want the average of the off diagonal elements,I take the average times N, subtract 1 (the diagonal element) and divide by N-1 to get the average of the off diagonal elements: 0.23 = (0.886+0.00+-0.192)/3 (=ave_corr in data g)
I have four issues:
1. The number of investors in my simple example is three. In my real data, the number is approximately 3,000, which means the correlation matrix will have 9M elements. Is that an issue? (Moreover the number of stocks is approximately 5,000.)
2. My code seems a little clumsy. I welcome any suggestion for alternative approaches.
3. In addition to average correlation, I’d like to compute the sum of the products of the weights for each pair of investors and compute the average of that (e.g., for A and B, 0.1*0 + 0*0 + 0.5*0.4 + 0.4*0.6). Again, there would be approximately 9M of these terms. If I define outb as a 4X3 matrix (in my simple example), I can compute this as outb(transposed)*outb.
After reading about this a bit, this seems like something I should be able to do with proc IML. However, I’m not familiar with proc IML. Any suggestions on how to define data outb as a matrix so I can use proc IML? And if I can do that, how would I’d go about getting the average of the elements in the resulting matrix?
4. If I standardize the weights for each investor (i.e., subtract the mean and divided by the standard deviation), I could use the same method to compute the correlations (i.e., standardized weights matrix transposed *standardized weights matrix). So that might be another approach. Any thoughtson using IML versus my ‘traditional’ sas code?
Thanks inadvance.
Rick
My code
data a; input investor $ companyID $ wt;
cards;
A IBM 0.50
A MSFT 0.40
A GOOG 0.10
A GRPN 0.00
B IBM 0.40
B MSFT 0.60
B GOOG 0.00
B GRPN 0.00
C IBM 0.50
C MSFT 0.00
C GOOG 0.00
C GRPN 0.50
;
run;
data b; set a;
proc sort; by companyID;
proc transpose out=outb;by companyID;id investor; var wt; *outb gives the weightsmatrix, columns are investors and rows are companies;
title 'data outb';
proc print data=outb;run;
data c; set outb;
proc corr out=outc;*computes correlation matrix;
data d; set outc;
if _TYPE_='CORR';
proc means noprint;
output out=outd; *averagecorrelation for each investor;
data e; set outd;
if _STAT_='MEAN';
proc transpose out=oute;
data f; set oute;
if _NAME_='_TYPE_' then delete;
if _NAME_='_FREQ_' then delete;
ave_corr=col1;
proc means noprint; var ave_corr;
output out=outf mean=ave_corr_cross_investors;*average correlation across investors;
data g; set outf;
ave_corr=(ave_corr_cross_investors*_FREQ_-1)/(_FREQ_-1);*average correlation dropping the diagonal;
proc print; run;
Hi. Rick. I am not familar with IML.
If you need IML code, you can post it at IML forum. There is a specialist about IML who also is a SAS employee -- Rich M.
But I think you also can do it under Data Step.
data a; input investor $ companyID $ wt; cards; A IBM 0.50 A MSFT 0.40 A GOOG 0.10 A GRPN 0.00 B IBM 0.40 B MSFT 0.60 B GOOG 0.00 B GRPN 0.00 C IBM 0.50 C MSFT 0.00 C GOOG 0.00 C GRPN 0.50 ; run; proc sort data=a; by companyID;run; proc transpose data=a out=temp(drop=_name_); by companyid; id investor; var wt; run; proc corr data=temp outp=corr(drop=_name_ where=(_type_='CORR')) noprint; var _numeric_; run; data _null_; set corr end=last; retain sum .; array _a{*} _numeric_; i=1; do while(_a{i} ne 1); sum=sum(sum,_a{i});i+1;n+1; end; if last then do; mean=sum/n; put 'ave_corr= ' mean; end; run; options nomprint nomlogic nosymbolgen; /*compute the sum of the products of the weights for each pair of investors */ %macro across; proc sql noprint; select distinct investor into : list separated by ' ' from a; create table want as select %do i=1 %to %sysfunc(countw(&list)); %do j=%eval(&i+1) %to %sysfunc(countw(&list)); sum(%scan(&list,&i)*%scan(&list,&j)) as %scan(&list,&i)%scan(&list,&j) %if &i ne %eval(%sysfunc(countw(&list))-1) or &j ne %sysfunc(countw(&list)) %then %do;,%end; %end; %end; from temp; quit; %mend across; %across
Ksharp
Hi. Rick. I am not familar with IML.
If you need IML code, you can post it at IML forum. There is a specialist about IML who also is a SAS employee -- Rich M.
But I think you also can do it under Data Step.
data a; input investor $ companyID $ wt; cards; A IBM 0.50 A MSFT 0.40 A GOOG 0.10 A GRPN 0.00 B IBM 0.40 B MSFT 0.60 B GOOG 0.00 B GRPN 0.00 C IBM 0.50 C MSFT 0.00 C GOOG 0.00 C GRPN 0.50 ; run; proc sort data=a; by companyID;run; proc transpose data=a out=temp(drop=_name_); by companyid; id investor; var wt; run; proc corr data=temp outp=corr(drop=_name_ where=(_type_='CORR')) noprint; var _numeric_; run; data _null_; set corr end=last; retain sum .; array _a{*} _numeric_; i=1; do while(_a{i} ne 1); sum=sum(sum,_a{i});i+1;n+1; end; if last then do; mean=sum/n; put 'ave_corr= ' mean; end; run; options nomprint nomlogic nosymbolgen; /*compute the sum of the products of the weights for each pair of investors */ %macro across; proc sql noprint; select distinct investor into : list separated by ' ' from a; create table want as select %do i=1 %to %sysfunc(countw(&list)); %do j=%eval(&i+1) %to %sysfunc(countw(&list)); sum(%scan(&list,&i)*%scan(&list,&j)) as %scan(&list,&i)%scan(&list,&j) %if &i ne %eval(%sysfunc(countw(&list))-1) or &j ne %sysfunc(countw(&list)) %then %do;,%end; %end; %end; from temp; quit; %mend across; %across
Ksharp
Thanks once more Ksharp...you've saved me a lot of time...
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.