computing a similarity matrix

coug914 · Posted 12-29-2011 12:34 PM

Iam not familiar with IML, but think it may be well suited for computing asimilarity matrix. Specifically, I have large stack of data that consists of investors, stocks, and the investors’ weights in the stocks. The first part of my code unstacks the data and puts it into what looks like a 4X3 holdings matrix (data outb, which I’ll call “H”).

dataa; input investor $ companyID $ wt;

cards;

A IBM 0.50

A MSFT 0.40

A GOOG 0.10

A GRPN 0.00

B IBM 0.40

B MSFT 0.60

B GOOG 0.00

B GRPN 0.00

C IBM 0.50

C MSFT 0.00

C GOOG 0.00

C GRPN 0.50

;

run;

datab; set a;

procsort; bycompanyID;

proctranspose out=outb;bycompanyID;id investor; varwt; *outb gives the weightsmatrix, columns are investors and rows are companies;

title'data outb';

procprint data=outb;run;

So outb looks like this:

Obs ID _NAME_ A B C

1 GOOG wt 0.1 0.0 0.0

2 GRPN wt 0.0 0.0 0.5

3 IBM wt 0.5 0.4 0.5

4 MSFT wt 0.4 0.6 0.0

If I could define outb as a 4X3 matrix, I’d like to compute a similarity matrix defined as H(transpose)H/(Euclidian column norm*Euclidian columnnorm)

The resulting similarity matrix is (i.e., what I would liketo compute)

	A	B	C
A	1	0.941513	0.545545
B	0.941513	1	0.392232
C	0.545545	0.392232	1

For instance the last element in the first row (A and C’ssimilarity, 0.545545) is

[(0.1*0)+(0*0.5)+(0.5*0.5)+(0.4*0)] / {A} {C}

Where {A} is the square root of the sum of A’s squaredweights=sqrt[0.1*0.1 + 0*0 + 0.5*0.5 + 0.4*0.4]

And {C} is the square root of the sum of C’s squaredweights=sqrt[0*0 + 0.5*0.5 + 0.5*0.5 + 0*0]

This seems like a problem well-suited for proc IML. Anysuggestions on how I’d approach the coding? In addition, my actual data isquite large, e.g., approximately 3000 stocks and 3000 investors, implying 9M elementsin the similarity matrix.

Thanks,

Rick

art297 · Posted 12-29-2011 01:59 PM

Have you looked at proc similarity? e.g., http://support.sas.com/rnd/app/papers/similarityanalysis.pdf

coug914 · Posted 12-29-2011 03:14 PM

Thanks Art - I had taken a look at that document (and a few others regarding proc similarity). Although it seems related, I didn't see a way to use it to compute what I wanted. (Of course, it's very possible it can and I'm just not that familar with this area.)

Rick_SAS · Posted 12-29-2011 02:30 PM

You can certainly do these computations with SAS/IML. Let x be the matrix that you have. Scale each column by dividing by the column norm (this is equivalent to standardizing each variable to have std dev = 1). Then form the cross products:

proc iml;

x = {
0.1 0.0   0.0,
0.0 0.0   0.5,
0.5 0.4   0.5,
0.4 0.6   0.0};

G = x / sqrt(x[##,]); /* standardize */
m = G`*G; /* cross products */
print G, m;

I am not familiar with the SIMILARITY procedure, which is part of SAS/ETS software. However, there is a paper with examples on similarity analysis in SAS:http://support.sas.com/rnd/app/papers/similarityanalysis.pdf

coug914 · Posted 12-29-2011 03:12 PM

Rick - thank you - that certainly seems to work. Can I ask three other questions?

1. Given my "x" matrix will be 5,000 X 3,000, I can't define x as you have above. Given, however, i have the data outb, is there a way to tell proc iml to use that as the matrix? That is, in proc IML can I define the data from a previous data step as a matrix?

2. Once I get m (which will be approximately 9M elements, i.e., 3K X 3K), i'd like to (a) identify the rows (or columns), e.g., know that the first row is investor A, the second row is investor B. Is there a way to do that?

3. I'd like to compute the average similarity element for each manager (i.e., either the average element in each row of m or each column of m) and the average element in the whole matrix.Is there a way to do that? I think if I could get sas to see m as a dataset, I could write code to take the averages...so kind of the oposite of my first question, can I get sas to see "m" as a data set?

Thanks again for your help...

Rick

Rick_SAS · Posted 12-30-2011 06:45 AM

1) To read a SAS data set into a SAS/IML matrix, see http://blogs.sas.com/content/iml/2010/10/04/reading-sas-data-sets/

2) The names of the investors are the names of the variables that form the columns of the data that you are reading in. On the READ statement, use [COLNAME=InvestorNames] after you name the matrix. That stores the investor names in a vector.

READ all var _NUM_ into X[COLNAME=InvestorNames];

See http://blogs.sas.com/content/iml/2010/12/08/computing-covariance-and-correlation-matrices/ for an example.

3) You can write the m matrix to a SAS data set by using the CREATE and APPEND statements. See http://blogs.sas.com/content/iml/2011/04/18/writing-data-from-a-matrix-to-a-sas-data-set/

Incidentally, is there a reason that you don't want to center the variables? If you center the variables by subtracting the column means, the similarity matrix is just the ordinary correlation matrix, which you can compute with PROC CORR.

coug914 · Posted 12-30-2011 10:17 AM

thanks - I'm also computing average correlation, but am following another study in computing the similarity matrix.At least one source suggests that similarity matricies should have all non-negative elements (unlike correlation).

Rick_SAS · Posted 12-30-2011 11:13 AM

Yes. You can show mathematically that the entries of G`*G will be non-negative.

Ksharp · Posted 12-30-2011 01:37 AM

How About:

data a; input investor $ companyID $ wt;
cards;
A IBM 0.50
A MSFT 0.40
A GOOG 0.10
A GRPN 0.00
B IBM 0.40
B MSFT 0.60
B GOOG 0.00
B GRPN 0.00
C IBM 0.50
C MSFT 0.00
C GOOG 0.00
C GRPN 0.50
;
run;


proc sort data=a; by companyID;run;
proc transpose data=a out=temp(drop=_name_);
 by companyid;
 id investor;
 var wt;
run;
proc corr data=temp outp=corr(drop=_name_ where=(_type_='CORR')) noprint;
 var _numeric_;
run;
data _null_;
 set corr end=last;
 retain sum .;
 array _a{*} _numeric_;
 i=1;
 do while(_a{i} ne 1);
  sum=sum(sum,_a{i});i+1;n+1;
 end;
 if last then do;
               mean=sum/n; 
               put 'ave_corr= ' mean;
              end;
run;

options nomprint nomlogic nosymbolgen;
/*compute the sum of the products of the weights for 
each pair of investors */
%macro across;
proc sql noprint;
 select distinct investor into : list separated by ' ' from a;

create table want as
 select 
        %do i=1 %to %sysfunc(countw(&list));
          %do j=%eval(&i+1) %to %sysfunc(countw(&list));
           sum(%scan(&list,&i)*%scan(&list,&j))/(sqrt(uss(%scan(&list,&i)))*sqrt(uss(%scan(&list,&j)))) as %scan(&list,&i)%scan(&list,&j)
           %if &i ne %eval(%sysfunc(countw(&list))-1) or  &j ne %sysfunc(countw(&list)) %then %do;,%end;
          %end;
        %end;
  from temp;
quit;
%mend across;

%across

Ksharp

Ksharp · Posted 12-30-2011 03:58 AM

How about:

data a; input investor $ companyID $ wt;
cards;
A IBM 0.50
A MSFT 0.40
A GOOG 0.10
A GRPN 0.00
B IBM 0.40
B MSFT 0.60
B GOOG 0.00
B GRPN 0.00
C IBM 0.50
C MSFT 0.00
C GOOG 0.00
C GRPN 0.50
;
run;
proc sort data=a; by companyID;run;
proc transpose data=a out=temp(drop=_name_);
 by companyid;
 id investor;
 var wt;
run;
data _null_;
 set temp;
 call symputx(cats('list',_n_),catx(' ',of _numeric_));
 call symputx('end',_n_);
run;


%macro across;
proc iml;

x = {
%do i=1 %to  &end ;
 &&list&i %if &i ne &end %then %do;,%end;
%end;
};
G = x / sqrt(x[##,]); /* standardize*/ 
m = G`*G; /* cross products*/ 
print G, m;quit;
%mend across;

%across

Ksharp

coug914 · Posted 01-01-2012 03:52 PM

What worked and what didn’t

Using the sql approach did not “work” with the large dataset—usingone quarter of actual data (2643 investors and 3685 stocks), the code ran for sixhours before I stopped it. Given I have 50 quarters of data, it would just taketoo long. I also tried my own (actually one of my co-authors) “old school” codewhere we ran loops for every manager with every other manager. That too, tookway to long (my estimate was it would take 37 days for the code to run).

I finally went back to the IML. That really seemed to workwell – thanks both Ksharp and Rick. It took about 3 minutes to handle the matrixcalculations with the large dataset. Also using the IML ncol (number ofcolumns), vecdiag (creates a vector out of the diagonal of the matrix), and sumfunctions allowed me to easily estimate the sum of the elements of the matrixand subtract the sum of the diagonal. The average off diagonal element is thenthe [sum(all)-sum(diagonal)]/(N*(N-1)).

What I can’t seem to get to work is to output the results toa sas dataset. I tried inserting the code suggested by Rick, but with Ksharp’smacro, SAS was unhappy (the createstatements are commented out below). I’dlike to be able to do this, because in my real data, I’ll set a loop aroundthis whole thing and run it for each of the 50 quarters of data.

Thanks again for all your help and happy 2012!

Rick

optionsreplace;

optionsnocenter;

procdatasets kill;

dataa; input investor $ companyID $ wt;

cards;

A IBM 0.5

A MSFT 0.2

A GOOG 0.1

A GRPN 0

A F 0.2

B IBM 0.4

B MSFT 0.5

B GOOG 0

B GRPN 0

B F 0.1

C IBM 0.5

C MSFT 0

C GOOG 0

C GRPN 0.5

C F 0

D IBM 0.1

D MSFT 0.25

D GOOG 0.25

D GRPN 0.1

D F 0.3

;

run;

procsort data=a;by companyID;run;

proctranspose data=aout=temp(drop=_name_);

bycompanyid;

idinvestor;

varwt;

run;

data_null_; settemp;

callsymputx(cats('list',_n_),catx('',of _numeric_));

callsymputx('end',_n_);

run;

%macroacross;

proc iml;

x = {

%doi=1 %to &end ;

&&list&i %if&i ne &end %then %do;,%end;

%end;

};

G = x /sqrt(x[##,]); /* standardize*/

m = G`*G; /*cross products*/

number_columns=ncol(m);

d=vecdiag(m);

sum_all=sum(m);

sum_diag=sum(d);

ave_off_diag=(sum_all-sum_diag)/(number_columns*(number_columns-1));

printnumber_columns sum_all sum_diag ave_off_diag;

/*

create MyDatavar {number_columns sum_all sum_diag};

append;

close MyData;

*/

quit;

%mendacross;

%across

run;

Rick_SAS · Posted 01-02-2012 08:42 AM

There's no reason to use a macro loop to read data into an IML matrix.

Currently you are converting data into character values, concatenating them into (thousands of) macro variables,

and then using those macro variables to define rows of a matrix.

A more efficient approach is to read the numerical values directly from TEMP:

use temp;

read all var _NUM_ into x;

close temp;

I don't know why your CREATE statement isn't working inside the macro loop (which is one of the problems with macros: they are hard to debug!). It works when I do it my way.

coug914 · Posted 01-02-2012 12:23 PM

Thanks Rick - that worked well. I also figured out that I could easily compute the average element for each investor and output that to a dataset. I think I'm all set. Here's the final code I came up with.

Rick

data temp; input companyID $ A B C D;

cards;

F 0.2 0.1 0.0 0.30

GOOG 0.1 0.0 0.0 0.25

GRPN 0.0 0.0 0.5 0.10

IBM 0.5 0.4 0.5 0.10

MSFT 0.2 0.5 0.0 0.25

;

run;

proc iml;

use temp;

read all var _NUM_ into x[COLNAME=InvestorNames];

close temp;

G = x / sqrt(x[##,]); /* standardize*/

m = G`*G; /* cross products*/

number_columns=ncol(m);

d=vecdiag(m);

sum_all=sum(m);

sum_diag=sum(d);

ave_off_diag=(sum_all-sum_diag)/(number_columns*(number_columns-1));

col_mean=m[,:];

investor=t(InvestorNames);

create investor_data var {investor col_mean};

append;

close investor_data;

create matrix_data var {number_columns sum_all sum_diag ave_off_diag};

append;

close matrix_data;

quit;

run;

proc print data=investor_data;

proc print data=matrix_data;

run;

freeliu · Posted 02-02-2015 03:44 AM

Here is a new question. After similarity matrix was gotten, such as Matrix m in this example. How could I output it and reuse this similarity matrix to calculate other vecor? For example

data temp; input investor $ flow;

cards;

A 0.1

B 0.4

C -0.8

D 0.11

E 0.12

F 0.13

;

run;

I just want to calculate a new variable m*flow. Would please give me help ? Thanks!

freeliu · Posted 02-02-2015 04:49 AM

Hi, here is the example; The Vector if following

NAME	y
fdcd_000001	0.006157039
fdcd_000011	0.001510721
fdcd_000021	0.35458659
fdcd_000031	-0.04071315
fdcd_000041	-0.01494467
fdcd_000051	0

The matrix is( Note: the colname of matrix is a mess of vector. Their orders are different). How to give their Product? Thanks!

fdcd_000011	fdcd_000001	fdcd_000021	fdcd_000031	fdcd_000041	fdcd_000051
0.79	0.01	1.09	3.75	0.82	0.43
1.4	0.23	0.86	0.88	1.44	1.19
0	0	0	0	0	0
0	0.09	0	0	0	0.02
0.33	0	0	0	0.33	0.29
0.28	0	0	0	0.27	0.42
0	0	0	0	0	0
0	0	0	0	0	0.04
0	0	0	0	0	0
0.14	0.19	0	0	0.14	0.1
0	0	0	0	0	0
0.19	0	0	0	0.19	0.17
0	0	0	0	0	0
0.12	0.1	0	0	0.12	0.15
0	0	0	0	0	0.06
0	0	0	0	0	0
0.1	0	0	0	0.1	0.09