Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 12-29-2011 12:34 PM
(3685 views)

**Iam not familiar with IML, but think it may be well suited for computing asimilarity matrix. Specifically, I have large stack of data that consists of investors, stocks, and the investors’ weights in the stocks. The first part of my code unstacks the data and puts it into what looks like a 4X3 holdings matrix (data outb, which I’ll call “H”).**

** **

**data**a; input investor $ companyID $ wt;

cards;

A IBM 0.50

A MSFT 0.40

A GOOG 0.10

A GRPN 0.00

B IBM 0.40

B MSFT 0.60

B GOOG 0.00

B GRPN 0.00

C IBM 0.50

C MSFT 0.00

C GOOG 0.00

C GRPN 0.50

;

**run**;

**data**b; set a;

**procsort**; bycompanyID;

**proctranspose** out=outb;bycompanyID;id investor; varwt; *outb gives the weightsmatrix, columns are investors and rows are companies;

title'data outb';

**procprint** data=outb;**run**;

So outb looks like this:

Obs ID _NAME_ A B C

1 GOOG wt 0.1 0.0 0.0

2 GRPN wt 0.0 0.0 0.5

3 IBM wt 0.5 0.4 0.5

4 MSFT wt 0.4 0.6 0.0

If I could define outb as a 4X3 matrix, I’d like to compute a similarity matrix defined as H(transpose)H/(Euclidian column norm*Euclidian columnnorm)

The resulting similarity matrix is (i.e., what I would liketo compute)

A | B | C | |

A | 1 | 0.941513 | 0.545545 |

B | 0.941513 | 1 | 0.392232 |

C | 0.545545 | 0.392232 | 1 |

For instance the last element in the first row (A and C’ssimilarity, 0.545545) is

[(0.1*0)+(0*0.5)+(0.5*0.5)+(0.4*0)] / {A} {C}

Where {A} is the square root of the sum of A’s squaredweights=sqrt[0.1*0.1 + 0*0 + 0.5*0.5 + 0.4*0.4]

And {C} is the square root of the sum of C’s squaredweights=sqrt[0*0 + 0.5*0.5 + 0.5*0.5 + 0*0]

This seems like a problem well-suited for proc IML. Anysuggestions on how I’d approach the coding? In addition, my actual data isquite large, e.g., approximately 3000 stocks and 3000 investors, implying 9M elementsin the similarity matrix.

Thanks,

Rick

17 REPLIES 17

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You can certainly do these computations with SAS/IML. Let x be the matrix that you have. Scale each column by dividing by the column norm (this is equivalent to standardizing each variable to have std dev = 1). Then form the cross products:

proc iml;

x = {

0.1 0.0 0.0,

0.0 0.0 0.5,

0.5 0.4 0.5,

0.4 0.6 0.0};

G = x / sqrt(x[##,]); /* standardize */

m = G`*G; /* cross products */

print G, m;

I am not familiar with the SIMILARITY procedure, which is part of SAS/ETS software. However, there is a paper with examples on similarity analysis in SAS:http://support.sas.com/rnd/app/papers/similarityanalysis.pdf

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Rick - thank you - that certainly seems to work. Can I ask three other questions?

1. Given my "x" matrix will be 5,000 X 3,000, I can't define x as you have above. Given, however, i have the data outb, is there a way to tell proc iml to use that as the matrix? That is, in proc IML can I define the data from a previous data step as a matrix?

2. Once I get m (which will be approximately 9M elements, i.e., 3K X 3K), i'd like to (a) identify the rows (or columns), e.g., know that the first row is investor A, the second row is investor B. Is there a way to do that?

3. I'd like to compute the average similarity element for each manager (i.e., either the average element in each row of m or each column of m) and the average element in the whole matrix.Is there a way to do that? I think if I could get sas to see m as a dataset, I could write code to take the averages...so kind of the oposite of my first question, can I get sas to see "m" as a data set?

Thanks again for your help...

Rick

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

1) To read a SAS data set into a SAS/IML matrix, see http://blogs.sas.com/content/iml/2010/10/04/reading-sas-data-sets/

2) The names of the investors are the names of the variables that form the columns of the data that you are reading in. On the READ statement, use [COLNAME=InvestorNames] after you name the matrix. That stores the investor names in a vector.

READ all var _NUM_ into X[COLNAME=InvestorNames];

See http://blogs.sas.com/content/iml/2010/12/08/computing-covariance-and-correlation-matrices/ for an example.

3) You can write the m matrix to a SAS data set by using the CREATE and APPEND statements. See http://blogs.sas.com/content/iml/2011/04/18/writing-data-from-a-matrix-to-a-sas-data-set/

Incidentally, is there a reason that you don't want to center the variables? If you center the variables by subtracting the column means, the similarity matrix is just the ordinary correlation matrix, which you can compute with PROC CORR.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Yes. You can show mathematically that the entries of G`*G will be non-negative.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

How About:

data a; input investor $ companyID $ wt; cards; A IBM 0.50 A MSFT 0.40 A GOOG 0.10 A GRPN 0.00 B IBM 0.40 B MSFT 0.60 B GOOG 0.00 B GRPN 0.00 C IBM 0.50 C MSFT 0.00 C GOOG 0.00 C GRPN 0.50 ; run; proc sort data=a; by companyID;run; proc transpose data=a out=temp(drop=_name_); by companyid; id investor; var wt; run; proc corr data=temp outp=corr(drop=_name_ where=(_type_='CORR')) noprint; var _numeric_; run; data _null_; set corr end=last; retain sum .; array _a{*} _numeric_; i=1; do while(_a{i} ne 1); sum=sum(sum,_a{i});i+1;n+1; end; if last then do; mean=sum/n; put 'ave_corr= ' mean; end; run; options nomprint nomlogic nosymbolgen; /*compute the sum of the products of the weights for each pair of investors */ %macro across; proc sql noprint; select distinct investor into : list separated by ' ' from a; create table want as select %do i=1 %to %sysfunc(countw(&list)); %do j=%eval(&i+1) %to %sysfunc(countw(&list)); sum(%scan(&list,&i)*%scan(&list,&j))/(sqrt(uss(%scan(&list,&i)))*sqrt(uss(%scan(&list,&j)))) as %scan(&list,&i)%scan(&list,&j) %if &i ne %eval(%sysfunc(countw(&list))-1) or &j ne %sysfunc(countw(&list)) %then %do;,%end; %end; %end; from temp; quit; %mend across; %across

Ksharp

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

How about:

data a; input investor $ companyID $ wt; cards; A IBM 0.50 A MSFT 0.40 A GOOG 0.10 A GRPN 0.00 B IBM 0.40 B MSFT 0.60 B GOOG 0.00 B GRPN 0.00 C IBM 0.50 C MSFT 0.00 C GOOG 0.00 C GRPN 0.50 ; run; proc sort data=a; by companyID;run; proc transpose data=a out=temp(drop=_name_); by companyid; id investor; var wt; run; data _null_; set temp; call symputx(cats('list',_n_),catx(' ',of _numeric_)); call symputx('end',_n_); run; %macro across; proc iml; x = { %do i=1 %to &end ; &&list&i %if &i ne &end %then %do;,%end; %end; }; G = x / sqrt(x[##,]); /* standardize*/ m = G`*G; /* cross products*/ print G, m;quit; %mend across; %across

Ksharp

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

What worked and what didn’t

Using the sql approach did not “work” with the large dataset—usingone quarter of actual data (2643 investors and 3685 stocks), the code ran for sixhours before I stopped it. Given I have 50 quarters of data, it would just taketoo long. I also tried my own (actually one of my co-authors) “old school” codewhere we ran loops for every manager with every other manager. That too, tookway to long (my estimate was it would take 37 days for the code to run).

I finally went back to the IML. That really seemed to workwell – thanks both Ksharp and Rick. It took about 3 minutes to handle the matrixcalculations with the large dataset. Also using the IML ncol (number ofcolumns), vecdiag (creates a vector out of the diagonal of the matrix), and sumfunctions allowed me to easily estimate the sum of the elements of the matrixand subtract the sum of the diagonal. The average off diagonal element is thenthe [sum(all)-sum(diagonal)]/(N*(N-1)).

What I can’t seem to get to work is to output the results toa sas dataset. I tried inserting the code suggested by Rick, but with Ksharp’smacro, SAS was unhappy (the createstatements are commented out below). I’dlike to be able to do this, because in my real data, I’ll set a loop aroundthis whole thing and run it for each of the 50 quarters of data.

Thanks again for all your help and happy 2012!

Rick

optionsreplace;

optionsnocenter;

**procdatasets** kill;

**data**a; input investor $ companyID $ wt;

cards;

A IBM 0.5

A MSFT 0.2

A GOOG 0.1

A GRPN 0

A F 0.2

B IBM 0.4

B MSFT 0.5

B GOOG 0

B GRPN 0

B F 0.1

C IBM 0.5

C MSFT 0

C GOOG 0

C GRPN 0.5

C F 0

D IBM 0.1

D MSFT 0.25

D GOOG 0.25

D GRPN 0.1

D F 0.3

;

**run**;

**procsort** data=a;by companyID;**run**;

**proctranspose** data=aout=temp(drop=_name_);

bycompanyid;

idinvestor;

varwt;

**run**;

**data**_null_; settemp;

callsymputx(cats('list',_n_),catx('',of _numeric_));

callsymputx('end',_n_);

**run**;

**%macro**** across**;

proc iml;

x = {

%doi=**1** %to &end ;

&&list&i %if&i ne &end %then %do;,%end;

%end;

};

G = x /sqrt(x[##,]); /* standardize*/

m = G`*G; /*cross products*/

number_columns=ncol(m);

d=vecdiag(m);

sum_all=sum(m);

sum_diag=sum(d);

ave_off_diag=(sum_all-sum_diag)/(number_columns*(number_columns-**1**));

printnumber_columns sum_all sum_diag ave_off_diag;

/*

create MyDatavar {number_columns sum_all sum_diag};

append;

close MyData;

*/

quit;

**%mend**across;

%*across*

**run**;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

There's no reason to use a macro loop to read data into an IML matrix.

Currently you are converting data into character values, concatenating them into (thousands of) macro variables,

and then using those macro variables to define rows of a matrix.

A more efficient approach is to read the numerical values directly from TEMP:

use temp;

read all var _NUM_ into x;

close temp;

I don't know why your CREATE statement isn't working inside the macro loop (which is one of the problems with macros: they are hard to debug!). It works when I do it my way.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks Rick - that worked well. I also figured out that I could easily compute the average element for each investor and output that to a dataset. I think I'm all set. Here's the final code I came up with.

Rick

data temp; input companyID $ A B C D;

cards;

F 0.2 0.1 0.0 0.30

GOOG 0.1 0.0 0.0 0.25

GRPN 0.0 0.0 0.5 0.10

IBM 0.5 0.4 0.5 0.10

MSFT 0.2 0.5 0.0 0.25

;

run;

proc iml;

use temp;

read all var _NUM_ into x[COLNAME=InvestorNames];

close temp;

G = x / sqrt(x[##,]); /* standardize*/

m = G`*G; /* cross products*/

number_columns=ncol(m);

d=vecdiag(m);

sum_all=sum(m);

sum_diag=sum(d);

ave_off_diag=(sum_all-sum_diag)/(number_columns*(number_columns-1));

col_mean=m[,:];

investor=t(InvestorNames);

create investor_data var {investor col_mean};

append;

close investor_data;

create matrix_data var {number_columns sum_all sum_diag ave_off_diag};

append;

close matrix_data;

quit;

run;

proc print data=investor_data;

proc print data=matrix_data;

run;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Here is a new question. After similarity matrix was gotten, such as Matrix m in this example. How could I output it and reuse this similarity matrix to calculate other vecor? For example

data temp; input investor $ flow;

cards;

A 0.1

B 0.4

C -0.8

D 0.11

E 0.12

F 0.13

;

run;

I just want to calculate a new variable m*flow. Would please give me help ? Thanks!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi, here is the example; The Vector if following

NAME | y |

fdcd_000001 | 0.006157039 |

fdcd_000011 | 0.001510721 |

fdcd_000021 | 0.35458659 |

fdcd_000031 | -0.04071315 |

fdcd_000041 | -0.01494467 |

fdcd_000051 | 0 |

The matrix is( Note: the colname of matrix is a mess of vector. Their orders are different). How to give their Product? Thanks!

fdcd_000011 | fdcd_000001 | fdcd_000021 | fdcd_000031 | fdcd_000041 | fdcd_000051 |

0.79 | 0.01 | 1.09 | 3.75 | 0.82 | 0.43 |

1.4 | 0.23 | 0.86 | 0.88 | 1.44 | 1.19 |

0 | 0 | 0 | 0 | 0 | 0 |

0 | 0.09 | 0 | 0 | 0 | 0.02 |

0.33 | 0 | 0 | 0 | 0.33 | 0.29 |

0.28 | 0 | 0 | 0 | 0.27 | 0.42 |

0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 0 | 0 | 0.04 |

0 | 0 | 0 | 0 | 0 | 0 |

0.14 | 0.19 | 0 | 0 | 0.14 | 0.1 |

0 | 0 | 0 | 0 | 0 | 0 |

0.19 | 0 | 0 | 0 | 0.19 | 0.17 |

0 | 0 | 0 | 0 | 0 | 0 |

0.12 | 0.1 | 0 | 0 | 0.12 | 0.15 |

0 | 0 | 0 | 0 | 0 | 0.06 |

0 | 0 | 0 | 0 | 0 | 0 |

0.1 | 0 | 0 | 0 | 0.1 | 0.09 |

**Available on demand!**

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.