BookmarkSubscribeRSS Feed
joshmazin
Calcite | Level 5

Hi

 

I am familiar with SQL and VBA, and now trying to learn how to use SAS. I would appreciate your help.

 

I have a data set that is two columns: column ID (set to 1 or 2) and column Value.

Let's say I have 100 observations.

Let's call this data set as array(x,y) where x is rows so goes from 1 to 100, and y is columns so goes from 1 to 2. array(x,1) is the ID column, and array(x,2) is the Value column.

 

I also need a correlation matrix between each of the 100 observations, say corr(i,j).

 

I want to produce the sum of the values of the 100 observations allowing for the correlations between them.

 

I can achieve this by doing matrix multiplication of array(x,2) with itself transposed and with corr(i,j), square-rooted.

 

If I was working in VBA I may do something like:

Total1 = 0

Total2 = 0

For a = 1 to 100

   For b = 1 to 100

      result = array(a,2) * array(b,2) * corr(a,b)

      Total1 = Total1 + result

   Next b

   Total2 = Total2 + Total1

Next a

Total = Total2^(1/2)

 

corr(i,j) needs to be set..

- to 1 where i=j,

- to 0.5 where i <> j and array(i,1) <> array(j,1) i.e. different IDs,

- to 0 otherwise

 

How can I do this in SAS Enterprise? Is using a Do query the best approach or is there a direct matrix multiplication approach? 

 

Many thanks in advance

Josh

 

10 REPLIES 10
FreelanceReinh
Jade | Level 19

Hello @joshmazin and welcome to the SAS Support Communities!

 

I was wondering if Total1 in your VBA code doesn't need to be reset to 0 after each iteration of the outer loop. Assuming that this should be the case, the computation could be coded in SAS like this:

 

/* Create sample data for demonstration */

data have;
call streaminit(27182818);
do _n_=1 to 100;
  id=rand('table',0.5);
  value=rand('integer',1000);
  output;
end;
run;

/* Determine number of observations in HAVE */

data _null_;
call symputx('d',n);
stop;
set have nobs=n;
run;

/* Perform the computation */

data want(keep=total);
array v[&d,2] _temporary_;
set have end=last;
v[_n_,1]=id;
v[_n_,2]=value;
if last;
do i=1 to &d;
  do j=1 to &d;
    if i=j then total+v[i,2]*v[j,2];
    else if v[i,1] ne v[j,1] then total+v[i,2]*v[j,2]/2;
  end;
end;
total=sqrt(total);
run;

The last DATA step writes the IDs and values to a temporary array v, then computes the sum of all products of the form "value1*value2*corr" and finally writes the square root of that total to dataset WANT.

 

joshmazin
Calcite | Level 5

Thank you very much. I will try to apply this to my full code and will update you

Reeza
Super User

If you're looking for a good reference I highly recommend this one:

https://www.amazon.ca/Simulating-Data-SAS-Rick-Wicklin/dp/1612903320

 

His blog posts will also have a lot of good posts on simulating data in SAS. Typically, if you're doing that type of work you're usually using IML, not Base SAS.

https://blogs.sas.com/content/author/rickwicklin/

 


@joshmazin wrote:

Hi

 

I am familiar with SQL and VBA, and now trying to learn how to use SAS. I would appreciate your help.

 

I have a data set that is two columns: column ID (set to 1 or 2) and column Value.

Let's say I have 100 observations.

Let's call this data set as array(x,y) where x is rows so goes from 1 to 100, and y is columns so goes from 1 to 2. array(x,1) is the ID column, and array(x,2) is the Value column.

 

I also need a correlation matrix between each of the 100 observations, say corr(i,j).

 

I want to produce the sum of the values of the 100 observations allowing for the correlations between them.

 

I can achieve this by doing matrix multiplication of array(x,2) with itself transposed and with corr(i,j), square-rooted.

 

If I was working in VBA I may do something like:

Total1 = 0

Total2 = 0

For a = 1 to 100

   For b = 1 to 100

      result = array(a,2) * array(b,2) * corr(a,b)

      Total1 = Total1 + result

   Next b

   Total2 = Total2 + Total1

Next a

Total = Total2^(1/2)

 

corr(i,j) needs to be set..

- to 1 where i=j,

- to 0.5 where i <> j and array(i,1) <> array(j,1) i.e. different IDs,

- to 0 otherwise

 

How can I do this in SAS Enterprise? Is using a Do query the best approach or is there a direct matrix multiplication approach? 

 

Many thanks in advance

Josh

 


 

PGStats
Opal | Level 21

Or if you prefer an equivalent calculation in SQL:

 

proc sql;
select 
    sqrt(uss(a.value)/sqrt(count(a.value)) +
    sum(a.value*b.value*(a.id < b.id)))
from have as a, have as b;
quit;

Note: USS(x) is the same as SUM(x*x).

Note: In SAS, logical expressions evaluate to 1 when true and 0 otherwise.

PG
FreelanceReinh
Jade | Level 19

Hi @PGStats,

 

This is ingenious! Somehow I didn't grasp the meaning of those "correlations" (too tired I guess) and thus overlooked the possible simplification. Likewise, the idea of a cross join escaped me. Kudos!

PGStats
Opal | Level 21

Thank you @FreelanceReinh . It helped that I could validate my results against yours.

PG
joshmazin
Calcite | Level 5
Thank you very much. I will try to apply this to my full code.
Could you explain what you are doing at each stage of your calc as I am not familiar with some of these functions?
ALso I dont get the same result using your approach as the previous poster's post, despite your comment that you checked yours against theirs?
Many thanks
joshmazin
Calcite | Level 5
Will these approaches still work if i have 4 million observations instead of 100?
how long should it take to calculate for 4 million observations?
FreelanceReinh
Jade | Level 19

@joshmazin wrote:
Will these approaches still work if i have 4 million observations instead of 100?
how long should it take to calculate for 4 million observations?

At the very least you should then take advantage of symmetry. In terms of my solution this could look like this:

data want(keep=total);
array v[&d,2] _temporary_;
set have end=last;
v[_n_,1]=id;
v[_n_,2]=value;
if last;
do i=1 to &d;
  total+v[i,2]**2;
  do j=i+1 to &d;
    if v[i,1] ne v[j,1] then total+v[i,2]*v[j,2];
  end;
end;
total=sqrt(total);
run;

But it's well possible that you'll need to further improve efficiency, perhaps by switching to SAS/IML (which I don't have) as suggested by Reeza. In any case I would increase the number n of observations gradually, observe the run time in each step and try to extrapolate (based on the assumption that it will be O(n²)).