BookmarkSubscribeRSS Feed
joshmazin
Calcite | Level 5

Hi

 

I am familiar with SQL and VBA, and now trying to learn how to use SAS. I would appreciate your help.

 

I have a data set that is two columns: column ID (set to 1 or 2) and column Value.

Let's say I have 100 observations.

Let's call this data set as array(x,y) where x is rows so goes from 1 to 100, and y is columns so goes from 1 to 2. array(x,1) is the ID column, and array(x,2) is the Value column.

 

I also need a correlation matrix between each of the 100 observations, say corr(i,j).

 

I want to produce the sum of the values of the 100 observations allowing for the correlations between them.

 

I can achieve this by doing matrix multiplication of array(x,2) with itself transposed and with corr(i,j), square-rooted.

 

If I was working in VBA I may do something like:

Total1 = 0

Total2 = 0

For a = 1 to 100

   For b = 1 to 100

      result = array(a,2) * array(b,2) * corr(a,b)

      Total1 = Total1 + result

   Next b

   Total2 = Total2 + Total1

Next a

Total = Total2^(1/2)

 

corr(i,j) needs to be set..

- to 1 where i=j,

- to 0.5 where i <> j and array(i,1) <> array(j,1) i.e. different IDs,

- to 0 otherwise

 

How can I do this in SAS Enterprise? Is using a Do query the best approach or is there a direct matrix multiplication approach? 

 

Many thanks in advance

Josh

 

10 REPLIES 10
FreelanceReinh
Jade | Level 19

Hello @joshmazin and welcome to the SAS Support Communities!

 

I was wondering if Total1 in your VBA code doesn't need to be reset to 0 after each iteration of the outer loop. Assuming that this should be the case, the computation could be coded in SAS like this:

 

/* Create sample data for demonstration */

data have;
call streaminit(27182818);
do _n_=1 to 100;
  id=rand('table',0.5);
  value=rand('integer',1000);
  output;
end;
run;

/* Determine number of observations in HAVE */

data _null_;
call symputx('d',n);
stop;
set have nobs=n;
run;

/* Perform the computation */

data want(keep=total);
array v[&d,2] _temporary_;
set have end=last;
v[_n_,1]=id;
v[_n_,2]=value;
if last;
do i=1 to &d;
  do j=1 to &d;
    if i=j then total+v[i,2]*v[j,2];
    else if v[i,1] ne v[j,1] then total+v[i,2]*v[j,2]/2;
  end;
end;
total=sqrt(total);
run;

The last DATA step writes the IDs and values to a temporary array v, then computes the sum of all products of the form "value1*value2*corr" and finally writes the square root of that total to dataset WANT.

 

joshmazin
Calcite | Level 5

Thank you very much. I will try to apply this to my full code and will update you

Reeza
Super User

If you're looking for a good reference I highly recommend this one:

https://www.amazon.ca/Simulating-Data-SAS-Rick-Wicklin/dp/1612903320

 

His blog posts will also have a lot of good posts on simulating data in SAS. Typically, if you're doing that type of work you're usually using IML, not Base SAS.

https://blogs.sas.com/content/author/rickwicklin/

 


@joshmazin wrote:

Hi

 

I am familiar with SQL and VBA, and now trying to learn how to use SAS. I would appreciate your help.

 

I have a data set that is two columns: column ID (set to 1 or 2) and column Value.

Let's say I have 100 observations.

Let's call this data set as array(x,y) where x is rows so goes from 1 to 100, and y is columns so goes from 1 to 2. array(x,1) is the ID column, and array(x,2) is the Value column.

 

I also need a correlation matrix between each of the 100 observations, say corr(i,j).

 

I want to produce the sum of the values of the 100 observations allowing for the correlations between them.

 

I can achieve this by doing matrix multiplication of array(x,2) with itself transposed and with corr(i,j), square-rooted.

 

If I was working in VBA I may do something like:

Total1 = 0

Total2 = 0

For a = 1 to 100

   For b = 1 to 100

      result = array(a,2) * array(b,2) * corr(a,b)

      Total1 = Total1 + result

   Next b

   Total2 = Total2 + Total1

Next a

Total = Total2^(1/2)

 

corr(i,j) needs to be set..

- to 1 where i=j,

- to 0.5 where i <> j and array(i,1) <> array(j,1) i.e. different IDs,

- to 0 otherwise

 

How can I do this in SAS Enterprise? Is using a Do query the best approach or is there a direct matrix multiplication approach? 

 

Many thanks in advance

Josh

 


 

PGStats
Opal | Level 21

Or if you prefer an equivalent calculation in SQL:

 

proc sql;
select 
    sqrt(uss(a.value)/sqrt(count(a.value)) +
    sum(a.value*b.value*(a.id < b.id)))
from have as a, have as b;
quit;

Note: USS(x) is the same as SUM(x*x).

Note: In SAS, logical expressions evaluate to 1 when true and 0 otherwise.

PG
FreelanceReinh
Jade | Level 19

Hi @PGStats,

 

This is ingenious! Somehow I didn't grasp the meaning of those "correlations" (too tired I guess) and thus overlooked the possible simplification. Likewise, the idea of a cross join escaped me. Kudos!

PGStats
Opal | Level 21

Thank you @FreelanceReinh . It helped that I could validate my results against yours.

PG
joshmazin
Calcite | Level 5
Thank you very much. I will try to apply this to my full code.
Could you explain what you are doing at each stage of your calc as I am not familiar with some of these functions?
ALso I dont get the same result using your approach as the previous poster's post, despite your comment that you checked yours against theirs?
Many thanks
joshmazin
Calcite | Level 5
Will these approaches still work if i have 4 million observations instead of 100?
how long should it take to calculate for 4 million observations?
FreelanceReinh
Jade | Level 19

@joshmazin wrote:
Will these approaches still work if i have 4 million observations instead of 100?
how long should it take to calculate for 4 million observations?

At the very least you should then take advantage of symmetry. In terms of my solution this could look like this:

data want(keep=total);
array v[&d,2] _temporary_;
set have end=last;
v[_n_,1]=id;
v[_n_,2]=value;
if last;
do i=1 to &d;
  total+v[i,2]**2;
  do j=i+1 to &d;
    if v[i,1] ne v[j,1] then total+v[i,2]*v[j,2];
  end;
end;
total=sqrt(total);
run;

But it's well possible that you'll need to further improve efficiency, perhaps by switching to SAS/IML (which I don't have) as suggested by Reeza. In any case I would increase the number n of observations gradually, observe the run time in each step and try to extrapolate (based on the assumption that it will be O(n²)).

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 1210 views
  • 1 like
  • 4 in conversation