Hi
I am familiar with SQL and VBA, and now trying to learn how to use SAS. I would appreciate your help.
I have a data set that is two columns: column ID (set to 1 or 2) and column Value.
Let's say I have 100 observations.
Let's call this data set as array(x,y) where x is rows so goes from 1 to 100, and y is columns so goes from 1 to 2. array(x,1) is the ID column, and array(x,2) is the Value column.
I also need a correlation matrix between each of the 100 observations, say corr(i,j).
I want to produce the sum of the values of the 100 observations allowing for the correlations between them.
I can achieve this by doing matrix multiplication of array(x,2) with itself transposed and with corr(i,j), square-rooted.
If I was working in VBA I may do something like:
Total1 = 0
Total2 = 0
For a = 1 to 100
For b = 1 to 100
result = array(a,2) * array(b,2) * corr(a,b)
Total1 = Total1 + result
Next b
Total2 = Total2 + Total1
Next a
Total = Total2^(1/2)
corr(i,j) needs to be set..
- to 1 where i=j,
- to 0.5 where i <> j and array(i,1) <> array(j,1) i.e. different IDs,
- to 0 otherwise
How can I do this in SAS Enterprise? Is using a Do query the best approach or is there a direct matrix multiplication approach?
Many thanks in advance
Josh
Hello @joshmazin and welcome to the SAS Support Communities!
I was wondering if Total1 in your VBA code doesn't need to be reset to 0 after each iteration of the outer loop. Assuming that this should be the case, the computation could be coded in SAS like this:
/* Create sample data for demonstration */
data have;
call streaminit(27182818);
do _n_=1 to 100;
id=rand('table',0.5);
value=rand('integer',1000);
output;
end;
run;
/* Determine number of observations in HAVE */
data _null_;
call symputx('d',n);
stop;
set have nobs=n;
run;
/* Perform the computation */
data want(keep=total);
array v[&d,2] _temporary_;
set have end=last;
v[_n_,1]=id;
v[_n_,2]=value;
if last;
do i=1 to &d;
do j=1 to &d;
if i=j then total+v[i,2]*v[j,2];
else if v[i,1] ne v[j,1] then total+v[i,2]*v[j,2]/2;
end;
end;
total=sqrt(total);
run;
The last DATA step writes the IDs and values to a temporary array v, then computes the sum of all products of the form "value1*value2*corr" and finally writes the square root of that total to dataset WANT.
Thank you very much. I will try to apply this to my full code and will update you
If you're looking for a good reference I highly recommend this one:
https://www.amazon.ca/Simulating-Data-SAS-Rick-Wicklin/dp/1612903320
His blog posts will also have a lot of good posts on simulating data in SAS. Typically, if you're doing that type of work you're usually using IML, not Base SAS.
https://blogs.sas.com/content/author/rickwicklin/
@joshmazin wrote:
Hi
I am familiar with SQL and VBA, and now trying to learn how to use SAS. I would appreciate your help.
I have a data set that is two columns: column ID (set to 1 or 2) and column Value.
Let's say I have 100 observations.
Let's call this data set as array(x,y) where x is rows so goes from 1 to 100, and y is columns so goes from 1 to 2. array(x,1) is the ID column, and array(x,2) is the Value column.
I also need a correlation matrix between each of the 100 observations, say corr(i,j).
I want to produce the sum of the values of the 100 observations allowing for the correlations between them.
I can achieve this by doing matrix multiplication of array(x,2) with itself transposed and with corr(i,j), square-rooted.
If I was working in VBA I may do something like:
Total1 = 0
Total2 = 0
For a = 1 to 100
For b = 1 to 100
result = array(a,2) * array(b,2) * corr(a,b)
Total1 = Total1 + result
Next b
Total2 = Total2 + Total1
Next a
Total = Total2^(1/2)
corr(i,j) needs to be set..
- to 1 where i=j,
- to 0.5 where i <> j and array(i,1) <> array(j,1) i.e. different IDs,
- to 0 otherwise
How can I do this in SAS Enterprise? Is using a Do query the best approach or is there a direct matrix multiplication approach?
Many thanks in advance
Josh
Or if you prefer an equivalent calculation in SQL:
proc sql;
select
sqrt(uss(a.value)/sqrt(count(a.value)) +
sum(a.value*b.value*(a.id < b.id)))
from have as a, have as b;
quit;
Note: USS(x) is the same as SUM(x*x).
Note: In SAS, logical expressions evaluate to 1 when true and 0 otherwise.
Hi @PGStats,
This is ingenious! Somehow I didn't grasp the meaning of those "correlations" (too tired I guess) and thus overlooked the possible simplification. Likewise, the idea of a cross join escaped me. Kudos!
Thank you @FreelanceReinh . It helped that I could validate my results against yours.
@joshmazin wrote:
Will these approaches still work if i have 4 million observations instead of 100?
how long should it take to calculate for 4 million observations?
At the very least you should then take advantage of symmetry. In terms of my solution this could look like this:
data want(keep=total);
array v[&d,2] _temporary_;
set have end=last;
v[_n_,1]=id;
v[_n_,2]=value;
if last;
do i=1 to &d;
total+v[i,2]**2;
do j=i+1 to &d;
if v[i,1] ne v[j,1] then total+v[i,2]*v[j,2];
end;
end;
total=sqrt(total);
run;
But it's well possible that you'll need to further improve efficiency, perhaps by switching to SAS/IML (which I don't have) as suggested by Reeza. In any case I would increase the number n of observations gradually, observe the run time in each step and try to extrapolate (based on the assumption that it will be O(n²)).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.