Hi
I am familiar with SQL and VBA, and now trying to learn how to use SAS. I would appreciate your help.
I have a data set that is two columns: column ID (set to 1 or 2) and column Value.
Let's say I have 100 observations.
Let's call this data set as array(x,y) where x is rows so goes from 1 to 100, and y is columns so goes from 1 to 2. array(x,1) is the ID column, and array(x,2) is the Value column.
I also need a correlation matrix between each of the 100 observations, say corr(i,j).
I want to produce the sum of the values of the 100 observations allowing for the correlations between them.
I can achieve this by doing matrix multiplication of array(x,2) with itself transposed and with corr(i,j), square-rooted.
If I was working in VBA I may do something like:
Total1 = 0
Total2 = 0
For a = 1 to 100
For b = 1 to 100
result = array(a,2) * array(b,2) * corr(a,b)
Total1 = Total1 + result
Next b
Total2 = Total2 + Total1
Next a
Total = Total2^(1/2)
corr(i,j) needs to be set..
- to 1 where i=j,
- to 0.5 where i <> j and array(i,1) <> array(j,1) i.e. different IDs,
- to 0 otherwise
How can I do this in SAS Enterprise? Is using a Do query the best approach or is there a direct matrix multiplication approach?
Many thanks in advance
Josh
Hello @joshmazin and welcome to the SAS Support Communities!
I was wondering if Total1 in your VBA code doesn't need to be reset to 0 after each iteration of the outer loop. Assuming that this should be the case, the computation could be coded in SAS like this:
/* Create sample data for demonstration */
data have;
call streaminit(27182818);
do _n_=1 to 100;
id=rand('table',0.5);
value=rand('integer',1000);
output;
end;
run;
/* Determine number of observations in HAVE */
data _null_;
call symputx('d',n);
stop;
set have nobs=n;
run;
/* Perform the computation */
data want(keep=total);
array v[&d,2] _temporary_;
set have end=last;
v[_n_,1]=id;
v[_n_,2]=value;
if last;
do i=1 to &d;
do j=1 to &d;
if i=j then total+v[i,2]*v[j,2];
else if v[i,1] ne v[j,1] then total+v[i,2]*v[j,2]/2;
end;
end;
total=sqrt(total);
run;
The last DATA step writes the IDs and values to a temporary array v, then computes the sum of all products of the form "value1*value2*corr" and finally writes the square root of that total to dataset WANT.
Thank you very much. I will try to apply this to my full code and will update you
If you're looking for a good reference I highly recommend this one:
https://www.amazon.ca/Simulating-Data-SAS-Rick-Wicklin/dp/1612903320
His blog posts will also have a lot of good posts on simulating data in SAS. Typically, if you're doing that type of work you're usually using IML, not Base SAS.
https://blogs.sas.com/content/author/rickwicklin/
@joshmazin wrote:
Hi
I am familiar with SQL and VBA, and now trying to learn how to use SAS. I would appreciate your help.
I have a data set that is two columns: column ID (set to 1 or 2) and column Value.
Let's say I have 100 observations.
Let's call this data set as array(x,y) where x is rows so goes from 1 to 100, and y is columns so goes from 1 to 2. array(x,1) is the ID column, and array(x,2) is the Value column.
I also need a correlation matrix between each of the 100 observations, say corr(i,j).
I want to produce the sum of the values of the 100 observations allowing for the correlations between them.
I can achieve this by doing matrix multiplication of array(x,2) with itself transposed and with corr(i,j), square-rooted.
If I was working in VBA I may do something like:
Total1 = 0
Total2 = 0
For a = 1 to 100
For b = 1 to 100
result = array(a,2) * array(b,2) * corr(a,b)
Total1 = Total1 + result
Next b
Total2 = Total2 + Total1
Next a
Total = Total2^(1/2)
corr(i,j) needs to be set..
- to 1 where i=j,
- to 0.5 where i <> j and array(i,1) <> array(j,1) i.e. different IDs,
- to 0 otherwise
How can I do this in SAS Enterprise? Is using a Do query the best approach or is there a direct matrix multiplication approach?
Many thanks in advance
Josh
Or if you prefer an equivalent calculation in SQL:
proc sql;
select
sqrt(uss(a.value)/sqrt(count(a.value)) +
sum(a.value*b.value*(a.id < b.id)))
from have as a, have as b;
quit;
Note: USS(x) is the same as SUM(x*x).
Note: In SAS, logical expressions evaluate to 1 when true and 0 otherwise.
Hi @PGStats,
This is ingenious! Somehow I didn't grasp the meaning of those "correlations" (too tired I guess) and thus overlooked the possible simplification. Likewise, the idea of a cross join escaped me. Kudos!
Thank you @FreelanceReinh . It helped that I could validate my results against yours.
@joshmazin wrote:
Will these approaches still work if i have 4 million observations instead of 100?
how long should it take to calculate for 4 million observations?
At the very least you should then take advantage of symmetry. In terms of my solution this could look like this:
data want(keep=total);
array v[&d,2] _temporary_;
set have end=last;
v[_n_,1]=id;
v[_n_,2]=value;
if last;
do i=1 to &d;
total+v[i,2]**2;
do j=i+1 to &d;
if v[i,1] ne v[j,1] then total+v[i,2]*v[j,2];
end;
end;
total=sqrt(total);
run;
But it's well possible that you'll need to further improve efficiency, perhaps by switching to SAS/IML (which I don't have) as suggested by Reeza. In any case I would increase the number n of observations gradually, observe the run time in each step and try to extrapolate (based on the assumption that it will be O(n²)).
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.