I have a time-series variable x as follows.
data series;
do t=1 to 5000;
x=rannor(1);
output;
end;
run;
For each t, I want to estimate x's second moment using observations from 1 to t and a normal kernel as follows.
proc iml;
use series;
read all var _all_;
close;
s2=1:5000;
do s=1 to 5000;
s2[s]=pdf("norm",1:s,s,100)*x[1:s]##2/sum(pdf("norm",1:s,s,100));
end;
create series var{t x s2};
append;
close;
quit;
So I load series and apply this recursive loop since the second moment at t requires observations from 1 to t (x[1:s]) as well as the kernel weighting information (pdf("norm",1:s,s,100)). Is there any way to wisely alter this do loop by its equivalent vectorized method? This way takes too much time for a large T>5000. Thanks.
It is possible to represent these operations in terms of matrix multiplication, but if you have N observations the matrix will be NxN, which will get big. I suspect the main performance bottleneck is not the loop but the fact that you are evaluating the normal PDF so many times. For N observations, you are evaluating the PDF N(N+1)/2 times. This is unnecessary. You only need to evaluate the PDF at N locations. You can then use the properties of the normal PDF to reuse the values. (You also can compute x^2 once instead of multiple times.)
Let f be the normal PDF with standard deviation sigma=100. let f(x; mu) be the PDF for N(mu, 100) evaluated at x.
You are currently evaluating
s=1 f(1; 1)
s=2 f(1; 2) f(2; 2)
s=3 f(1; 3) f(2; 3) f(3; 3)
s=4 f(1; 4) f(2; 4) f(3; 4) f(4; 4)
...
s=N f(1; N) f(2; N) f(3; N) ... f(N; N)
However, notice that f(1; 1) = f(2; 2) = f(3; 3) = .... = f(N; N).
Similarly, f(1; 2) = f(2; 3) = .... = f(N-1; N).
and f(1; 3) = f(2; 4) = ... = f(N-2; N).
In general, the triangular matrix above is banded because the values depend only on the distance between x and N.
If you reuse the PDF values, you can save a lot of time. On my computer, the original loop for N=8000 takes 3.4 seconds whereas the improved loop takes 0.2 seconds, (Only 5% of the original time.)
%let N = 8000;
data series;
do t=1 to &N;
x=rannor(1);
output;
end;
run;
proc iml;
use series;
read all var _all_;
close;
x2 = x##2; /* form x^2 one time */
/* original loop */
t0 = time();
s2 = j(1, &N, .);
do s=1 to &N;
w = pdf("norm",1:s,s,100);
s2[s] = w * x2[1:s] / sum(w);
end;
tLoop = time() - t0;
s0 = s2; /* save the oringal values */
/* new loop */
t0 = time();
w0 = pdf("norm", 1:&N, &N, 100); /* eval the PDF once */
do s = 1 to &N;
w = w0[ , (&N-s+1):&N];
s2[s] = w * x2[1:s] / sum(w);
end;
tVec = time() - t0;
print tLoop tVec;
print (max(abs(s0-s2)))[L="Max Diff"]; /* show that the answers are the same */
It is possible to represent these operations in terms of matrix multiplication, but if you have N observations the matrix will be NxN, which will get big. I suspect the main performance bottleneck is not the loop but the fact that you are evaluating the normal PDF so many times. For N observations, you are evaluating the PDF N(N+1)/2 times. This is unnecessary. You only need to evaluate the PDF at N locations. You can then use the properties of the normal PDF to reuse the values. (You also can compute x^2 once instead of multiple times.)
Let f be the normal PDF with standard deviation sigma=100. let f(x; mu) be the PDF for N(mu, 100) evaluated at x.
You are currently evaluating
s=1 f(1; 1)
s=2 f(1; 2) f(2; 2)
s=3 f(1; 3) f(2; 3) f(3; 3)
s=4 f(1; 4) f(2; 4) f(3; 4) f(4; 4)
...
s=N f(1; N) f(2; N) f(3; N) ... f(N; N)
However, notice that f(1; 1) = f(2; 2) = f(3; 3) = .... = f(N; N).
Similarly, f(1; 2) = f(2; 3) = .... = f(N-1; N).
and f(1; 3) = f(2; 4) = ... = f(N-2; N).
In general, the triangular matrix above is banded because the values depend only on the distance between x and N.
If you reuse the PDF values, you can save a lot of time. On my computer, the original loop for N=8000 takes 3.4 seconds whereas the improved loop takes 0.2 seconds, (Only 5% of the original time.)
%let N = 8000;
data series;
do t=1 to &N;
x=rannor(1);
output;
end;
run;
proc iml;
use series;
read all var _all_;
close;
x2 = x##2; /* form x^2 one time */
/* original loop */
t0 = time();
s2 = j(1, &N, .);
do s=1 to &N;
w = pdf("norm",1:s,s,100);
s2[s] = w * x2[1:s] / sum(w);
end;
tLoop = time() - t0;
s0 = s2; /* save the oringal values */
/* new loop */
t0 = time();
w0 = pdf("norm", 1:&N, &N, 100); /* eval the PDF once */
do s = 1 to &N;
w = w0[ , (&N-s+1):&N];
s2[s] = w * x2[1:s] / sum(w);
end;
tVec = time() - t0;
print tLoop tVec;
print (max(abs(s0-s2)))[L="Max Diff"]; /* show that the answers are the same */
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.