I have a time-series variable x as follows.
data series;
do t=1 to 5000;
x=rannor(1);
output;
end;
run;
For each t, I want to estimate x's second moment using observations from 1 to t and a normal kernel as follows.
proc iml;
use series;
read all var _all_;
close;
s2=1:5000;
do s=1 to 5000;
s2[s]=pdf("norm",1:s,s,100)*x[1:s]##2/sum(pdf("norm",1:s,s,100));
end;
create series var{t x s2};
append;
close;
quit;
So I load series and apply this recursive loop since the second moment at t requires observations from 1 to t (x[1:s]) as well as the kernel weighting information (pdf("norm",1:s,s,100)). Is there any way to wisely alter this do loop by its equivalent vectorized method? This way takes too much time for a large T>5000. Thanks.
It is possible to represent these operations in terms of matrix multiplication, but if you have N observations the matrix will be NxN, which will get big. I suspect the main performance bottleneck is not the loop but the fact that you are evaluating the normal PDF so many times. For N observations, you are evaluating the PDF N(N+1)/2 times. This is unnecessary. You only need to evaluate the PDF at N locations. You can then use the properties of the normal PDF to reuse the values. (You also can compute x^2 once instead of multiple times.)
Let f be the normal PDF with standard deviation sigma=100. let f(x; mu) be the PDF for N(mu, 100) evaluated at x.
You are currently evaluating
s=1 f(1; 1)
s=2 f(1; 2) f(2; 2)
s=3 f(1; 3) f(2; 3) f(3; 3)
s=4 f(1; 4) f(2; 4) f(3; 4) f(4; 4)
...
s=N f(1; N) f(2; N) f(3; N) ... f(N; N)
However, notice that f(1; 1) = f(2; 2) = f(3; 3) = .... = f(N; N).
Similarly, f(1; 2) = f(2; 3) = .... = f(N-1; N).
and f(1; 3) = f(2; 4) = ... = f(N-2; N).
In general, the triangular matrix above is banded because the values depend only on the distance between x and N.
If you reuse the PDF values, you can save a lot of time. On my computer, the original loop for N=8000 takes 3.4 seconds whereas the improved loop takes 0.2 seconds, (Only 5% of the original time.)
%let N = 8000;
data series;
do t=1 to &N;
x=rannor(1);
output;
end;
run;
proc iml;
use series;
read all var _all_;
close;
x2 = x##2; /* form x^2 one time */
/* original loop */
t0 = time();
s2 = j(1, &N, .);
do s=1 to &N;
w = pdf("norm",1:s,s,100);
s2[s] = w * x2[1:s] / sum(w);
end;
tLoop = time() - t0;
s0 = s2; /* save the oringal values */
/* new loop */
t0 = time();
w0 = pdf("norm", 1:&N, &N, 100); /* eval the PDF once */
do s = 1 to &N;
w = w0[ , (&N-s+1):&N];
s2[s] = w * x2[1:s] / sum(w);
end;
tVec = time() - t0;
print tLoop tVec;
print (max(abs(s0-s2)))[L="Max Diff"]; /* show that the answers are the same */
It is possible to represent these operations in terms of matrix multiplication, but if you have N observations the matrix will be NxN, which will get big. I suspect the main performance bottleneck is not the loop but the fact that you are evaluating the normal PDF so many times. For N observations, you are evaluating the PDF N(N+1)/2 times. This is unnecessary. You only need to evaluate the PDF at N locations. You can then use the properties of the normal PDF to reuse the values. (You also can compute x^2 once instead of multiple times.)
Let f be the normal PDF with standard deviation sigma=100. let f(x; mu) be the PDF for N(mu, 100) evaluated at x.
You are currently evaluating
s=1 f(1; 1)
s=2 f(1; 2) f(2; 2)
s=3 f(1; 3) f(2; 3) f(3; 3)
s=4 f(1; 4) f(2; 4) f(3; 4) f(4; 4)
...
s=N f(1; N) f(2; N) f(3; N) ... f(N; N)
However, notice that f(1; 1) = f(2; 2) = f(3; 3) = .... = f(N; N).
Similarly, f(1; 2) = f(2; 3) = .... = f(N-1; N).
and f(1; 3) = f(2; 4) = ... = f(N-2; N).
In general, the triangular matrix above is banded because the values depend only on the distance between x and N.
If you reuse the PDF values, you can save a lot of time. On my computer, the original loop for N=8000 takes 3.4 seconds whereas the improved loop takes 0.2 seconds, (Only 5% of the original time.)
%let N = 8000;
data series;
do t=1 to &N;
x=rannor(1);
output;
end;
run;
proc iml;
use series;
read all var _all_;
close;
x2 = x##2; /* form x^2 one time */
/* original loop */
t0 = time();
s2 = j(1, &N, .);
do s=1 to &N;
w = pdf("norm",1:s,s,100);
s2[s] = w * x2[1:s] / sum(w);
end;
tLoop = time() - t0;
s0 = s2; /* save the oringal values */
/* new loop */
t0 = time();
w0 = pdf("norm", 1:&N, &N, 100); /* eval the PDF once */
do s = 1 to &N;
w = w0[ , (&N-s+1):&N];
s2[s] = w * x2[1:s] / sum(w);
end;
tVec = time() - t0;
print tLoop tVec;
print (max(abs(s0-s2)))[L="Max Diff"]; /* show that the answers are the same */
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.
Find more tutorials on the SAS Users YouTube channel.