Junyong
Pyrite | Level 9

I have a time-series variable x as follows.

data series;
   do t = 1 to 5000;
      x = rannor(1);
      output;
   end;
run;

For each t, I want to estimate x's second moment using observations from 1 to t and a normal kernel as follows.

proc iml;
use series;
read all var _all_;
close;

s2 = 1:5000;                      /* allocate the 1 x 5000 result vector */
do s = 1 to 5000;
   s2[s] = pdf("norm",1:s,s,100) * x[1:s]##2 / sum(pdf("norm",1:s,s,100));
end;

create series var{t x s2};
append;
close;
quit;

So I load series and apply this recursive loop, since the second moment at t requires the observations from 1 to t (x[1:s]) as well as the kernel weights (pdf("norm",1:s,s,100)). Is there a way to replace this DO loop with an equivalent vectorized computation? This approach takes too much time for large T > 5000. Thanks.

1 ACCEPTED SOLUTION

Rick_SAS
SAS Super FREQ

It is possible to represent these operations in terms of matrix multiplication, but if you have N observations the matrix will be NxN, which will get big.  I suspect the main performance bottleneck is not the loop but the fact that you are evaluating the normal PDF so many times. For N observations, you are evaluating the PDF N(N+1)/2 times.  This is unnecessary. You only need to evaluate the PDF at N locations. You can then use the properties of the normal PDF to reuse the values. (You also can compute x^2 once instead of multiple times.)
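For reference, here is a minimal sketch of that NxN matrix-multiplication formulation. This code is illustrative and not part of the original reply; the matrix names and the value N = 2000 are arbitrary. Each row s of W holds the kernel weights for times 1..s, padded with zeros, so the whole computation becomes one matrix-vector product. It needs O(N^2) memory, which is why reusing the PDF values, as described below, scales better.

proc iml;
N = 2000;                            /* keep N modest: the weight matrix is N x N   */
call randseed(1);
x = j(N, 1);
call randgen(x, "Normal");           /* simulate a series, as in the question       */
x2 = x##2;                           /* squared observations, computed once         */

i = repeat(1:N, N, 1);               /* i[s,k] = k : the evaluation points 1..N     */
m = repeat(T(1:N), 1, N);            /* m[s,k] = s : the mean (current time point)  */
W = pdf("Normal", i, m, 100) # (i <= m);   /* lower-triangular kernel weight matrix */

s2 = (W * x2) / W[ ,+];              /* row s: weighted mean of x2[1:s]             */
quit;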

 

Let f be the normal PDF with standard deviation sigma=100, and let f(x; mu) denote the PDF of N(mu, sigma=100) evaluated at x.
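In this notation, the quantity computed inside the loop at each s is the kernel-weighted second moment (this is just a restatement of the code in the question):

   s2[s] = sum_{i=1..s} f(i; s) * x_i^2  /  sum_{i=1..s} f(i; s)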

You are currently evaluating

s=1   f(1; 1)

s=2   f(1; 2)  f(2; 2)

s=3   f(1; 3)  f(2; 3)  f(3; 3)

s=4   f(1; 4)  f(2; 4)  f(3; 4)  f(4; 4)

...

s=N   f(1; N)  f(2; N)  f(3; N)  ...     f(N; N)

 

 

However, notice that f(1; 1) = f(2; 2) = f(3; 3) = .... = f(N; N).

Similarly, f(1; 2) = f(2; 3) = .... = f(N-1; N).

and f(1; 3) = f(2; 4) = ... = f(N-2; N).

In general, the triangular array above is constant along each diagonal because the value of the normal PDF depends only on the distance between the evaluation point and the mean s.
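A quick way to convince yourself of this identity (a small check added here; it is not in the original reply) is to compare the weights for some s with the trailing elements of the weights for s = N:

proc iml;
N = 10;  s = 4;                                    /* small illustrative values     */
wFull = pdf("Normal", 1:N, N, 100);                /* weights for the last point N  */
wS    = pdf("Normal", 1:s, s, 100);                /* weights computed directly     */
print (max(abs(wS - wFull[ , (N-s+1):N])))[L="Max Diff"];   /* expect 0             */
quit;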

 

If you reuse the PDF values, you can save a lot of time. On my computer, the original loop for N=8000 takes 3.4 seconds, whereas the improved loop takes 0.2 seconds (only about 5% of the original time).

 

 

%let N = 8000;

data series;
   do t = 1 to &N;
      x = rannor(1);
      output;
   end;
run;


proc iml;
use series;
read all var _all_;
close;

x2 = x##2;      /* form x^2 one time */

/* original loop */
t0 = time();
s2 = j(1, &N, .);
do s=1 to &N;
   w = pdf("norm",1:s,s,100);
   s2[s] = w * x2[1:s] / sum(w);
end;
tLoop = time() - t0;
s0 = s2;     /* save the original values */

/* new loop */
t0 = time();
w0 = pdf("norm", 1:&N, &N, 100);  /* eval the PDF once */
do s = 1 to &N;
   w = w0[ , (&N-s+1):&N];
   s2[s] = w * x2[1:s] / sum(w);
end;
tVec = time() - t0;
print tLoop tVec;

print (max(abs(s0-s2)))[L="Max Diff"];  /* show that the answers are the same */
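If, as in the original question, you also want to write the running estimates back to a data set, you can append them at the end of the same PROC IML session. This is a sketch based on the CREATE/APPEND statements in the question; the data set name series2 is illustrative.

create series2 var{t x s2};   /* t, x are column vectors; s2 is a 1 x N row vector */
append;
close;
quit;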

 

 

