BookmarkSubscribeRSS Feed
John178
Calcite | Level 5

John178_0-1673310970637.png

 

 

I have  data  like this one

Calculate sum of 250 lags in SAS.

 I want to do sum of 250 previous rows for each row, starting from the row 250th.

X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)

or X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )

 

I also want to calculate sum of 250 next rows after each row.

I try to use lag function but it does not work for too many lags;

thank you for your help!

 

9 REPLIES 9
Astounding
PROC Star

There are a lot of devils and a lot of details.  For example, I would assume you mean starting with the 251st observation, not starting with the 250th observation.  Otherwise there can't be 250 previous values to sum.

 

Here's an approach that takes a long time to run.  It has to execute 500 SET statements for every observation.  I imagine this is time-consuming, but you can always try it and see.

 

Then there are details about cutting off the end of the data set once you have fewer than 250 "post" observations left to sum.

 

At any rate, I have some ideas in mind to cut down on the run time.  So if the problem is not solved by tomorrow I will revisit it and share more.  


Here is one basic (long-running) approach:

 

data want;
   set have nobs=totalobs;
   if 251 <= _n_ <= totalobs - 250;
   prior250 = 0;
   do obsno = _n_ - 250 to _n_ - 1;
      set have (keep=vwretd rename=(vwretd=prior_value)) point=obsno;
      prior250 + prior_value;
   end;
   post250 = 0;
   do obsno = _n_ + 1 to _n_ + 250;
      set have (keep=vwretd rename=(vwretd=post_value)) point=obsno;
      post250 + post_value;
   end;
   drop prior_value post_value obsno;
run;
Astounding
PROC Star

As others have mentioned, you can set up an array to track the prior 250 values.  That's probably the fastest way (don't know about the ETS solution). 

 

Here's a way to address the subsequent 250 values, using common tools.  It can be combined with a solution for the previous 250 values, but that would get kind of messy to present in the same DATA step.  So I leave to you the combination of prior and post.  Anyway, here's the way to get the post-250 sum.  It won't be that fast, but it won't be superslow either (like my original post was).

 

data want;
   pre250 = 0;
   post250 = 0;
retain pre250 post250; if _n_=1 then do until (done); set have (firstobs=251 obs=500 keep=vwretd) end=done; post250 + vwretd; end; set have nobs = totalobs; pre250 + vwretd; /* for clarity and focus, omit logic for calculating */ /* previous 250 sum but it could be re-inserted here */ if _n_ >= 251 and _n_ <= totalobs - 250 then do; post250 = post250 - vwretd; obsno = _n_ + 250; set have point=obsno (keep=vwretd rename = (vwretd = increment) ); post250 = post250 + increment; end; run;

It needs to execute a SET statement to get the value 251 observations down the road, and use it to replace the current value of VWRETD.

 

It's untested code since you have the data and we don't.  But it looks like it should work.

 

Good luck.

 

This solution WILL ALSO BE REVISED to make it run faster by eliminating point= ... as soon as I have the time to devote to it.  Definitely can be done.

 

In retrospect, the posted solution from @s_lassen is close enough to the revision I would have made.  So no more updates coming from me.

PGStats
Opal | Level 21

Here is an example using proc expand with 5 lags

 

data have(drop=i);
call streaminit(86589);
do i = 1 to 25;
    x = rand("uniform");
    output;
    end;
run;

proc expand data=have out=want(drop=TIME);
convert x=x_sum5 / transformout = (movsum 5 trimleft 4);
convert x=x_rev5 / transformout = (reverse movsum 5 reverse trimright 4);
run;

PGStats_0-1673322140220.png

proc expand is part of SAS/ETS, SAS econometrics and time series procedures.

PG
andreas_lds
Jade | Level 19

This should work, (EDIT:but only for the first part of your task):

 

data want;
   set have;
   
   length lag_sum v0 - v249 8;
   retain v0 - v249;
   array lags[0:249] v0 - v249;
   
   if _n_ > dim(lags) then do;
      lag_sum = sum(of lags[*]);
   end;
  
   lags[mod(_n_ - 1, dim(lags))] = value;
   
   drop v0 - v249;
run;

 

Kurt_Bremser
Super User

Create a temporary array with 250 elements.

Use mod(_n_,250) to determine which element to set.

Once _n_ is larger than 250, use

sum(of array{*})

to do the calculation.

 

For code, provide usable example data in a data step with datalines.

Tom
Super User Tom
Super User

You do not need to call LAG() 250 times.

Just build a running total and subtract the value that is being bumped off the back.

data want;
   set have;
   running_sum + VWRETD- sum(0,lag250(VWRETD));
   if _n_ >= 250 then x=running_sum;
run;   

To calculate for the following 250 values just sort the data in descending order and to the same thing.

Astounding
PROC Star

@Tom of course you are right for the previous 250.  Last time I read the documentation, LAG was limited to LAG100, and LAG250 was not possible.  At any rate, getting the subsequent 250 values is going to be the difficult part.  Even after two tries, I will need to revise my last solution for that to eliminate point=.

s_lassen
Meteorite | Level 14

Here is a solution based on the post by @andreas_lds , but expanded to include the look-ahead sums as well:

data want;
  array next_val (0:249) 8 _temporary_;
  array prev_val (0:249) 8 _temporary_;
  if _N_=1 then do k=1 to 249 while (not done);
    set have end=done;
    next_val(k)=vwretd;
    end;
  if _N_<=nobs-249 then do;
    set have(keep=vwretd firstobs=250);
    next_val(mod(_N_-1,250))=vwretd;
    next_sum=sum(of next_val(*));
    end;
  set have nobs=nobs;
  prev_val(mod(_N_,250))=vwretd;
  if _N_>=250 then
    prev_sum=sum(of prev_val(*));
  drop k;
run;

The solution is made on the assumption that you want the current observation included in both the look-ahead and the look-behind, otherwise you will have to adjust.

mkeintz
PROC Star

@Tom's solution structure is the most appropriate for this problem.  

 

My question is this.  You are using variable VWRETD, suggesting to me that you are using CRSP data, in particular value-weighted returns with dividends.

 

Your stated request is for a rolling 250-period sum of VWRETD, with each rolling window ending at the immediately preceding observation.  But instead of the sum, do you actually want the 250-period compounded return?  If so, then you really want to add the logs of VWRETD+1 and then take the antilog (and subtract 1) to get the compound return.  That would look like the below: 

 

data want (drop=_:);
  set have;
  retain _roll_sum_log_ret_plus1 0;

  rolling_preceding_250_return = exp(_roll_sum_log_ret_plus1)-1;

  _roll_sum_log_ret_plus1 = _roll_sum_log_ret_plus1 + log(1+vwretd) -1*sum(0,log(lag250(1+vwretd)));

  if _n>250 ;
run;

Note that the sum of the logs is updated after the compound return is calculated.  This means the calculated compound return does not include the current observation.

 

There can be a problem with this technique.  This program, like @Tom's, constantly updates a rolling total, which means any numeric precision issues, once encountered, can be preserved in subsequent rolling windows, even non-overlapping windows.  Solutions that sum arrays of 250 do not have that property.  But this solution is much faster, and I doubt that the CRSP data is so pathological as to cause more than a minor difference in results.   You can test this assertion by running the data step on your entire dataset, and then only using the last 251 (at least) observations in the dataset.  Compare results for corresponding dates.  Note this is NOT a SAS problem - it is a numeric precision problem shared by all digital computers.

 

Editted note:  I forgot that you also want the next 250 observations.  I.e. for observation 251, you want the compounded return for obs 1 through 250, and also the compounded return for obs 252 through 501.  Here's a program that does so:

 


data want (drop=_:);
  if 0 then set have nobs=nperiods ;

  merge have  have (firstobs=252 keep=vwretd rename=(vwretd=_wretd));

  retain _roll_sum_log_ret_plus1 _roll_nxt_log_ret_plus1 0;

  rolling_preceding_250_return = exp(_roll_sum_log_ret_plus1) - 1;
  rolling_following_250_return = exp(_roll_nxt_log_ret_plus1) - 1;

  _roll_sum_log_ret_plus1 = _roll_sum_log_ret_plus1 + log(1+vwretd) -1*sum(0,log(lag250(1+vwretd)));
  _roll_nxt_log_ret_plus1 = _roll_nxt_log_ret_plus1 + log(1+_wretd) -1*sum(0,log(lag250(1+_wretd)));

  if 250< _n_ < nperiods-250;
run;

This produces results only for those dates that have both a complete window for the preceding 250 observations (i.e. _N_>250) and a complete window for the following 250 observations (_N_ < nperiods-250).  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2204 views
  • 8 likes
  • 8 in conversation