Solved: Re: calculate the average for a [+1,+3} and a [-3,-1] window

Zhongda · Posted 09-11-2023 12:21 PM

Hi all,

I am working for question similar to moving average.

for the first obs #1 value 14, I would like to get the number for average for next three values as (6+3+5)/3

which means a [+1,+3} window for the average.

I looked some of the proc expand command but I did not get a solution.

Thanks for the help.

Zhongda

here are the data

# value average for next three values
1 14 4.666666667 =AVERAGE(B3:B5)
2 6 7.333333333 =AVERAGE(B4:B6)
3 3 7.666666667 =AVERAGE(B5:B7)
4 5 11.66666667 =AVERAGE(B6:B8)
5 14 9 =AVERAGE(B7:B9)
6 4 11.33333333 =AVERAGE(B8:B10)
7 17 10.33333333 =AVERAGE(B9:B11)
8 6
9 11
10 14

ballardw · Posted 09-11-2023 12:49 PM

Is that 1 to 10 actually in the data or just part of an example? If not in the data then one approach is to add an order variable like that, sort the data into reverse order, us the LAG function to get values to calculate the mean. The SAS LAG function looks at previous values but there is no automatic look ahead.

Don't include spreadsheet cell references as they confuse the issue. Do include example data of what you actually have in the form of a working data step.

data have;
  input v;
  order = _n_;
datalines;
14
6 
3 
5 
14
4 
17
6
11
14
;

proc sort data=have;
by descending order;
run;

data want;
   set have;
   l1=lag1(v);
   l2=lag2(v);
   l3=lag3(v);
   mean = (l1+l2+l3)/3;
   drop l1-l3;
run;

proc sort data=want;
   by order;
run;

I used the manual mean calculation as your example seems to imply that you do not want the mean unless all three values are non-missing.

View solution in original post

ballardw · Posted 09-11-2023 12:49 PM

Is that 1 to 10 actually in the data or just part of an example? If not in the data then one approach is to add an order variable like that, sort the data into reverse order, us the LAG function to get values to calculate the mean. The SAS LAG function looks at previous values but there is no automatic look ahead.

Don't include spreadsheet cell references as they confuse the issue. Do include example data of what you actually have in the form of a working data step.

data have;
  input v;
  order = _n_;
datalines;
14
6 
3 
5 
14
4 
17
6
11
14
;

proc sort data=have;
by descending order;
run;

data want;
   set have;
   l1=lag1(v);
   l2=lag2(v);
   l3=lag3(v);
   mean = (l1+l2+l3)/3;
   drop l1-l3;
run;

proc sort data=want;
   by order;
run;

I used the manual mean calculation as your example seems to imply that you do not want the mean unless all three values are non-missing.

Zhongda · Posted 09-11-2023 01:04 PM

Hi Ballardw,

Thank you so much for your solution and it perfectly works for a [+1,+3] window. I would like to ask what if I want to calculate another average for window [-3,-1] for each observation? This means, for the first three obs in the sample, this is NA and it should start from the 4th obs and the value will be calculated as (14+6+3)/3

thanks

Reeza · Posted 09-11-2023 01:41 PM

Repeat the process BEFORE the backwards sort to get the values you want.

Zhongda · Posted 09-11-2023 11:24 PM

Thank you!

mkeintz · Posted 09-12-2023 12:23 AM

@Zhongda wrote:

Hi Ballardw,

Thank you so much for your solution and it perfectly works for a [+1,+3] window. I would like to ask what if I want to calculate another average for window [-3,-1] for each observation? This means, for the first three obs in the sample, this is NA and it should start from the 4th obs and the value will be calculated as (14+6+3)/3

thanks

If you want to do the average of the 3 preceding obs as well as the 3 following obs, then you can use a combination of (1) lag functions, (2) firstobs= options, and (3) a conditional merge statement:

data have;
  input v  @@ ;
datalines;
14    6    3    5   14
 4   17    6   11   14
;

data want ;
  set have;
  
  if end_of_leads=0 then 
    merge have (firstobs=2 keep=v rename=(v=_lead1))
          have (firstobs=3 keep=v rename=(v=_lead2))
          have (firstobs=4 keep=v rename=(v=_lead3))
    end=end_of_leads ;
  else call missing(of _lead:);

  average_preceding=ifn(_n_>=4,mean(lag(v),lag2(v),lag3(v)),.);
  if nmiss(of _lead:)=0 then average_following=mean(of _lead:);
run;

Note there is a significant difference between using

 average_preceding=ifn(_n_>=4,mean(lag(v),lag2(v),lag3(v)),.);

and the (likely more intuitive, but wrong)

if _n_>=4 then average_preceding=mean(lag(v),lag2(v),lag3(v));

That's because the lag functions don't look back like an excel spreadsheet, but rather update queues of variable V. And you need to update the queue for every obs, even though you won't use those queue values for observations 1 through 3. The IF statements would fail to update the queues, while the IFN function will update the queue even if the queue values are not chosen iby the IFN condition. But the IFN approach will make the results for observation 4 as expected.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 09-11-2023 11:58 PM

I see this problem is marked as solved, but there is a far more efficient way to get lead values without the triple step of sorting, generating lags, and then resorting.

You do this as a single step using the firstobs= option for dataset names, and a conditional merge or conditional set statement:

data have;
  input v  @@ ;
datalines;
14    6    3    5   14
 4   17    6   11   14
;

data want (drop=_:);
  set have;
  
  if end_of_leads=0 then 
    merge have (firstobs=2 keep=v rename=(v=_lead1))
          have (firstobs=3 keep=v rename=(v=_lead2))
          have (firstobs=4 keep=v rename=(v=_lead3))
          end=end_of_leads ;
  else call missing(of _lead:);

  if nmiss(of _lead:)=0 then average=mean(of _lead:);
run;

This can save a lot of resources when dealing with large datasets.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Zhongda · Posted 09-12-2023 12:35 AM

Hi mkeintz,

Thanks for bringing this issue out.

The code you provide is good but one thing for my working project is that there are different identifier in different years. While I am using proc sort, I can sort the data into group according to each id and year.

then I can apply the ''lag method'' within each group.

I would like to know how can we achieve this in the method you provide.

Here are the sample dataset, and thank you for your time.

my object is to calculate each id a [+1,+3] window and a [-3,-1] window for the average value.

id year value

1 1990 42

1 1991 16

1 1992 25

1 1993 20

1 1994 34

1 1995 44

1 1996 14

1 1997 35

1 1998 42

1 1999 37

2 1990 36

2 1991 21

2 1992 12

2 1993 40

2 1994 32

2 1995 36

2 1996 31

2 1997 43

2 1998 37

2 1999 33

3 1990 36

3 1991 33

3 1992 30

3 1993 20

3 1994 15

3 1995 34

3 1996 36

3 1997 20

3 1998 37

3 1999 40

mkeintz · Posted 09-12-2023 11:52 AM

It's better to completely define the problem at the start - otherwise there can be mission-creep-fatigue.

BUT ... this is a common problem - getting leads and leads within ID, from a sorted dataset.

data have;
  input id year v  @@;
datalines;
1 1990 42     1 1991 16     1 1992 25     1 1993 20     
1 1994 34     1 1995 44     1 1996 14     1 1997 35     
1 1998 42     1 1999 37     2 1990 36     2 1991 21     
2 1992 12     2 1993 40     2 1994 32     2 1995 36     
2 1996 31     2 1997 43     2 1998 37     2 1999 33     
3 1990 36     3 1991 33     3 1992 30     3 1993 20     
3 1994 15     3 1995 34     3 1996 36     3 1997 20     
3 1998 37     3 1999 40     
run;

data want ;
  set have;
  by id;
  _seq+1;
  if first.id then _seq=1;
  
  if end_of_leads=0 then 
    merge have (firstobs=2 keep=   v rename=(v=_lead1))
          have (firstobs=3 keep=   v rename=(v=_lead2))
          have (firstobs=4 keep=id v rename=(id=_id3 v=_lead3))
    end=end_of_leads ;
  else call missing(of _lead:);
  if _id3^=id then call missing(of _lead:);

  average_preceding=ifn(_seq>=4,mean(lag(v),lag2(v),lag3(v)),.);
  if nmiss(of _lead:)=0 then average_following=mean(of _lead:);
run;

The technique for keeping LAGs only from the same ID is uses the _SEQ variable rather than automatic variable _N_. For keeping LEADs within the same ID, you have to check the ID of the farther lead.

Note the BY statement applies only to the SET statement preceding it. There is no BY attached to the MERGE statement. This is essential, because a MERGE with BY will effectively remove the benefit of using the different FIRSTOBS values, once the second ID is encountered.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sbxkoenk · Posted 09-12-2023 12:41 PM

As the problem is solved already I am not going to work on some example code, but if you have SAS/ETS , I would do this using PROC EXPAND or (when specifications are more difficult) PROC TIMEDATA.

Looking ahead with PROC TIMEDATA is really easy !

Koen

mkeintz · Posted 09-12-2023 08:11 PM

@sbxkoenk wrote:

As the problem is solved already I am not going to work on some example code, but if you have SAS/ETS , I would do this using PROC EXPAND or (when specifications are more difficult) PROC TIMEDATA.

Looking ahead with PROC TIMEDATA is really easy !

Koen

Looking ahead with the DATA step isn't particularly hard either, at least in this case. And it offers a lot more flexibility than proc expand (except for smoothing).

IMO, the main deficiency with PROC EXPAND is that it only does univariate statistics (means, medians, max, std, etc. for single variables). it doesn't even generate a sums-of-cross-products for a rolling window, which means it is no good for direct application to rolling window regressions - a major disappointment.

Does PROC TIMEDATA address that issue?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sbxkoenk · Posted 09-13-2023 07:19 PM

@mkeintz wrote: Does PROC TIMEDATA address that issue?

Yes.
Have a look at this example :

https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/etsug/etsug_timedata_examples03.htm

In this block of code :

   do i=1 to _length_;
      oilshare[i] = oil[i] / roil[i];
      gasshare[i] = gas[i] / rgas[i];
   end;

, you can use [i] , but also [i+7] or [i-4].
The full time series (within the current by-group) is in-memory, so you jump back and forth in the time series.

Also, you can make an expression which involves value[i-2] for one variable and value[i+2] for another variable.

Koen

mkeintz · Posted 09-13-2023 07:33 PM

Good to know. Thx.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now