BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Zhongda
Fluorite | Level 6

Hi all,

I am working for question similar to moving average.

for the first obs #1 value 14, I would like to get the number for average for next three values as (6+3+5)/3

which means a [+1,+3} window for the average.

I looked some of the proc expand command but I did not get a solution.

Thanks for the help.

Zhongda

here are the data

# value average for next three values
1 14 4.666666667 =AVERAGE(B3:B5)
2 6 7.333333333 =AVERAGE(B4:B6)
3 3 7.666666667 =AVERAGE(B5:B7)
4 5 11.66666667 =AVERAGE(B6:B8)
5 14 9 =AVERAGE(B7:B9)
6 4 11.33333333 =AVERAGE(B8:B10)
7 17 10.33333333 =AVERAGE(B9:B11)
8 6
9 11
10 14

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Is that 1 to 10 actually in the data or just part of an example? If not in the data then one approach is to add an order variable like that, sort the data into reverse order, us the LAG function to get values to calculate the mean. The SAS LAG function looks at previous values but there is no automatic look ahead.

Don't include spreadsheet cell references as they confuse the issue. Do include example data of what you actually have in the form of a working data step.

 

data have;
  input v;
  order = _n_;
datalines;
14
6 
3 
5 
14
4 
17
6
11
14
;

proc sort data=have;
by descending order;
run;

data want;
   set have;
   l1=lag1(v);
   l2=lag2(v);
   l3=lag3(v);
   mean = (l1+l2+l3)/3;
   drop l1-l3;
run;

proc sort data=want;
   by order;
run;


I used the manual mean calculation as your example seems to imply that you do not want the mean unless all three values are non-missing.

View solution in original post

12 REPLIES 12
ballardw
Super User

Is that 1 to 10 actually in the data or just part of an example? If not in the data then one approach is to add an order variable like that, sort the data into reverse order, us the LAG function to get values to calculate the mean. The SAS LAG function looks at previous values but there is no automatic look ahead.

Don't include spreadsheet cell references as they confuse the issue. Do include example data of what you actually have in the form of a working data step.

 

data have;
  input v;
  order = _n_;
datalines;
14
6 
3 
5 
14
4 
17
6
11
14
;

proc sort data=have;
by descending order;
run;

data want;
   set have;
   l1=lag1(v);
   l2=lag2(v);
   l3=lag3(v);
   mean = (l1+l2+l3)/3;
   drop l1-l3;
run;

proc sort data=want;
   by order;
run;


I used the manual mean calculation as your example seems to imply that you do not want the mean unless all three values are non-missing.

Zhongda
Fluorite | Level 6

Hi Ballardw, 

 

Thank you so much for your solution and it perfectly works for a [+1,+3] window. I would like to ask what if I want to calculate another average for window [-3,-1] for each observation? This means, for the first three obs in the sample, this is NA and it should start from the 4th obs and the value will be calculated as (14+6+3)/3

thanks

 

14
6 
3 
5 
14
4 
17
6
11
14

 

 

Reeza
Super User
Repeat the process BEFORE the backwards sort to get the values you want.
mkeintz
PROC Star

@Zhongda wrote:

Hi Ballardw, 

 

Thank you so much for your solution and it perfectly works for a [+1,+3] window. I would like to ask what if I want to calculate another average for window [-3,-1] for each observation? This means, for the first three obs in the sample, this is NA and it should start from the 4th obs and the value will be calculated as (14+6+3)/3

thanks


If you want to do the average of the 3 preceding obs as well as the 3 following obs, then you can use a combination of (1) lag functions,  (2) firstobs= options, and (3) a conditional merge statement:

 

data have;
  input v  @@ ;
datalines;
14    6    3    5   14
 4   17    6   11   14
;

data want ;
  set have;
  
  if end_of_leads=0 then 
    merge have (firstobs=2 keep=v rename=(v=_lead1))
          have (firstobs=3 keep=v rename=(v=_lead2))
          have (firstobs=4 keep=v rename=(v=_lead3))
    end=end_of_leads ;
  else call missing(of _lead:);

  average_preceding=ifn(_n_>=4,mean(lag(v),lag2(v),lag3(v)),.);
  if nmiss(of _lead:)=0 then average_following=mean(of _lead:);
run;

Note there is a significant difference between using

 average_preceding=ifn(_n_>=4,mean(lag(v),lag2(v),lag3(v)),.);

and the (likely more intuitive, but wrong)

if _n_>=4 then average_preceding=mean(lag(v),lag2(v),lag3(v));

That's because the lag functions don't look back like an excel spreadsheet, but rather update queues of variable V.  And you need to update the queue for every obs, even though you won't use those queue values for observations 1 through 3.   The IF statements would fail to update the queues, while the IFN function will update the queue even if the queue values are not chosen iby the IFN condition.   But the IFN approach will make the results for observation 4 as expected.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

I see this problem is marked as solved, but there is a far more efficient way to get lead values without the triple step of sorting, generating lags, and then resorting.

 

You do this as a single step using the firstobs= option for dataset names, and a conditional merge or conditional set statement:

 

data have;
  input v  @@ ;
datalines;
14    6    3    5   14
 4   17    6   11   14
;

data want (drop=_:);
  set have;
  
  if end_of_leads=0 then 
    merge have (firstobs=2 keep=v rename=(v=_lead1))
          have (firstobs=3 keep=v rename=(v=_lead2))
          have (firstobs=4 keep=v rename=(v=_lead3))
          end=end_of_leads ;
  else call missing(of _lead:);

  if nmiss(of _lead:)=0 then average=mean(of _lead:);
run;

This can save a lot of resources when dealing with large datasets.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Zhongda
Fluorite | Level 6

Hi mkeintz, 

Thanks for bringing this issue out.

The code you provide is good but one thing for my working project is that there are different identifier in different years. While I am using proc sort, I can sort the data into group according to each id and year.

then I can apply the ''lag method'' within each group.

I would like to know how can we achieve this in the method you provide.

Here are the sample dataset, and thank you for your time.

my object is to calculate each id a [+1,+3] window and a [-3,-1] window for the average value.

id year value
1 1990 42
1 1991 16
1 1992 25
1 1993 20
1 1994 34
1 1995 44
1 1996 14
1 1997 35
1 1998 42
1 1999 37
2 1990 36
2 1991 21
2 1992 12
2 1993 40
2 1994 32
2 1995 36
2 1996 31
2 1997 43
2 1998 37
2 1999 33
3 1990 36
3 1991 33
3 1992 30
3 1993 20
3 1994 15
3 1995 34
3 1996 36
3 1997 20
3 1998 37
3 1999 40
mkeintz
PROC Star

It's better to completely define the problem at the start - otherwise there can be mission-creep-fatigue.

 

BUT ... this is a common problem - getting leads and leads within ID, from a sorted dataset.

 

data have;
  input id year v  @@;
datalines;
1 1990 42     1 1991 16     1 1992 25     1 1993 20     
1 1994 34     1 1995 44     1 1996 14     1 1997 35     
1 1998 42     1 1999 37     2 1990 36     2 1991 21     
2 1992 12     2 1993 40     2 1994 32     2 1995 36     
2 1996 31     2 1997 43     2 1998 37     2 1999 33     
3 1990 36     3 1991 33     3 1992 30     3 1993 20     
3 1994 15     3 1995 34     3 1996 36     3 1997 20     
3 1998 37     3 1999 40     
run;

data want ;
  set have;
  by id;
  _seq+1;
  if first.id then _seq=1;
  
  if end_of_leads=0 then 
    merge have (firstobs=2 keep=   v rename=(v=_lead1))
          have (firstobs=3 keep=   v rename=(v=_lead2))
          have (firstobs=4 keep=id v rename=(id=_id3 v=_lead3))
    end=end_of_leads ;
  else call missing(of _lead:);
  if _id3^=id then call missing(of _lead:);

  average_preceding=ifn(_seq>=4,mean(lag(v),lag2(v),lag3(v)),.);
  if nmiss(of _lead:)=0 then average_following=mean(of _lead:);
run;

The technique for keeping LAGs only from the same ID is uses the _SEQ variable rather than automatic variable _N_.   For keeping LEADs within the same ID, you have to check the ID of the farther lead.

 

Note the BY statement applies only to the SET statement preceding it.  There is no BY attached to the MERGE statement.  This is essential, because a MERGE with BY will effectively remove the benefit of using the different FIRSTOBS values, once the second ID is encountered.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sbxkoenk
SAS Super FREQ

As the problem is solved already I am not going to work on some example code, but if you have SAS/ETS , I would do this using PROC EXPAND or (when specifications are more difficult) PROC TIMEDATA.

 

Looking ahead with PROC TIMEDATA is really easy !

 

Koen

mkeintz
PROC Star

@sbxkoenk wrote:

As the problem is solved already I am not going to work on some example code, but if you have SAS/ETS , I would do this using PROC EXPAND or (when specifications are more difficult) PROC TIMEDATA.

 

Looking ahead with PROC TIMEDATA is really easy !

 

Koen


Looking ahead with the DATA step isn't particularly hard either, at least in this case.  And it offers a lot more flexibility than proc expand (except for smoothing).

 

IMO, the main deficiency with PROC EXPAND is that it only does univariate statistics (means, medians, max, std, etc. for single variables).  it doesn't even generate a sums-of-cross-products for a rolling window, which means it is no good for direct application to rolling window regressions - a major disappointment.

 

Does PROC TIMEDATA address that issue?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sbxkoenk
SAS Super FREQ

@mkeintz wrote: Does PROC TIMEDATA address that issue?

Yes.
Have a look at this example :

https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/etsug/etsug_timedata_examples03.htm

 

In this block of code :

   do i=1 to _length_;
      oilshare[i] = oil[i] / roil[i];
      gasshare[i] = gas[i] / rgas[i];
   end;

, you can use [i] , but also [i+7] or [i-4].
The full time series (within the current by-group) is in-memory, so you jump back and forth in the time series.

Also, you can make an expression which involves value[i-2] for one variable and value[i+2] for another variable.

 

Koen

mkeintz
PROC Star
Good to know. Thx.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 1801 views
  • 0 likes
  • 5 in conversation