Creating a 5 month rolling average - vertically or horizontally (obser...

eroka · Posted 09-14-2018 11:39 AM

I have a data set with 3 variables:

1. Regions - a list of 20

2. Months - the data set has one entry per region per calendar month starting in 2001

3. Index Rate - Each Region each month has one index rate.

Instead of using the index rate per month per region, I need to use instead an average of the index rate from the previous two months and the next two months within the data set.

E.g.

For the Region London for March 2017 I need to assign that month not the index rate that is applicable for March 2017, but an average number for the index rates I have for Jan, Feb, March, April and May 2017.

For April 2017 I would need Feb to June 2017, and so forth.

As I get close to the present and I do not have 2 months in the future I will use the last month's index rate one or twice (i.e. for August 2018 I need to use Jule, August and Sept 3 times).

I can do this vertically (i.e. a simple 3 column data set - and then i would need to go up and down the month range) - or if I transpose the data - then with Region on the left column and months as the top headers - it will be horizontally.

The data looks like this:

Date	Region_Name	Index
01/01/2016	City of London	94.01813825
01/02/2016	City of London	97.70019819
01/03/2016	City of London	101.4148808
01/04/2016	City of London	102.7631184
01/05/2016	City of London	107.900446
01/06/2016	City of London	115.8611285
01/07/2016	City of London	116.3539602
01/08/2016	City of London	108.2436922
01/09/2016	City of London	99.59944053
01/10/2016	City of London	101.1481822
01/11/2016	City of London	104.4657271
01/12/2016	City of London	103.5881838
01/01/2017	City of London	104.6752866
01/02/2017	City of London	104.7212834
01/03/2017	City of London	108.9676798
01/04/2017	City of London	115.6290033
01/05/2017	City of London	119.9377345
01/06/2017	City of London	118.7978333

Or like this:

Index_Report_Month	Region	M2018_06	M2018_05	M2018_04	M2018_03	M2018_02	M2018_01
2018-06	BEDS	136.9514	135.1143	135.4214	135.3575	135.7162	134.7327
2018-06	BERKS	123.3576	122.1503	121.1386	121.7117	121.7369	121.9318
2018-06	BRISTOL	129.4606	128.7055	129.0354	129.9676	130.9927	130.939
2018-06	BUCKS	122.2244	121.4971	121.595	122.5949	124.2265	124.7186
2018-06	CAMBS	124.9302	124.736	123.9808	123.4378	123.2373	123.1942
2018-06	CHESHIRE	112.9974	113.5666	113.6045	112.9975	114.1681	114.6474
2018-06	CORNWALL	114.9373	113.9692	112.5068	113.4169	113.5313	113.1154

Can someone please advise how this can be done?

I assume I need to use RETAIN and a macro loop but I am not sure how to do this "moving means", which in excel would be a fairly simple matter.

Much appreciated!

Edo

HB · Posted 09-14-2018 04:06 PM

So the 1/7/2016 figure for London is the average of 107.900446, 115.8611285, 116.3539602, 108.2436922, and 99.59944053?

eroka · Posted 09-17-2018 05:09 AM

That is correct!

Ksharp · Posted 09-15-2018 10:30 AM

Here is an example :

proc sql;
create table want as
 select *,(select sum(index) from have where date between a.date and a.date+30) as rolling_sum
  from have as a;
quit;

mkeintz · Posted 09-15-2018 05:11 PM

Question 1: It appears that you want a column for each month in which a 5-month-centered average is obtainable. I also understand the reason for one row for each region. What exactly is the meaning of the index_report_month?

Question 2: Do you have sas/ets? If so then proc expand is a simple way to get various rolling window statistics, with leading, trailing, or centered windows of user-specified size. For your situation you would run the data (apparently already sorted by region/date) through proc expand with a by region statement, followed by a proc transpose - although it is not apparent why you want a wide file instead of a narrow one.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

eroka · Posted 09-17-2018 05:26 AM

Dear mkeintz,

To answer your questions:

1. The Index Report Month is a string field that I use for each month (format of YYYY-MM) but NOT as a date field. I have found these to be so tricky that it is far better in almost all cases for me to convert the month I am working on to this format. So in this exercise I am relying on Sort as a text - not as a date. Therefore I want to avoid any date calculations as they are really redundant to what I am trying to achieve.

I am under the impression that the solution would be using a "set off" point from the MonthX (=Index_Report_Month) - i.e. climbing up and down two cells from the target month observation itself. I am not familiar how that can be done, as it asking to calculate a mean on 5 target points. I hope you guys can advise on this but if you think the solution will benefit from using a date as a date as a hook - I am not against such thinking.

2. I have Enterprise Guide, and I am not familiar with SAS/ets or Proc Expand.

I tried running an example (http://support.sas.com/documentation/cdl/en/etsug/63939/HTML/default/viewer.htm#etsug_expand_sect034...) but my SAS returned:

ERROR: Procedure EXPAND not found.

The transposed version is easier for a human to read and is more compact on screen, allowing to compare the rates in different region.

I do not mind if the solution will be done vertically (before I transpose the data or after, but there is obviously an advantage to the vertical as then I can create two matrices - one with the rolling average and one with the original data.

I hope this is helpful!

eroka · Posted 09-17-2018 05:35 AM

I forgot to add:

The Index_Report_Month is the month in which the report is ran, so it is irrelevant for process. It's just informs the reader in which month data the data set was correct for. Every month I get a new data set that one more time point added to it - which happened to be two months before the current month (i.e. in August I get the data correct for June, and The Index Report Month would be "2018-06").

eroka · Posted 09-17-2018 09:23 AM

Hi,

I have managed to work out the means logic using LAG statement.

However I am struggling how to implement it within a loop that is limited to the region (i.e to apply this logic only to the BY group of Region_Name.

This I am sure is a fairly simple loop but I am not sure how to loop between By groups.

Your help would be greatly appreciated!

The logic I would want SAS to implement within each Region-Name is as follows:

proc sort data = Test_Lag;
by Region_Name Index_Month ;

Data Test_Lag ;
set Test_Lag ;
X+1 ;
Prev_Rate = Lag (Index);
If Prev_Rate = . then Prev_Rate = Index ;
Prev2_Rate = Lag2 (Index);
If Prev2_Rate = . then Prev2_Rate = Prev_Rate ;

proc sort data = Test_Lag;
by descending x;

Data Test_Lag ;
set Test_Lag ;
Next_Rate = Lag (Index);
If Next_Rate = . then Next_Rate = Index ;
Next2_Rate = Lag2 (Index);
If Next2_Rate = . then Next2_Rate = Next_Rate ;

proc sort data = Test_Lag;
by descending Index_Month;

Data Test_Lag ;
set Test_Lag ;
Mean_5_Months = mean (Prev2_Rate, Prev_Rate, Index, Next_Rate, Next2_Rate );
run;

mkeintz · Posted 09-17-2018 12:29 PM

This problem is ready-made for a 2-dimensional array in which the row dimension is indexed by year and the column dimension is indexed by month.

This program is edited. See what happens when you don't bother making a data step, and I don't test it against data?

data have;
  input Date :ddmmyy10. Region_Name :$15. Index ;
  format date yymmddn8.;
datalines;
01/01/2016 City_of_London 94.01813825 
01/02/2016 City_of_London 97.70019819 
01/03/2016 City_of_London 101.4148808 
01/04/2016 City_of_London 102.7631184 
01/05/2016 City_of_London 107.900446 
01/06/2016 City_of_London 115.8611285 
01/07/2016 City_of_London 116.3539602 
01/08/2016 City_of_London 108.2436922 
01/09/2016 City_of_London 99.59944053 
01/10/2016 City_of_London 101.1481822 
01/11/2016 City_of_London 104.4657271 
01/12/2016 City_of_London 103.5881838 
01/01/2017 City_of_London 104.6752866 
01/02/2017 City_of_London 104.7212834 
01/03/2017 City_of_London 108.9676798 
01/04/2017 City_of_London 115.6290033 
01/05/2017 City_of_London 119.9377345 
01/06/2017 City_of_London 118.7978333 
run;
data want (drop=sum_ix index);
  set have;
  by region_name notsorted;
  array ix_mean {2016:2018,1:12}
           m2016_01-m2016_12  
           m2017_01-m2017_12  
           m2018_01-m2018_12     ;
  retain m20: sum_ix;

  if first.region_name then call missing(sum_ix,of ix_mean{*});
  sum_ix + index - coalesce(lag5(index),0);
  
  if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
  if last.region_name;
run;

data want (drop=sum_ix);
  set have;
  by region notsorted;
  array means {2016:2018,1,12}
           m2016_01-m2016_12  
           m2017_01-m2017_12  
           m2018_01-m2018_12     ;
  retain m20: sum_ix;

  if first.region then call missing(sum_ix,of ix_mean{*});
  sum_ix + index - coalesce(lag5(index),0);
  
    if lag4(region)=region then mean{year(date),month(date)}=sum_index/5;
  if last.regon;
run;

This program assumes that, for each region the data are sorted by date, but the regional data-groups can be in any order (hence the "by region notsorted"). It also assumes you have no holes in the data - otherwise if (say) a record for 2017/04 is missing, then (1) you will have no average for m2017_04, and (2) the results for m2017_05 through m2017_08 will erroneously have a 6-month old component.

Notes:

Be sure to define array bounds to cover all the expected dates.
Since the only time a record is output is when the last record for a region is encountered, the value of DATE will provide the most-recent "vintage" of your data.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

eroka · Posted 09-17-2018 01:12 PM

Hi mkeintz,
Thank you for your suggestion!

My SAS log returns:

26 data Index_MATRIX_201806_Test_1 (drop=sum_ix);
27 set Index_MATRIX_201806_Test;
28 by region notsorted;
29 if first.region then call missing(sum_ix,of ix_mean{*});
ERROR: Undeclared array referenced: ix_mean.
ERROR: The ARRAYNAME[*] specification requires an array.
30 sum_ix + index - coalesce(lag5(index),0);
31
32 array means {2016:2018,1,12}
33 m2016_01-m2016_12
34 m2017_01-m2017_12
35 m2018_01-m2018_12 ;
36 retain m20: ;
37
38 if lag4(region)=region then mean{year(date),month(date)=sum_index/5;
_
388
76
ERROR: Undeclared array referenced: mean.
ERROR 388-185: Expecting an arithmetic operator.

ERROR 76-322: Syntax error, statement will be ignored.

39 if last.regon;
40 run;

Any suggestions around the IX_MEAN and MEAN that are undeclared?...

Edo

eroka · Posted 09-18-2018 07:14 AM

Dear MKeintz,

It doesn't work. The error that I get is

ERROR: Array subscript out of range at line 39 column 41.

If I break the code into two and run the first part:

data Want (drop=sum_ix index);
set Have ;
by region_name notsorted;
array ix_mean {2016:2018,1:12}
m2016_01-m2016_12
m2017_01-m2017_12
m2018_01-m2018_12 ;
retain m20: sum_ix;
run;

Then I get a huge data set that has all the date range you created in the columns (see attached).

The dataset has 23k records (as many as my observation and then columns as many as you have defined (36 I think).

Once I run the second part:

data WANT_1
; set WANT;

if first.region_name then call missing(sum_ix,of ix_mean{*});
sum_ix + index - coalesce(lag5(index),0);
if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
if last.region_name;
run;

Then I get more specific errors:

44 data WANT_1
47 ; set WANT;
49 if first.region_name then call missing(sum_ix,of ix_mean{*});
ERROR: Undeclared array referenced: ix_mean.
ERROR: The ARRAYNAME[*] specification requires an array.
50 sum_ix + index - coalesce(lag5(index),0);
51
52 if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
ERROR: Undeclared array referenced: ix_mean.
ERROR: Variable ix_mean has not been declared as an array.
53 if last.region_name;
54 run;

I am not familiar with working with arrays to be honest so this issue with "undeclared array" is an odd one for me - as I can see that you have declared it in:

array ix_mean {2016:2018,1:12}
m2016_01-m2016_12
m2017_01-m2017_12
m2018_01-m2018_12 ;
retain m20: sum_ix;

Any idea what is going wrong...?

BTW - my data has no missing observations.

Thank you fro your help!

Edo

Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

Re: Creating a 5 month rolling average - vertically or horizontally (observations or variables)

SAS Innovate 2025: Call for Content

Classroom Training Available!