BookmarkSubscribeRSS Feed
eroka
Calcite | Level 5

I have a data set with 3 variables:

1. Regions - a list of 20

2. Months - the data set has one entry per region per calendar month starting in 2001

3. Index Rate - Each Region each month has one index rate.

 

Instead of using the index rate per month per region, I need to use instead an average of the index rate from the previous two months and the next two months within the data set.

E.g.

For the Region London for March 2017 I need to assign that month not the index rate that is applicable for March 2017, but an average number for the index rates I have for Jan, Feb, March, April and May 2017.

For April 2017 I would need Feb to June 2017, and so forth.

As I get close to the present and I do not have 2 months in the future I will use the last month's index rate one or twice (i.e. for August 2018 I need to use Jule, August and Sept 3 times).

 

I can do this vertically (i.e. a simple 3 column data set - and then i would need to go up and down the month range) - or if I transpose the data - then with Region on the left column and months as the top headers - it will be horizontally.

 

The data looks like this:

 

DateRegion_NameIndex
01/01/2016City of London94.01813825
01/02/2016City of London97.70019819
01/03/2016City of London101.4148808
01/04/2016City of London102.7631184
01/05/2016City of London107.900446
01/06/2016City of London115.8611285
01/07/2016City of London116.3539602
01/08/2016City of London108.2436922
01/09/2016City of London99.59944053
01/10/2016City of London101.1481822
01/11/2016City of London104.4657271
01/12/2016City of London103.5881838
01/01/2017City of London104.6752866
01/02/2017City of London104.7212834
01/03/2017City of London108.9676798
01/04/2017City of London115.6290033
01/05/2017City of London119.9377345
01/06/2017City of London118.7978333

 

Or like this:

Index_Report_MonthRegionM2018_06M2018_05M2018_04M2018_03M2018_02M2018_01
2018-06BEDS136.9514135.1143135.4214135.3575135.7162134.7327
2018-06BERKS123.3576122.1503121.1386121.7117121.7369121.9318
2018-06BRISTOL129.4606128.7055129.0354129.9676130.9927130.939
2018-06BUCKS122.2244121.4971121.595122.5949124.2265124.7186
2018-06CAMBS124.9302124.736123.9808123.4378123.2373123.1942
2018-06CHESHIRE112.9974113.5666113.6045112.9975114.1681114.6474
2018-06CORNWALL114.9373113.9692112.5068113.4169113.5313113.1154

 

Can someone please advise how this can be done?

I assume I need to use RETAIN and a macro loop but I am not sure how to do this "moving means", which in excel would be a fairly simple matter.

 

Much appreciated!

Edo

10 REPLIES 10
HB
Barite | Level 11 HB
Barite | Level 11

So the 1/7/2016 figure for London is the average of 107.900446, 115.8611285, 116.3539602, 108.2436922, and 99.59944053?

 

 



Ksharp
Super User

Here is an example :

 

proc sql;
create table want as
 select *,(select sum(index) from have where date between a.date and a.date+30) as rolling_sum
  from have as a;
quit;
mkeintz
PROC Star

Question 1: It appears that you want a column for each month in which a 5-month-centered average is obtainable.  I also understand the reason for one row for each region.  What exactly is the meaning of the index_report_month?

 

Question 2: Do you have sas/ets?  If so then proc expand is a simple way to get various rolling window statistics, with leading, trailing, or centered windows of user-specified size.  For your situation you would run the data (apparently already sorted by region/date) through proc expand with a by region statement, followed by a proc transpose - although it is not apparent why you want a wide file instead of a narrow one.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
eroka
Calcite | Level 5

Dear mkeintz,

 

To answer your questions:

1. The Index Report Month is a string field that I use for each month (format of YYYY-MM) but NOT as a date field. I have found these to be so tricky that it is far better in almost all cases for me to convert the month I am working on to this format. So in this exercise I am relying on Sort as a text - not as a date. Therefore I want to avoid any date calculations as they are really redundant to what I am trying to achieve.

I am under the impression that the solution would be using a "set off" point from the MonthX (=Index_Report_Month) - i.e. climbing up and down two cells from the target month observation itself. I am not familiar how that can be done, as it asking to calculate a mean on 5 target points. I hope you guys can advise on this but if you think the solution will benefit from using a date as a date as a hook - I am not against such thinking.

 

2. I have Enterprise Guide, and I am not familiar with SAS/ets or Proc Expand.

 

I tried running an example (http://support.sas.com/documentation/cdl/en/etsug/63939/HTML/default/viewer.htm#etsug_expand_sect034...) but my SAS returned:

ERROR: Procedure EXPAND not found.

 

The transposed version is easier for a human to read and is more compact on screen, allowing to compare the rates in different region.

 

I do not mind if the solution will be done vertically (before I transpose the data or after, but there is obviously an advantage to the vertical as then I can create two matrices - one with the rolling average and one with the original data.

 

I hope this is helpful!

 

eroka
Calcite | Level 5

I forgot to add:

The Index_Report_Month is the month in which the report is ran, so it is irrelevant for process. It's just informs the reader in which month data the data set was correct for. Every month I get a new data set that one more time point added to it - which happened to be two months before the current month (i.e. in August I get the data correct for June, and The Index Report Month would be "2018-06").

eroka
Calcite | Level 5

Hi,

I have managed to work out the means logic using LAG statement.

However I am struggling how to implement it within a loop that is limited to the region (i.e to apply this logic only to the BY group of Region_Name.

This I am sure is a fairly simple loop but I am not sure how to loop between By groups.

Your help would be greatly appreciated!

 

 

The logic I would want SAS to implement within each Region-Name is as follows:

 

proc sort data = Test_Lag;
by Region_Name Index_Month ;

Data Test_Lag ;
set Test_Lag ;
X+1 ;
Prev_Rate = Lag (Index);
If Prev_Rate = . then Prev_Rate = Index ;
Prev2_Rate = Lag2 (Index);
If Prev2_Rate = . then Prev2_Rate = Prev_Rate ;

proc sort data = Test_Lag;
by descending x;

 

Data Test_Lag ;
set Test_Lag ;
Next_Rate = Lag (Index);
If Next_Rate = . then Next_Rate = Index ;
Next2_Rate = Lag2 (Index);
If Next2_Rate = . then Next2_Rate = Next_Rate ;

proc sort data = Test_Lag;
by descending Index_Month;

Data Test_Lag ;
set Test_Lag ;
Mean_5_Months = mean (Prev2_Rate, Prev_Rate, Index, Next_Rate, Next2_Rate );
run;

mkeintz
PROC Star

This problem is ready-made for a 2-dimensional array in which the row dimension is indexed by year and the column dimension is indexed by month.

 

This program is edited. See what happens when you don't bother making a data step, and I don't test it against data?

 

data have;
  input Date :ddmmyy10. Region_Name :$15. Index ;
  format date yymmddn8.;
datalines;
01/01/2016 City_of_London 94.01813825 
01/02/2016 City_of_London 97.70019819 
01/03/2016 City_of_London 101.4148808 
01/04/2016 City_of_London 102.7631184 
01/05/2016 City_of_London 107.900446 
01/06/2016 City_of_London 115.8611285 
01/07/2016 City_of_London 116.3539602 
01/08/2016 City_of_London 108.2436922 
01/09/2016 City_of_London 99.59944053 
01/10/2016 City_of_London 101.1481822 
01/11/2016 City_of_London 104.4657271 
01/12/2016 City_of_London 103.5881838 
01/01/2017 City_of_London 104.6752866 
01/02/2017 City_of_London 104.7212834 
01/03/2017 City_of_London 108.9676798 
01/04/2017 City_of_London 115.6290033 
01/05/2017 City_of_London 119.9377345 
01/06/2017 City_of_London 118.7978333 
run;
data want (drop=sum_ix index);
  set have;
  by region_name notsorted;
  array ix_mean {2016:2018,1:12}
           m2016_01-m2016_12  
           m2017_01-m2017_12  
           m2018_01-m2018_12     ;
  retain m20: sum_ix;

  if first.region_name then call missing(sum_ix,of ix_mean{*});
  sum_ix + index - coalesce(lag5(index),0);
  
  if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
  if last.region_name;
run;

 

 

data want (drop=sum_ix);
  set have;
  by region notsorted;
array means {2016:2018,1,12}
           m2016_01-m2016_12 
           m2017_01-m2017_12 
           m2018_01-m2018_12     ;
  retain m20: sum_ix;
if first.region then call missing(sum_ix,of ix_mean{*}); sum_ix + index - coalesce(lag5(index),0); if lag4(region)=region then mean{year(date),month(date)}=sum_index/5; if last.regon; run;

 

This program assumes that, for each region the data are sorted by date, but the regional data-groups can be in any order (hence the "by region notsorted").  It also assumes you have no holes in the data - otherwise if (say) a record for 2017/04 is missing, then (1) you will have no average for  m2017_04, and (2) the results for m2017_05 through m2017_08 will erroneously have a 6-month old component.

 

Notes:

  1. Be sure to define array bounds to cover all the expected dates.
  2. Since the only time a record is output is when the last record for a region is encountered, the value of DATE will provide the most-recent "vintage" of your data.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
eroka
Calcite | Level 5
Hi mkeintz,
Thank you for your suggestion!

My SAS log returns:

26 data Index_MATRIX_201806_Test_1 (drop=sum_ix);
27 set Index_MATRIX_201806_Test;
28 by region notsorted;
29 if first.region then call missing(sum_ix,of ix_mean{*});
ERROR: Undeclared array referenced: ix_mean.
ERROR: The ARRAYNAME[*] specification requires an array.
30 sum_ix + index - coalesce(lag5(index),0);
31
32 array means {2016:2018,1,12}
33 m2016_01-m2016_12
34 m2017_01-m2017_12
35 m2018_01-m2018_12 ;
36 retain m20: ;
37
38 if lag4(region)=region then mean{year(date),month(date)=sum_index/5;
_
388
76
ERROR: Undeclared array referenced: mean.
ERROR 388-185: Expecting an arithmetic operator.

ERROR 76-322: Syntax error, statement will be ignored.

39 if last.regon;
40 run;

Any suggestions around the IX_MEAN and MEAN that are undeclared?...

Edo
eroka
Calcite | Level 5

Dear MKeintz,

 

It doesn't work. The error that I get is 

ERROR: Array subscript out of range at line 39 column 41.

 

If I break the code into two and run the first part:

 

data Want (drop=sum_ix index);
set Have ;
by region_name notsorted;
array ix_mean {2016:2018,1:12}
m2016_01-m2016_12
m2017_01-m2017_12
m2018_01-m2018_12 ;
retain m20: sum_ix;
run;

 

Then I get a huge data set that has all the date range you created in the columns (see attached).

The dataset has 23k records (as many as my observation and then columns as many as you have defined (36 I think).

 

 

Once I run the second part:

 

data WANT_1
; set WANT;

if first.region_name then call missing(sum_ix,of ix_mean{*});
sum_ix + index - coalesce(lag5(index),0);
if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
if last.region_name;
run;

 

Then I get more specific errors:

44 data WANT_1
47 ; set WANT;
49 if first.region_name then call missing(sum_ix,of ix_mean{*});
ERROR: Undeclared array referenced: ix_mean.
ERROR: The ARRAYNAME[*] specification requires an array.
50 sum_ix + index - coalesce(lag5(index),0);
51
52 if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
ERROR: Undeclared array referenced: ix_mean.
ERROR: Variable ix_mean has not been declared as an array.
53 if last.region_name;
54 run;

 

I am not familiar with working with arrays to be honest so this issue with "undeclared array" is an odd one for me - as I can see that you have declared it in:

 

array ix_mean {2016:2018,1:12}
m2016_01-m2016_12 
m2017_01-m2017_12 
m2018_01-m2018_12 ;
retain m20: sum_ix;

 

Any idea what is going wrong...?

BTW - my data has no missing observations.

 

Thank you fro your help!

 

Edo

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2038 views
  • 0 likes
  • 4 in conversation