I have a data set with 3 variables:
1. Regions - a list of 20
2. Months - the data set has one entry per region per calendar month starting in 2001
3. Index Rate - Each Region each month has one index rate.
Instead of using the index rate per month per region, I need to use instead an average of the index rate from the previous two months and the next two months within the data set.
E.g.
For the Region London for March 2017 I need to assign that month not the index rate that is applicable for March 2017, but an average number for the index rates I have for Jan, Feb, March, April and May 2017.
For April 2017 I would need Feb to June 2017, and so forth.
As I get close to the present and I do not have 2 months in the future I will use the last month's index rate one or twice (i.e. for August 2018 I need to use Jule, August and Sept 3 times).
I can do this vertically (i.e. a simple 3 column data set - and then i would need to go up and down the month range) - or if I transpose the data - then with Region on the left column and months as the top headers - it will be horizontally.
The data looks like this:
Date | Region_Name | Index |
01/01/2016 | City of London | 94.01813825 |
01/02/2016 | City of London | 97.70019819 |
01/03/2016 | City of London | 101.4148808 |
01/04/2016 | City of London | 102.7631184 |
01/05/2016 | City of London | 107.900446 |
01/06/2016 | City of London | 115.8611285 |
01/07/2016 | City of London | 116.3539602 |
01/08/2016 | City of London | 108.2436922 |
01/09/2016 | City of London | 99.59944053 |
01/10/2016 | City of London | 101.1481822 |
01/11/2016 | City of London | 104.4657271 |
01/12/2016 | City of London | 103.5881838 |
01/01/2017 | City of London | 104.6752866 |
01/02/2017 | City of London | 104.7212834 |
01/03/2017 | City of London | 108.9676798 |
01/04/2017 | City of London | 115.6290033 |
01/05/2017 | City of London | 119.9377345 |
01/06/2017 | City of London | 118.7978333 |
Or like this:
Index_Report_Month | Region | M2018_06 | M2018_05 | M2018_04 | M2018_03 | M2018_02 | M2018_01 |
2018-06 | BEDS | 136.9514 | 135.1143 | 135.4214 | 135.3575 | 135.7162 | 134.7327 |
2018-06 | BERKS | 123.3576 | 122.1503 | 121.1386 | 121.7117 | 121.7369 | 121.9318 |
2018-06 | BRISTOL | 129.4606 | 128.7055 | 129.0354 | 129.9676 | 130.9927 | 130.939 |
2018-06 | BUCKS | 122.2244 | 121.4971 | 121.595 | 122.5949 | 124.2265 | 124.7186 |
2018-06 | CAMBS | 124.9302 | 124.736 | 123.9808 | 123.4378 | 123.2373 | 123.1942 |
2018-06 | CHESHIRE | 112.9974 | 113.5666 | 113.6045 | 112.9975 | 114.1681 | 114.6474 |
2018-06 | CORNWALL | 114.9373 | 113.9692 | 112.5068 | 113.4169 | 113.5313 | 113.1154 |
Can someone please advise how this can be done?
I assume I need to use RETAIN and a macro loop but I am not sure how to do this "moving means", which in excel would be a fairly simple matter.
Much appreciated!
Edo
So the 1/7/2016 figure for London is the average of 107.900446, 115.8611285, 116.3539602, 108.2436922, and 99.59944053?
Here is an example :
proc sql;
create table want as
select *,(select sum(index) from have where date between a.date and a.date+30) as rolling_sum
from have as a;
quit;
Question 1: It appears that you want a column for each month in which a 5-month-centered average is obtainable. I also understand the reason for one row for each region. What exactly is the meaning of the index_report_month?
Question 2: Do you have sas/ets? If so then proc expand is a simple way to get various rolling window statistics, with leading, trailing, or centered windows of user-specified size. For your situation you would run the data (apparently already sorted by region/date) through proc expand with a by region statement, followed by a proc transpose - although it is not apparent why you want a wide file instead of a narrow one.
Dear mkeintz,
To answer your questions:
1. The Index Report Month is a string field that I use for each month (format of YYYY-MM) but NOT as a date field. I have found these to be so tricky that it is far better in almost all cases for me to convert the month I am working on to this format. So in this exercise I am relying on Sort as a text - not as a date. Therefore I want to avoid any date calculations as they are really redundant to what I am trying to achieve.
I am under the impression that the solution would be using a "set off" point from the MonthX (=Index_Report_Month) - i.e. climbing up and down two cells from the target month observation itself. I am not familiar how that can be done, as it asking to calculate a mean on 5 target points. I hope you guys can advise on this but if you think the solution will benefit from using a date as a date as a hook - I am not against such thinking.
2. I have Enterprise Guide, and I am not familiar with SAS/ets or Proc Expand.
I tried running an example (http://support.sas.com/documentation/cdl/en/etsug/63939/HTML/default/viewer.htm#etsug_expand_sect034...) but my SAS returned:
ERROR: Procedure EXPAND not found.
The transposed version is easier for a human to read and is more compact on screen, allowing to compare the rates in different region.
I do not mind if the solution will be done vertically (before I transpose the data or after, but there is obviously an advantage to the vertical as then I can create two matrices - one with the rolling average and one with the original data.
I hope this is helpful!
I forgot to add:
The Index_Report_Month is the month in which the report is ran, so it is irrelevant for process. It's just informs the reader in which month data the data set was correct for. Every month I get a new data set that one more time point added to it - which happened to be two months before the current month (i.e. in August I get the data correct for June, and The Index Report Month would be "2018-06").
Hi,
I have managed to work out the means logic using LAG statement.
However I am struggling how to implement it within a loop that is limited to the region (i.e to apply this logic only to the BY group of Region_Name.
This I am sure is a fairly simple loop but I am not sure how to loop between By groups.
Your help would be greatly appreciated!
The logic I would want SAS to implement within each Region-Name is as follows:
proc sort data = Test_Lag;
by Region_Name Index_Month ;
Data Test_Lag ;
set Test_Lag ;
X+1 ;
Prev_Rate = Lag (Index);
If Prev_Rate = . then Prev_Rate = Index ;
Prev2_Rate = Lag2 (Index);
If Prev2_Rate = . then Prev2_Rate = Prev_Rate ;
proc sort data = Test_Lag;
by descending x;
Data Test_Lag ;
set Test_Lag ;
Next_Rate = Lag (Index);
If Next_Rate = . then Next_Rate = Index ;
Next2_Rate = Lag2 (Index);
If Next2_Rate = . then Next2_Rate = Next_Rate ;
proc sort data = Test_Lag;
by descending Index_Month;
Data Test_Lag ;
set Test_Lag ;
Mean_5_Months = mean (Prev2_Rate, Prev_Rate, Index, Next_Rate, Next2_Rate );
run;
This problem is ready-made for a 2-dimensional array in which the row dimension is indexed by year and the column dimension is indexed by month.
This program is edited. See what happens when you don't bother making a data step, and I don't test it against data?
data have;
input Date :ddmmyy10. Region_Name :$15. Index ;
format date yymmddn8.;
datalines;
01/01/2016 City_of_London 94.01813825
01/02/2016 City_of_London 97.70019819
01/03/2016 City_of_London 101.4148808
01/04/2016 City_of_London 102.7631184
01/05/2016 City_of_London 107.900446
01/06/2016 City_of_London 115.8611285
01/07/2016 City_of_London 116.3539602
01/08/2016 City_of_London 108.2436922
01/09/2016 City_of_London 99.59944053
01/10/2016 City_of_London 101.1481822
01/11/2016 City_of_London 104.4657271
01/12/2016 City_of_London 103.5881838
01/01/2017 City_of_London 104.6752866
01/02/2017 City_of_London 104.7212834
01/03/2017 City_of_London 108.9676798
01/04/2017 City_of_London 115.6290033
01/05/2017 City_of_London 119.9377345
01/06/2017 City_of_London 118.7978333
run;
data want (drop=sum_ix index);
set have;
by region_name notsorted;
array ix_mean {2016:2018,1:12}
m2016_01-m2016_12
m2017_01-m2017_12
m2018_01-m2018_12 ;
retain m20: sum_ix;
if first.region_name then call missing(sum_ix,of ix_mean{*});
sum_ix + index - coalesce(lag5(index),0);
if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
if last.region_name;
run;
data want (drop=sum_ix);
set have;
by region notsorted;
array means {2016:2018,1,12}
m2016_01-m2016_12
m2017_01-m2017_12
m2018_01-m2018_12 ;
retain m20: sum_ix;
if first.region then call missing(sum_ix,of ix_mean{*});
sum_ix + index - coalesce(lag5(index),0);
if lag4(region)=region then mean{year(date),month(date)}=sum_index/5;
if last.regon;
run;
This program assumes that, for each region the data are sorted by date, but the regional data-groups can be in any order (hence the "by region notsorted"). It also assumes you have no holes in the data - otherwise if (say) a record for 2017/04 is missing, then (1) you will have no average for m2017_04, and (2) the results for m2017_05 through m2017_08 will erroneously have a 6-month old component.
Notes:
Dear MKeintz,
It doesn't work. The error that I get is
ERROR: Array subscript out of range at line 39 column 41.
If I break the code into two and run the first part:
data Want (drop=sum_ix index);
set Have ;
by region_name notsorted;
array ix_mean {2016:2018,1:12}
m2016_01-m2016_12
m2017_01-m2017_12
m2018_01-m2018_12 ;
retain m20: sum_ix;
run;
Then I get a huge data set that has all the date range you created in the columns (see attached).
The dataset has 23k records (as many as my observation and then columns as many as you have defined (36 I think).
Once I run the second part:
data WANT_1
; set WANT;
if first.region_name then call missing(sum_ix,of ix_mean{*});
sum_ix + index - coalesce(lag5(index),0);
if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
if last.region_name;
run;
Then I get more specific errors:
44 data WANT_1
47 ; set WANT;
49 if first.region_name then call missing(sum_ix,of ix_mean{*});
ERROR: Undeclared array referenced: ix_mean.
ERROR: The ARRAYNAME[*] specification requires an array.
50 sum_ix + index - coalesce(lag5(index),0);
51
52 if lag4(region_name)=region_name then ix_mean{year(date),month(date)}=sum_ix/5;
ERROR: Undeclared array referenced: ix_mean.
ERROR: Variable ix_mean has not been declared as an array.
53 if last.region_name;
54 run;
I am not familiar with working with arrays to be honest so this issue with "undeclared array" is an odd one for me - as I can see that you have declared it in:
array ix_mean {2016:2018,1:12}
m2016_01-m2016_12
m2017_01-m2017_12
m2018_01-m2018_12 ;
retain m20: sum_ix;
Any idea what is going wrong...?
BTW - my data has no missing observations.
Thank you fro your help!
Edo
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.