DATA Step, Macro, Functions and more

Rolling 5 Month Average of Central 3 Values By Group

Reply
Contributor
Posts: 30

Rolling 5 Month Average of Central 3 Values By Group

[ Edited ]

I have data sorted by ascending GROUP INFORCE and LAG, as shown below, and I'm interested in calculating a kind of "5 period rolling average" for Cum_Lag_Incrs. The way I need to compute my rolling average is by removing the Max and Min relevant observation and dividing by 3 (e.g. the "average" of the set {1,2,3,4,5} would be 3).

 

In addition to the average being based on 3 central observations of 5, it needs to come from the correct subset of data.

 

To be specific, for example: For GROUP=AA and LAG=1, Roll_Avg will be based only on the previous 5 GROUP=AA and LAG=1 values of Cum_Lag_Incrs; for GROUP=AB and LAG=2, Roll_Avg will be based only on the previous 5 GROUP=AB and LAG=2 values of Cum_Lag_Incrs; etc.

 

Note also that in data below I've shown Roll_Avg as either blank or "Calculate": I've done this because if there aren't 5 previous month of data available, then no calculation should be made.

 

Group INFORCE Lag Cum_Lag_Incrs Roll_Avg
AA 201301 0    
AA 201301 1 3.038434616  
AA 201301 2 1.13597727  
AA 201301 3 1.088584718  
AA 201302 0    
AA 201302 1 3.322640299  
AA 201302 2 1.182485247  
AA 201302 3 1.036479476  
AA 201303 0    
AA 201303 1 3.413788991  
AA 201303 2 1.120095681  
AA 201303 3 1.047491904  
AA 201304 0    
AA 201304 1 2.696971256  
AA 201304 2 1.110429149  
AA 201304 3 1.046630304  
AA 201305 0    
AA 201305 1 2.700299318  
AA 201305 2 1.111799418  
AA 201305 3 1.045353415  
AA 201306 0   Calculate
AA 201306 1 3.156808222 Calculate
AA 201306 2 1.213467751 Calculate
AA 201306 3 1.152644537 Calculate
AA 201307 0   Calculate
AA 201307 1 3.149903686 Calculate
AA 201307 2 1.236895962 Calculate
AA 201307 3 1.038549449 Calculate
AA 201308 0   Calculate
AA 201308 1 3.388799982 Calculate
AA 201308 2 1.17331458 Calculate
AA 201308 3 1.037821673 Calculate
AA 201309 0   Calculate
AA 201309 1 2.223705515 Calculate
AA 201309 2 1.107696717 Calculate
AA 201309 3 1.067502797 Calculate
AA 201310 0   Calculate
AA 201310 1 2.598374951 Calculate
AA 201310 2 1.183557366 Calculate
AA 201310 3 1.068765806 Calculate
AA 201311 0   Calculate
AA 201311 1 2.684124925 Calculate
AA 201311 2 1.882073772 Calculate
AA 201311 3 1.639452593 Calculate
AB 201301 0    
AB 201301 1 3.034393497  
AB 201301 2 1.13446642  
AB 201301 3 1.087136901  
AB 201302 0    
AB 201302 1 3.326727146  
AB 201302 2 1.183939704  
AB 201302 3 1.035100958  
AB 201303 0    
AB 201303 1 3.409248651  
AB 201303 2 1.121473399  
AB 201303 3 1.04609874  
AB 201304 0    
AB 201304 1 2.693384284  
AB 201304 2 1.108952278  
AB 201304 3 1.047917659  
AB 201305 0    
AB 201305 1 2.703620687  
AB 201305 2 1.113166932  
AB 201305 3 1.046639199  
AB 201306 0   Calculate
AB 201306 1 3.160691097 Calculate
AB 201306 2 1.214960317 Calculate
AB 201306 3 1.15111152 Calculate
AB 201307 0   Calculate
AB 201307 1 3.145714314 Calculate
AB 201307 2 1.238417344 Calculate
AB 201307 3 1.037168178 Calculate
AB 201308 0   Calculate
AB 201308 1 3.384292878 Calculate
AB 201308 2 1.171754072 Calculate
AB 201308 3 1.039098194 Calculate
AB 201309 0   Calculate
AB 201309 1 2.226440673 Calculate
AB 201309 2 1.106223481 Calculate
AB 201309 3 1.068815825 Calculate
AB 201310 0   Calculate
AB 201310 1 2.601570952 Calculate
AB 201310 2 1.181983235 Calculate
AB 201310 3 1.070080388 Calculate
AB 201311 0   Calculate
AB 201311 1 2.680555039 Calculate
AB 201311 2 1.879570614 Calculate
AB 201311 3 1.64146912 Calculate

 

For clarification, note the following: Cum_Lag_Incrs is the ratio between a cumulative total (not shown here) in Lag N+1 and Lag N where N=0,1,2,3. So Cum_Lag_Incrs is always 0 for LAG=0 since there is no cumulative total to speak of before LAG 0. It's like if you sold a guy a car on a payment plan and you tracked the ratio between cumulative payments in this month and cumulative payments in last month; in the base month (the month you sold the car) there would be no cumulative total for this guy through the previous month.

 

Also, as an example, here's how I'd calculate the Roll_Avg for GROUP=AA, INFORCE=201306, Lag=0,1,2,3. Note that the previous 5 relevant data points are the previous 5 relevant data points from which the Roll_Avg will be calculated (e.g. for GROUP=AA, INFORCE=201306, Lag=1, the relevant values are Cum_Lag_Incrs from 201301-201305).

 

Group INFORCE Lag Cum_Lag_Incrs Roll_Avg Previous 5 Relevant Data Points
AA 201306 0   0 0 0 0 0 0
AA 201306 1 3.156808222 3.020458078 3.038434616 3.322640299 3.413788991 2.696971256 2.700299318
AA 201306 2 1.213467751 =(1.13597727+1.182485247+1.120095681+1.110429149+1.111799418-1.110429149-1.182485247)/3 1.13597727 1.182485247 1.120095681 1.110429149 1.111799418
AA 201306 3 1.152644537 1.046491874 1.088584718 1.036479476 1.047491904 1.046630304 1.045353415
PROC Star
Posts: 169

Re: Rolling 5 Month Average of Central 3 Values By Group

Hi, I am little slow to grasp. May I ask a couple of questions plz-

 

1.  Can you explain the blanks in the Cum_Lag_Incrs?

2. Can you manually show an example of filling the 4 or 5 records of the calculate from the sample data or in other words, what will be the value of the calculate(roll avg) for the following records-

AA2013060 Calculate
AA20130613.156808222Calculate
AA20130621.213467751Calculate
AA20130631.152644537Calculate
Contributor
Posts: 30

Re: Rolling 5 Month Average of Central 3 Values By Group

@novinosrin 

 

Thanks for taking an interest in this question.  See my updated post above!

PROC Star
Posts: 169

Re: Rolling 5 Month Average of Central 3 Values By Group

[ Edited ]

Hi, I have got the solution below that is close, however I still need help on dividing by 3 . Please test and let me know where I haven't understood the convert logic. I will rework and fix it properly. It should be a minor fix

 

data have;

infile datalines expandtabs truncover;

input Group $   INFORCE    Lag  Cum_Lag_Incrs   ;

datalines;

AA   201301     0         

AA   201301     1    3.038434616    

AA   201301     2    1.13597727

AA   201301     3    1.088584718    

AA   201302     0         

AA   201302     1    3.322640299    

AA   201302     2    1.182485247    

AA   201302     3    1.036479476    

AA   201303     0         

AA   201303     1    3.413788991    

AA   201303     2    1.120095681    

AA   201303     3    1.047491904    

AA   201304     0         

AA   201304     1    2.696971256    

AA   201304     2    1.110429149    

AA   201304     3    1.046630304    

AA   201305     0         

AA   201305     1    2.700299318    

AA   201305     2    1.111799418    

AA   201305     3    1.045353415    

AA   201306     0                   Calculate

AA   201306     1    3.156808222     Calculate

AA   201306     2    1.213467751     Calculate

AA   201306     3    1.152644537     Calculate

AA   201307     0                   Calculate

AA   201307     1    3.149903686     Calculate

AA   201307     2    1.236895962     Calculate

AA   201307     3    1.038549449     Calculate

AA   201308     0                   Calculate

AA   201308     1    3.388799982     Calculate

AA   201308     2    1.17331458 Calculate

AA   201308     3    1.037821673     Calculate

AA   201309     0                   Calculate

AA   201309     1    2.223705515     Calculate

AA   201309     2    1.107696717     Calculate

AA   201309     3    1.067502797     Calculate

AA   201310     0                   Calculate

AA   201310     1    2.598374951     Calculate

AA   201310     2    1.183557366     Calculate

AA   201310     3    1.068765806     Calculate

AA   201311     0                   Calculate

AA   201311     1    2.684124925     Calculate

AA   201311     2    1.882073772     Calculate

AA   201311     3    1.639452593     Calculate

AB   201301     0         

AB   201301     1    3.034393497    

AB   201301     2    1.13446642

AB   201301     3    1.087136901    

AB   201302     0         

AB   201302     1    3.326727146    

AB   201302     2    1.183939704    

AB   201302     3    1.035100958    

AB   201303     0         

AB   201303     1    3.409248651    

AB   201303     2    1.121473399    

AB   201303     3    1.04609874

AB   201304     0         

AB   201304     1    2.693384284    

AB   201304     2    1.108952278    

AB   201304     3    1.047917659    

AB   201305     0         

AB   201305     1    2.703620687    

AB   201305     2    1.113166932    

AB   201305     3    1.046639199    

AB   201306     0                   Calculate

AB   201306     1    3.160691097     Calculate

AB   201306     2    1.214960317     Calculate

AB   201306     3    1.15111152 Calculate

AB   201307     0                   Calculate

AB   201307     1    3.145714314     Calculate

AB   201307     2    1.238417344     Calculate

AB   201307     3    1.037168178     Calculate

AB   201308     0                   Calculate

AB   201308     1    3.384292878     Calculate

AB   201308     2    1.171754072     Calculate

AB   201308     3    1.039098194     Calculate

AB   201309     0                   Calculate

AB   201309     1    2.226440673     Calculate

AB   201309     2    1.106223481     Calculate

AB   201309     3    1.068815825     Calculate

AB   201310     0                   Calculate

AB   201310     1    2.601570952     Calculate

AB   201310     2    1.181983235     Calculate

AB   201310     3    1.070080388     Calculate

AB   201311     0                   Calculate

AB   201311     1    2.680555039     Calculate

AB   201311     2    1.879570614     Calculate

AB   201311     3    1.64146912 Calculate

;

 

proc sort data= have out=have1;

by group lag;

run;

 

data temp;

do until(last.group);

do _n_=1 by 1 until(last.lag);

set have1;

by  group lag;

if lag=0 then do;

output;

leave;

end;

l1=lag(cum_lag_incrs);

l2=lag2(cum_lag_incrs);

l3=lag3(cum_lag_incrs);

l4=lag4(cum_lag_incrs);

l5=lag5(cum_lag_incrs);

if _n_=1 then call missing(rollavg,l1,l2,l3,l4,l5);

if _n_>=6 then rollavg=mean(l1,l2,l3,l4,l5);

output;

end;

end;

drop l1-l5;

run;

 

proc sort data= temp out=want;

by group inforce lag;

run;

  

 

Regards,

Naveen Srinivasan

 

PS I assumed you are taking one year data of 2013 and your sample data resembles very much of the actual. Assumptions might make me lose track but I love these kinda problems,so I am in it for your feedback and improvment. Thanks!

 

 

 

Super User
Posts: 17,815

Re: Rolling 5 Month Average of Central 3 Values By Group

Do you have SAS/ETS? 

If so, I recommend using PROC TIMESERIES To fill in the dataset so you have an entry for every 'period' required.

https://gist.github.com/statgeek/07a3708dee1225ceb9d4aa75daab2c52

 

Because of the Max/Min options, you're best off cacluating the mean using the array method that's illustrated here

https://gist.github.com/statgeek/27e23c015eae7953eff2

 

 

Contributor
Posts: 30

Re: Rolling 5 Month Average of Central 3 Values By Group

I do have SAS/ETS. Thanks for the tips.
Contributor
Posts: 30

Re: Rolling 5 Month Average of Central 3 Values By Group

Using the min/max array link you provided, I'm able to achieve what I want, but in a convoluted way.  

 

The biggeest issue I have is that I don't want to compute the trimmed average until there are 5 previous datapoints to base my calculation off of.  So I'm numbering the observation from oldest to newest within each grouping of LAGs from 1 and up, and then after the calculations have all occurred I'm deleting the rows on which calculations ocurred, but shouldn't have ocurred.

 

Any suggestions for how to make this code a little better?

 

data want_00;
set have;
BY Group LAG INFORCE; 
if first.LAG then count=0;
count+1;
run;


data want_01;
array p{0:4} _temporary_;
set want_00;
if first.LAG then call missing(of p{*});
lowest = min(of p{*});
highest = max(of p{*});
sum = sum(of p{*});
p{mod(_n_,5)} = Cum_Lag_Incrs;
if count< 5 then highest ="";
if count<5 then lowest ="";
if count<5 then sum="";
average = (sum-highest-lowest)/3;
run;
Super User
Posts: 17,815

Re: Rolling 5 Month Average of Central 3 Values By Group

if count < 5 then call missing(highest, lowest, sum);

You can use CALL MISSING to set multiple values to missing in one step. It doesn't matter if they're numeric or character. 

 

Just a note that you were setting the missing to "" which is missing for a character variable not a numeric variable. 

Missing for numerics are denoted with a period (.) and it should have been:

if count < 5 then highest = .;

I went with your current variable names, but SAS has FUNCTIONS with the names SUM and COUNT so I would consider using other variable names instead, if possible.

Contributor
Posts: 30

Re: Rolling 5 Month Average of Central 3 Values By Group

[ Edited ]

@novinosrin I am away from SAS until this afternoon, so I can't test until then.  But thank you for your reply!


We divide by 3 because I want the average of the 5 relevant data points AFTER I throw out the max and min observation. 

 

For example(s) (here I use integers for simplicity):

If relevant 5 Cum_Lag_Incrs was {1,3,5,7,11} then I throw out 11 and 1 and I take average of the remaining set {3,5,7} and I get (3+5+7)/3=5

 

{5,33,9,8,22} would have average (9+8+22)/3=13

{2,4,5,68,,8} would have average (4+5+8)/3=5.666

Super User
Posts: 17,815

Re: Rolling 5 Month Average of Central 3 Values By Group

@acemanhattan this is known as a trimmed mean. You can also use PROC EXPAND to see if can do a Trimmed Mean. 

Frequent Contributor
Posts: 123

Re: Rolling 5 Month Average of Central 3 Values By Group

[ Edited ]

Hi,

 

Transposing the dataset for making the calculation easy solves the complexicity. Understand my code for your solution. Validated the results, they are exactly matching as you required. 

DATA WORK.Sample_Test;
LENGTH Group $2 INFORCE 8 Lag 8 Cum_Lag_Incrs $ 11 ;
FORMAT Group $CHAR2. INFORCE BEST12. Lag BEST12. Cum_Lag_Incrs $CHAR11.;
INFORMAT Group $CHAR2. INFORCE BEST12. Lag BEST12. Cum_Lag_Incrs $CHAR11. ;
INFILE DATALINES4 DLM="," MISSOVER DSD;
INPUT Group: $CHAR2. INFORCE: BEST32. Lag: BEST32. Cum_Lag_Incrs: $CHAR11. ;
DATALINES4;
AA,201301,0, 
AA,201301,1,3.038434616
AA,201301,2,1.13597727
AA,201301,3,1.088584718
AA,201302,0, 
AA,201302,1,3.322640299
AA,201302,2,1.182485247
AA,201302,3,1.036479476
AA,201303,0, 
AA,201303,1,3.413788991
AA,201303,2,1.120095681
AA,201303,3,1.047491904
AA,201304,0, 
AA,201304,1,2.696971256
AA,201304,2,1.110429149
AA,201304,3,1.046630304
AA,201305,0, 
AA,201305,1,2.700299318
AA,201305,2,1.111799418
AA,201305,3,1.045353415
AA,201306,0, 
AA,201306,1,3.156808222
AA,201306,2,1.213467751
AA,201306,3,1.152644537
AA,201307,0, 
AA,201307,1,3.149903686
AA,201307,2,1.236895962
AA,201307,3,1.038549449
AA,201308,0, 
AA,201308,1,3.388799982
AA,201308,2,1.17331458
AA,201308,3,1.037821673
AA,201309,0, 
AA,201309,1,2.223705515
AA,201309,2,1.107696717
AA,201309,3,1.067502797
AA,201310,0, 
AA,201310,1,2.598374951
AA,201310,2,1.183557366
AA,201310,3,1.068765806
AA,201311,0, 
AA,201311,1,2.684124925
AA,201311,2,1.882073772
AA,201311,3,1.639452593
AB,201301,0, 
AB,201301,1,3.034393497
AB,201301,2,1.13446642
AB,201301,3,1.087136901
AB,201302,0, 
AB,201302,1,3.326727146
AB,201302,2,1.183939704
AB,201302,3,1.035100958
AB,201303,0, 
AB,201303,1,3.409248651
AB,201303,2,1.121473399
AB,201303,3,1.04609874
AB,201304,0, 
AB,201304,1,2.693384284
AB,201304,2,1.108952278
AB,201304,3,1.047917659
AB,201305,0, 
AB,201305,1,2.703620687
AB,201305,2,1.113166932
AB,201305,3,1.046639199
AB,201306,0, 
AB,201306,1,3.160691097
AB,201306,2,1.214960317
AB,201306,3,1.15111152
AB,201307,0, 
AB,201307,1,3.145714314
AB,201307,2,1.238417344
AB,201307,3,1.037168178
AB,201308,0, 
AB,201308,1,3.384292878
AB,201308,2,1.171754072
AB,201308,3,1.039098194
AB,201309,0, 
AB,201309,1,2.226440673
AB,201309,2,1.106223481
AB,201309,3,1.068815825
AB,201310,0, 
AB,201310,1,2.601570952
AB,201310,2,1.181983235
AB,201310,3,1.070080388
AB,201311,0, 
AB,201311,1,2.680555039
AB,201311,2,1.879570614
AB,201311,3,1.64146912
;;;;



Proc sort data=sample_test;
by Group lag inforce Cum_Lag_Incrs;
run;
data test;
set sample_test;
Cum_Lag=INPUT(Cum_Lag_Incrs,12.6);
IF MISSING(Cum_Lag) then Cum_Lag=0;
run;

PROC TRANSPOSE DATA=test prefix=INFORCE_ out=test2(DROP=_NAME_);
BY Group lag;
ID inforce;
VAR Cum_Lag;
run;

PROC SQL;
SELECT NAME INTO: variables separated by " "
	FROM DICTIONARY.COLUMNS
		WHERE LIBNAME="WORK" AND MEMNAME="TEST2" AND NAME LIKE "INFORCE%";
QUIT;
OPTIONS SYMBOLGEN Mprint MLOGIC;

%MACRO TEST();
DATA max_Val (DROP=&variables);
SET test2;
%DO I=1 %to 6;
%LET Start_Value=%SCAN("&variables",&i," ");
%LET end_Value  =%SCAN("&variables",%EVAL(&i+4)," ");
%LET New_Value  =%SCAN("&variables",%EVAL(&i+5)," ");

Required_&New_Value=(SUM(OF &Start_Value.-&end_Value.)-MAX(OF &Start_Value.-&end_Value.)-Min(OF &Start_Value.-&end_Value.))/3;
%END;
RUN;
%MEND;
%TEST();

PROC TRANSPOSE DATA=max_Val out=required(rename=(_Name_=INFORCE_ COL1=Roll_Avg));
BY Group lag;
RUN;

DATA pre(Drop=Inforce_);
set required;
INFORCE=INPUT(SCAN(INFORCE_,3,"_"),8.);
run;
proc sort data=sample_test;
by Group Lag Inforce;
Proc sort data=pre;
by Group lag inforce ;
run;
data need;
Merge sample_test pre;
by Group Lag Inforce;
RUN;
PROC SORT DATA=need;
BY Group INFORCE;
run;

 

 Thanks,

Suryakiran

Thanks,
Suryakiran
Frequent Contributor
Posts: 123

Re: Rolling 5 Month Average of Central 3 Values By Group

@acemanhattan have you looked my solution. You can transpose the original data and then use a loop to calculate the rolling average from the point where you need using simple do loop.

Thanks,
Suryakiran
Ask a Question
Discussion stats
  • 11 replies
  • 243 views
  • 8 likes
  • 4 in conversation