Solved: Re: Moving Average with Multiple Observations

agoldma · Posted 09-01-2017 06:24 PM

I would like to calculate a moving average and standard errors across 6 previous time periods (for each period).

Total number of periods = 50; starting the calculations on the 7th period.

I can do this in a non-statistical way, but I was wondering if it's possible to do this with PROC MODEL or PROC MIXED or another procedure.

I have many observations per period (thousands), but I have a field called period, so I'd like to use that somehow.

I would like to avoid taking averages of averages, so I don't want to reduce my data to 1 observation per period.

If I have 2 numeric variables: X and period, how can I generate a 6-period moving average over 50 periods?

I would also like to get moving standard errors in addition to that average.

Thank you

Reeza · Posted 09-04-2017 11:10 PM

Here's a macro solution with proc means.

proc sql noprint;
select min(period) into :min_period TRIMMED
from have;
select max(period) into :max_period TRIMMED
from have;
quit;

%put &min_period;
%put &max_period;

options nomprint nosymbolgen;

%macro moving_summary(datain=, window=, dataout=);

*remove previous summary table;
proc sql;
drop table &dataout;
quit;

%do i=&min_period. %to %eval(&max_period.-&window.);
proc means data=&datain noprint;
var x;
where period between &i and %eval(&i. + &window. - 1);
output out=_temp mean=mean_x std=std_x stderr=stderr_x;
run;

data _temp;
set _temp;
start = &i;
end = &i + &window - 1;
run;

proc append base=&dataout data=_temp;
run;

proc sql;
drop table _temp;
quit;

%end;
%mend;

%moving_summary(datain=have, window=6, dataout=summary_stats);

View solution in original post

ballardw · Posted 09-01-2017 07:18 PM

You may want to provide a couple cases of data (with a few records per period not 1000's) of what the start and end result of this process may be.

Do you have access to SAS/ETS? That module has a number of timeseries related procedures for working with time structures and may help.

agoldma · Posted 09-02-2017 10:07 AM

I have SAS/ETS

The data looks like this

period X

1 3

1 7

1 1

1 3

1 9

1 0

2 4

2 5

2 6

3 1

4 8

4 9

. . .

50 0

50 3

50 7

50 5

There are multiple observations of X per period.

I would like to calculate the mean and standard error of X during periods 1-6; then during periods 2-7; then during periods 3-8 … periods 45-50

I can do this with the data step, proc means, proc sql… macro.

If you think it would it be easier to use PROC MODEL or another statistical procedure, can you please suggest a procedure along with statements and parameters

Reeza · Posted 09-02-2017 11:44 AM

PROC EXPAND is your best option.

https://gist.github.com/statgeek/07a3708dee1225ceb9d4aa75daab2c52

PeterClemmensen · Posted 09-02-2017 01:08 PM

Show us some data to get a code answer 🙂 PROC EXPAND is probably your best option. Also there are numerous articles on the subject

http://www2.sas.com/proceedings/forum2008/093-2008.pdf

http://sasnrd.com/moving-average/

http://blogs.sas.com/content/iml/2016/01/27/moving-average-in-sas.html

The DATA to DATA Step Macro
Blog: SASnrd

agoldma · Posted 09-02-2017 07:26 PM

PROC EXPAND was the first thing I considered, but it doesn't allow multiple observations per period. I can summarize my data to one average-value per period, but I don't want to take averages of averages. Also, PROC EXPAND doesn't give me standard error. I need standard error in addition to the average.

I'll go with the non-statistical methods for now.

One of draycut's links pointed to a very useful SQL suggestion by the SAS Yoda himself: rolling standard deviation calculation. He uses subqueries in the select statement all the time (I need to do more of that)

Reeza · Posted 09-02-2017 08:59 PM

Days/periods with multiple observations are going to be weighted higher anyways if you don't summarize to a single metric per time period. Depending on subject matter this may or may not be ok but did want to flag it.

Another option for rolling stats is the array method, SAS notes have an example of how that can work, you'll just need to keep track of days as well.

agoldma · Posted 09-02-2017 10:29 PM

Reeza, I'm sorry, but I can't access that github site -- my internet security is blocking it.

I see that I can make weights based on the number of observations per period

transformout=( cmovave( .1 .2 .4 .2 .1 )

This should work for a moving average, but I don't think this will work for standard error because I need real record count to divide sdt.err. by sqrt(n)

If I just wanted a moving average, I could:

summarize my data to 1 record per period
count the number of records per period
use these counts of records as weights in the transformout=( cmovave(…) statement

Although… this would probably be a manual process for every period because my weights would change with every set of 6 periods (out of 50 periods)

Reeza · Posted 09-02-2017 10:44 PM

I can't help you with this until you have the logic confirmed.
That's a logical decision that you should be making based on business knowledge, requirements and statistical soundness.

If you know the rules please post a fully worked example for at least one period.

You haven't posted sample data so I'm not sure what you're expecting besides general solutions. If you need more assistance you need to post more information, including sample data and expected output.

http://support.sas.com/kb/41/380.html

agoldma · Posted 09-03-2017 03:17 PM

I attached the full data to this post. It's practically the same as in my 9/2/2017 post above.

It's in an Excel spreadsheet -- to show the calculations on the 2nd sheet (data is on the 1st sheet)

I'm interested in the Mean and Standard Error -- both are moving 1 period at a time in groups of 6 periods.

I looked at Sample 41380, but it assumes a constant quantity of observations per period (12 in that example). I couldn't modify it to work for a variable quantity of observations per period.

Reeza · Posted 09-03-2017 10:33 PM

I'm not downloading an excel sheet, few people on public forums will. The other common approach is the SQL approach you mentioned already, that doesn't work? It won't do the weights though so you'd have to use another way if you are weighting.

agoldma · Posted 09-03-2017 10:51 PM

Here's a CSV version of that Excel file (really 2 files: data and summary)

I'm mostly done with a non-statistical solution -- using a data step, proc means, transpose, sql, and a macro.

Hoping for a statistical solution -- using a statistical procedure -- which would be more streamlined than looping with a macro.

My macro is not completely working yet, but as soon as I get it working, I'll post it here.

Reeza · Posted 09-03-2017 11:10 PM

@agoldma wrote:

I'm mostly done with a non-statistical solution -- using a data step, proc means, transpose, sql, and a macro.

Hoping for a statistical solution -- using a statistical procedure -- which would be more streamlined than looping with a macro.

I'm not sure where you've gotten your terminology from, but it's uncommon and inaccurate IMO. Trying to group procedures into 'statistical' and 'non-statistical' is not a real definition. PROC EXPAND is under SAS/ETS vs SAS/STAT so is it not a statistical procedure? PROC SQL is clearly a proc, but it can do a hell of a lot...you can probably adapt this to handle weights if you clearly define the rules.

If you don't want to summarize to a single record per event, I'm not sure there's a proc that will do what you want. Note how I created your 'sample data have'. In future posts it's helpful if you do that step so we can copy and paste it directly into a browser.

EDIT/NOTE: The code below is incorrect, it does a many to many merge, rather than an inline selection. The mean ends up ok, mostly by chance but the stderr and std will be incorrect because the N is incorrect.

data have;
    informat period x 8.;
    infile cards dlm='09'x;
    input period X;
    cards;
1   3
1   7
1   1
1   3
1   9
1   0
2   4
2   5
2   6
3   1
4   8
4   9
5   9
5   6
5   6
5   3
5   3
6   8
6   9
6   8
6   9
6   8
6   3
6   6
6   7
6   3
7   3
7   4
7   9
7   5
7   6
7   7
7   4
7   3
7   0
8   2
8   3
8   3
8   4
8   8
8   7
8   4
8   4
8   7
9   2
9   2
9   8
10  7
10  3
10  8
10  3
10  5
11  5
12  9
12  2
12  2
12  3
12  1
12  4
13  7
13  3
13  6
13  7
14  0
14  1
14  6
14  4
14  3
14  9
14  5
14  5
15  7
15  6
15  5
15  8
15  3
15  7
15  3
15  4
16  0
16  3
16  8
16  8
16  9
16  7
16  4
16  2
17  1
17  6
17  9
17  2
17  5
18  4
19  1
19  2
19  7
19  7
19  6
20  2
20  9
21  0
21  6
21  5
21  8
22  9
22  8
22  1
22  6
22  4
23  0
23  5
23  2
24  4
24  7
24  6
24  4
24  6
24  8
25  6
25  5
25  7
25  5
26  1
26  2
26  8
26  2
26  6
26  5
27  9
27  1
27  8
27  3
27  4
28  4
28  8
28  8
28  0
28  8
28  5
28  9
29  1
29  4
29  3
29  3
29  3
29  8
29  6
30  0
30  7
30  9
30  1
30  3
30  0
30  3
31  5
31  3
31  3
31  8
31  1
31  7
31  6
32  5
32  4
32  0
32  6
32  3
32  7
33  6
33  3
33  3
33  7
33  3
34  8
34  3
35  1
35  6
35  4
35  2
35  8
35  2
36  5
36  5
36  4
36  2
36  1
36  1
37  6
37  8
37  9
37  2
37  6
38  8
38  6
38  1
38  7
38  4
38  2
38  2
38  2
38  2
38  8
38  5
38  3
38  6
39  6
39  4
39  4
39  5
39  0
39  0
39  4
39  7
39  5
40  2
41  8
41  2
41  1
41  7
41  2
42  7
42  4
42  6
43  4
43  4
44  7
44  0
44  6
44  3
44  3
44  5
44  1
44  3
44  9
44  6
44  7
45  0
45  3
45  1
45  5
45  3
45  7
45  8
45  1
45  2
45  7
46  2
46  1
46  3
46  0
46  1
46  7
46  2
46  4
46  6
47  9
47  9
47  7
47  0
48  2
48  0
48  4
48  9
48  1
48  2
48  1
48  0
48  4
48  1
48  8
48  4
48  4
49  6
49  7
49  2
49  5
49  7
49  1
49  0
49  2
49  7
49  5
49  1
49  0
49  2
49  3
49  6
49  5
49  3
49  7
49  7
49  9
49  5
49  4
49  0
49  1
49  6
49  7
49  1
49  3
49  5
49  0
49  8
49  7
49  3
50  0
50  3
50  7
50  5
;
run;

proc sql ;
    create table want as 
    select h1.period as start_period, 
            h1.period+5 as end_period, 
            mean(h2.x) as avg_value, 
            std(h2.x) as std_value
        /*, insert other stats as needed, except median/percentiles*/
        from have as h1 
        left join have as h2 
        on h2.period between h1.period and h1.period+5 
        group by start_period, end_period;
quit;

Note: Link removed.

agoldma · Posted 09-04-2017 12:40 AM

Thank you for this very efficient solution.

The mean looks good, but the standard deviation is off a little, and standard error is off a lot.

If you look only at the periods 1-6 (first 26 observations)

Mean = 5.53846154 this is perfect

Standard Deviation (STD) = 2.79950368

I expected 2.845779601
PROC MEANS agrees with PROC SQL and with Excel on 2.845779

Standard Error (STDERR) = 0.22413968

this is less than half of what what I expected
PROC MEANS agrees with PROC SQL and with Excel on 0.5581033

I used these 2 procedures to check the standard deviation and standard error

proc means data=have mean std stderr maxdec=8 ;

where period between 1 and 6 ;

var X ;

run;

proc sql;

select avg(X) as Mean

,std(X) as STD format=10.8

,stderr(X) as STDERR format=10.8

from have

where period between 1 and 6

;quit;

Reeza · Posted 09-04-2017 01:06 AM

Check the documentation to see what formulas are used in SQL for Standard deviation, the denominator can vary based on your definition.

Registration is open