I have a dataset of analyst forecasts. I am wanting to produce a rolling average for each firm. Normally, this would be simple and I could accomplish using EXPAND or using an accumulator variable and dividing by the count. However, In my case an analyst can make more than one forecast in a period. If they do, I only want one of their forecasts to go into the rolling average at a time. I am copying below an example of the data.
DATE | FIRMID | ANALYSTID | ESTIMATE | Average(want) |
1/1/2015 | 1 | 1 | 0.5 | 0.5 |
1/1/2015 | 1 | 2 | 0.6 | 0.55 |
1/4/2015 | 1 | 3 | 0.7 | 0.6 |
1/10/2015 | 1 | 2 | 0.8 | 0.666666667 |
1/12/2015 | 1 | 3 | 0.9 | 0.733333333 |
1/12/2015 | 1 | 1 | 1 | 0.9 |
Up until (and including) 1/4/2015 it is a normal rolling average. However, starting with 1/10/2015 previous estimates drop out of the average. Specifically, the 1/1/2015 estimate by analyst 2 should no longer be included once the 1/10/2015 estimate by analyst 2 is now included.
The data I copied in is just an example. In reality, it goes across many years, firms, and analysts. I would want the average to reset for each firm-quarter. However, if I can get the logic for my example data, I believe I can extend it to the broader dataset.
I appreciate any help you have to offer.
Hi @Jacob3 Now you are talking my friend. Both your anaylse an username have values that are not in sequential numbers like your original and those are some large numbers , so of course array index presumption didn't work.
So one way is to simply raise the array subscript value which I did to 999999 and calculate avgs for non missings, so here it is. Let me know if it is getting close. Temp array is by far the fastest and arguably the best approach, but if the index values(of username and analyse) are awefully large and alphanumeric, we could jump to Hash approach. But try this and let me know
Also, Your latest reads all as numeric, but this conversion from char to num is no biggie as long as your analyse/username are not alphanumeric. If alphanumeric, Hash would be our sweetheart to date.
data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
cards;
1 073367 2017 1 20799 0.090 . 24643 .
2 076852 2017 1 20807 0.380 . 24643 .
3 114449 2017 1 20808 0.456 . 24643 .
4 073867 2017 1 20808 -0.170 . 24643 .
5 130621 2017 1 20823 0.370 . 24643 .
6 125794 2017 1 20824 0.030 . 24643 .
7 115965 2017 1 20828 0.040 . 24643 .
8 114449 2017 1 20835 0.211 . 24643 .
9 109116 2017 1 20836 0.300 . 24643 .
10 125794 2017 1 20837 0.210 . 24643 .
11 073367 2017 1 20837 -0.120 . 24643 .
12 073367 2017 1 20837 -0.090 . 24643 .
13 073867 2017 1 20838 0.190 . 24643 .
14 . 2017 1 20842 . 0.20 24643 372497
15 . 2017 1 20842 . 0.19 24643 379066
16 . 2017 1 20842 . 0.20 24643 51442
17 . 2017 1 20843 . 0.19 24643 368592
18 . 2017 1 20843 . 0.20 24643 372497
19 . 2017 1 20843 . 0.22 24643 68250
20 109116 2017 2 20843 0.360 . 24643 .
21 115965 2017 2 20844 0.300 . 24643 .
22 073867 2017 2 20844 0.840 . 24643 .
23 130621 2017 2 20846 0.540 . 24643 .
24 125794 2017 2 20849 0.460 . 24643 .
25 076852 2017 2 20850 0.230 . 24643 .
;
run;
proc contents data=have;
run;
data want2;
array t (999999)_temporary_;
array j(999999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
if n(of t(*))>0 then avg1=mean(of t(*));
if n(of j(*))>0 then avg2=mean(of j(*));
output;
end;
call missing(of t(*),of j(*));
run;
PS
Always post the nearest fake sample, otherwise it's gonna be a ding dong. The pendulum or seesaw never stops
I'd suggest post your real data(mocked of course) to challenge the community 🙂
For your sample
data have;
input DATE :mmddyy10. FIRMID ANALYSTID ESTIMATE ;
format date mmddyy10.;
cards;
1/1/2015 1 1 0.5 0.5
1/1/2015 1 2 0.6 0.55
1/4/2015 1 3 0.7 0.6
1/10/2015 1 2 0.8 0.666666667
1/12/2015 1 3 0.9 0.733333333
1/12/2015 1 1 1 0.9
;
data want;
array t (9999)_temporary_;
do until(last.firmid);
set have;
by FIRMID;
t(analystid)=estimate;
avg=mean(of t(*));
output;
end;
call missing(of t(*));
run;
Thanks for your reply novinosrin!
The real data is of the form:
DATE | Year | Quarter | FIRMID | ANALYSTID | ESTIMATE | Average(want) | ANALYSTID2 | ESTIMATE2 | Average2(want) |
1/1/2015 | 2015 | 1 | 1 | 1 | 0.5 | 0.5 | 10 | 0.6 | 0.6 |
1/1/2015 | 2015 | 1 | 1 | 2 | 0.6 | 0.55 | . | . | 0.6 |
1/4/2015 | 2015 | 1 | 1 | 3 | 0.7 | 0.6 | 12 | 0.8 | 0.7 |
1/10/2015 | 2015 | 1 | 1 | 2 | 0.8 | 0.666666667 | 13 | 1 | 0.8 |
1/12/2015 | 2015 | 1 | 1 | 3 | 0.9 | 0.733333333 | 10 | 0.9 | 0.9 |
1/12/2015 | 2015 | 1 | 1 | 1 | 1 | 0.9 | 13 | 0.9 | 0.866666667 |
4/1/2015 | 2015 | 2 | 1 | 1 | 0.5 | 0.5 | 10 | 0.6 | 0.6 |
4/1/2015 | 2015 | 2 | 1 | 2 | 0.6 | 0.55 | . | . | 0.6 |
4/4/2015 | 2015 | 2 | 1 | 3 | 0.7 | 0.6 | 12 | 0.8 | 0.7 |
4/10/2015 | 2015 | 2 | 1 | 2 | 0.8 | 0.666666667 | 13 | 1 | 0.8 |
4/12/2015 | 2015 | 2 | 1 | 3 | 0.9 | 0.733333333 | 10 | 0.9 | 0.9 |
4/12/2015 | 2015 | 2 | 1 | 1 | 1 | 0.9 | 13 | 0.9 | 0.866666667 |
7/1/2015 | 2015 | 3 | 1 | 1 | 0.5 | 0.5 | 10 | 0.6 | 0.6 |
7/1/2015 | 2015 | 3 | 1 | 2 | 0.6 | 0.55 | . | . | 0.6 |
7/4/2015 | 2015 | 3 | 1 | 3 | 0.7 | 0.6 | 12 | 0.8 | 0.7 |
7/10/2015 | 2015 | 3 | 1 | 2 | 0.8 | 0.666666667 | 13 | 1 | 0.8 |
7/12/2015 | 2015 | 3 | 1 | 3 | 0.9 | 0.733333333 | 10 | 0.9 | 0.9 |
7/12/2015 | 2015 | 3 | 1 | 1 | 1 | 0.9 | 13 | 0.9 | 0.866666667 |
10/1/2015 | 2015 | 4 | 1 | 1 | 0.5 | 0.5 | 10 | 0.6 | 0.6 |
10/1/2015 | 2015 | 4 | 1 | 2 | 0.6 | 0.55 | . | . | 0.6 |
10/4/2015 | 2015 | 4 | 1 | 3 | 0.7 | 0.6 | 12 | 0.8 | 0.7 |
10/10/2015 | 2015 | 4 | 1 | 2 | 0.8 | 0.666666667 | 13 | 1 | 0.8 |
10/12/2015 | 2015 | 4 | 1 | 3 | 0.9 | 0.733333333 | 10 | 0.9 | 0.9 |
10/12/2015 | 2015 | 4 | 1 | 1 | 1 | 0.9 | 13 | 0.9 | 0.866666667 |
1/1/2016 | 2016 | 1 | 1 | 1 | 0.5 | 0.5 | 10 | 0.6 | 0.6 |
1/1/2016 | 2016 | 1 | 1 | 2 | 0.6 | 0.55 | . | . | 0.6 |
1/4/2016 | 2016 | 1 | 1 | 3 | 0.7 | 0.6 | 12 | 0.8 | 0.7 |
1/10/2016 | 2016 | 1 | 1 | 2 | 0.8 | 0.666666667 | 13 | 1 | 0.8 |
1/12/2016 | 2016 | 1 | 1 | 3 | 0.9 | 0.733333333 | 10 | 0.9 | 0.9 |
1/12/2016 | 2016 | 1 | 1 | 1 | 1 | 0.9 | 13 | 0.9 | 0.866666667 |
1/1/2015 | 2015 | 1 | 2 | 1 | 0.5 | 0.5 | 10 | 0.6 | 0.6 |
1/1/2015 | 2015 | 1 | 2 | 2 | 0.6 | 0.55 | . | . | 0.6 |
1/4/2015 | 2015 | 1 | 2 | 3 | 0.7 | 0.6 | 12 | 0.8 | 0.7 |
1/10/2015 | 2015 | 1 | 2 | 2 | 0.8 | 0.666666667 | 13 | 1 | 0.8 |
1/12/2015 | 2015 | 1 | 2 | 3 | 0.9 | 0.733333333 | 10 | 0.9 | 0.9 |
1/12/2015 | 2015 | 1 | 2 | 1 | 1 | 0.9 | 13 | 0.9 | 0.866666667 |
Where I want the average to be computed within year quarter and firm for two separate types of analysts (analystid & analystid2), while omitting past values of estimates from the moving average when the same analyst makes a new estimate.
Hello @Jacob3 Still the same logic, HTH
data have(drop=avg:) expected_want;
input DATE :mmddyy10. Year Quarter FIRMID ANALYSTID ESTIMATE avg1 ANALYSTID2 ESTIMATE2 avg2;
format date mmddyy10.;
cards;
1/1/2015 2015 1 1 1 0.5 0.5 10 0.6 0.6
1/1/2015 2015 1 1 2 0.6 0.55 . . 0.6
1/4/2015 2015 1 1 3 0.7 0.6 12 0.8 0.7
1/10/2015 2015 1 1 2 0.8 0.666666667 13 1 0.8
1/12/2015 2015 1 1 3 0.9 0.733333333 10 0.9 0.9
1/12/2015 2015 1 1 1 1 0.9 13 0.9 0.866666667
4/1/2015 2015 2 1 1 0.5 0.5 10 0.6 0.6
4/1/2015 2015 2 1 2 0.6 0.55 . . 0.6
4/4/2015 2015 2 1 3 0.7 0.6 12 0.8 0.7
4/10/2015 2015 2 1 2 0.8 0.666666667 13 1 0.8
4/12/2015 2015 2 1 3 0.9 0.733333333 10 0.9 0.9
4/12/2015 2015 2 1 1 1 0.9 13 0.9 0.866666667
7/1/2015 2015 3 1 1 0.5 0.5 10 0.6 0.6
7/1/2015 2015 3 1 2 0.6 0.55 . . 0.6
7/4/2015 2015 3 1 3 0.7 0.6 12 0.8 0.7
7/10/2015 2015 3 1 2 0.8 0.666666667 13 1 0.8
7/12/2015 2015 3 1 3 0.9 0.733333333 10 0.9 0.9
7/12/2015 2015 3 1 1 1 0.9 13 0.9 0.866666667
10/1/2015 2015 4 1 1 0.5 0.5 10 0.6 0.6
10/1/2015 2015 4 1 2 0.6 0.55 . . 0.6
10/4/2015 2015 4 1 3 0.7 0.6 12 0.8 0.7
10/10/2015 2015 4 1 2 0.8 0.666666667 13 1 0.8
10/12/2015 2015 4 1 3 0.9 0.733333333 10 0.9 0.9
10/12/2015 2015 4 1 1 1 0.9 13 0.9 0.866666667
1/1/2016 2016 1 1 1 0.5 0.5 10 0.6 0.6
1/1/2016 2016 1 1 2 0.6 0.55 . . 0.6
1/4/2016 2016 1 1 3 0.7 0.6 12 0.8 0.7
1/10/2016 2016 1 1 2 0.8 0.666666667 13 1 0.8
1/12/2016 2016 1 1 3 0.9 0.733333333 10 0.9 0.9
1/12/2016 2016 1 1 1 1 0.9 13 0.9 0.866666667
1/1/2015 2015 1 2 1 0.5 0.5 10 0.6 0.6
1/1/2015 2015 1 2 2 0.6 0.55 . . 0.6
1/4/2015 2015 1 2 3 0.7 0.6 12 0.8 0.7
1/10/2015 2015 1 2 2 0.8 0.666666667 13 1 0.8
1/12/2015 2015 1 2 3 0.9 0.733333333 10 0.9 0.9
1/12/2015 2015 1 2 1 1 0.9 13 0.9 0.866666667
;
data want;
array t (9999)_temporary_;
array j(9999)_temporary_;
do until(last.quarter);
set have;
by FIRMID year quarter;
if not missing(analystid) then t(analystid)=estimate;
if not missing(analystid2) then j(analystid2)=estimate2;
avg1=mean(of t(*));
avg2=mean(of j(*));
output;
end;
call missing(of t(*),of j(*));
run;
Sorry to bother, but I keep getting "ERROR: Array subscript out of range". I can run the code 2 times back to back and it will give me different lines at which this error is occurring. Do you have any advice on how to remedy this?
No bother at all. If your sample is truly a good representative one, that should work as it does for the test have and want. I'll try to look into it tomorrow
Thanks again.
The only things I can discern that are different between my sample and "have" is that firmid and analystid2 are character variables in my dataset.
Ah ok @Jacob3
Please do me a favor by thoroughly checking variable attributes and the values of a few records and message back plz
I kept the first 25 observations from my dataset and read it into sas inline. The below does not work.
data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
cards;
1 073367 2017 1 20799 0.090 . 24643 .
2 076852 2017 1 20807 0.380 . 24643 .
3 114449 2017 1 20808 0.456 . 24643 .
4 073867 2017 1 20808 -0.170 . 24643 .
5 130621 2017 1 20823 0.370 . 24643 .
6 125794 2017 1 20824 0.030 . 24643 .
7 115965 2017 1 20828 0.040 . 24643 .
8 114449 2017 1 20835 0.211 . 24643 .
9 109116 2017 1 20836 0.300 . 24643 .
10 125794 2017 1 20837 0.210 . 24643 .
11 073367 2017 1 20837 -0.120 . 24643 .
12 073367 2017 1 20837 -0.090 . 24643 .
13 073867 2017 1 20838 0.190 . 24643 .
14 . 2017 1 20842 . 0.20 24643 372497
15 . 2017 1 20842 . 0.19 24643 379066
16 . 2017 1 20842 . 0.20 24643 51442
17 . 2017 1 20843 . 0.19 24643 368592
18 . 2017 1 20843 . 0.20 24643 372497
19 . 2017 1 20843 . 0.22 24643 68250
20 109116 2017 2 20843 0.360 . 24643 .
21 115965 2017 2 20844 0.300 . 24643 .
22 073867 2017 2 20844 0.840 . 24643 .
23 130621 2017 2 20846 0.540 . 24643 .
24 125794 2017 2 20849 0.460 . 24643 .
25 076852 2017 2 20850 0.230 . 24643 .
;
run;
proc contents data=have;
run;
data want2;
array t (9999)_temporary_;
array j(9999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
avg1=mean(of t(*));
avg2=mean(of j(*));
output;
end;
call missing(of t(*),of j(*));
run;
Below I am attaching my log file in case that helps. I am probably missing something silly.
71
72 data have;
73 input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
74
75 cards;
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.HAVE has 25 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
102 ;
103 run;
104
105 proc contents data=have;
106 run;
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.07 seconds
cpu time 0.01 seconds
107
108
109 data want2;
110 array t (9999)_temporary_;
111 array j(9999)_temporary_;
112 do until(last.quarter2);
113 set have;
114 by permno1 year2 quarter2;
115 if not missing(username1) then t(username1)=value2;
116 if not missing(analys) then j(analys)=value1;
117 avg1=mean(of t(*));
118 avg2=mean(of j(*));
119
120 output;
121 end;
122 call missing(of t(*),of j(*));
123
124 run;
ERROR: Array subscript out of range at line 116 column 29.
last.quarter2=0 obs=1 ANALYS=73367 year2=2017 quarter2=1 est_date3=20799 value1=0.09 value2=.
permno1=24643 username1=. FIRST.permno1=1 LAST.permno1=0 FIRST.year2=1 LAST.year2=0 FIRST.quarter2=1
avg1=. avg2=. _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT2 may be incomplete. When this step was stopped there were 0
observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
Hi @Jacob3 Now you are talking my friend. Both your anaylse an username have values that are not in sequential numbers like your original and those are some large numbers , so of course array index presumption didn't work.
So one way is to simply raise the array subscript value which I did to 999999 and calculate avgs for non missings, so here it is. Let me know if it is getting close. Temp array is by far the fastest and arguably the best approach, but if the index values(of username and analyse) are awefully large and alphanumeric, we could jump to Hash approach. But try this and let me know
Also, Your latest reads all as numeric, but this conversion from char to num is no biggie as long as your analyse/username are not alphanumeric. If alphanumeric, Hash would be our sweetheart to date.
data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
cards;
1 073367 2017 1 20799 0.090 . 24643 .
2 076852 2017 1 20807 0.380 . 24643 .
3 114449 2017 1 20808 0.456 . 24643 .
4 073867 2017 1 20808 -0.170 . 24643 .
5 130621 2017 1 20823 0.370 . 24643 .
6 125794 2017 1 20824 0.030 . 24643 .
7 115965 2017 1 20828 0.040 . 24643 .
8 114449 2017 1 20835 0.211 . 24643 .
9 109116 2017 1 20836 0.300 . 24643 .
10 125794 2017 1 20837 0.210 . 24643 .
11 073367 2017 1 20837 -0.120 . 24643 .
12 073367 2017 1 20837 -0.090 . 24643 .
13 073867 2017 1 20838 0.190 . 24643 .
14 . 2017 1 20842 . 0.20 24643 372497
15 . 2017 1 20842 . 0.19 24643 379066
16 . 2017 1 20842 . 0.20 24643 51442
17 . 2017 1 20843 . 0.19 24643 368592
18 . 2017 1 20843 . 0.20 24643 372497
19 . 2017 1 20843 . 0.22 24643 68250
20 109116 2017 2 20843 0.360 . 24643 .
21 115965 2017 2 20844 0.300 . 24643 .
22 073867 2017 2 20844 0.840 . 24643 .
23 130621 2017 2 20846 0.540 . 24643 .
24 125794 2017 2 20849 0.460 . 24643 .
25 076852 2017 2 20850 0.230 . 24643 .
;
run;
proc contents data=have;
run;
data want2;
array t (999999)_temporary_;
array j(999999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
if n(of t(*))>0 then avg1=mean(of t(*));
if n(of j(*))>0 then avg2=mean(of j(*));
output;
end;
call missing(of t(*),of j(*));
run;
PS
Always post the nearest fake sample, otherwise it's gonna be a ding dong. The pendulum or seesaw never stops
It took about an hour to run, but....mission accomplished. Thank you so much for your help!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.