Solved: Rolling Average by firm and analyst

Jacob3 · Posted 07-31-2019 10:25 AM

I have a dataset of analyst forecasts. I am wanting to produce a rolling average for each firm. Normally, this would be simple and I could accomplish using EXPAND or using an accumulator variable and dividing by the count. However, In my case an analyst can make more than one forecast in a period. If they do, I only want one of their forecasts to go into the rolling average at a time. I am copying below an example of the data.

DATE	FIRMID	ANALYSTID	ESTIMATE	Average(want)
1/1/2015	1	1	0.5	0.5
1/1/2015	1	2	0.6	0.55
1/4/2015	1	3	0.7	0.6
1/10/2015	1	2	0.8	0.666666667
1/12/2015	1	3	0.9	0.733333333
1/12/2015	1	1	1	0.9

Up until (and including) 1/4/2015 it is a normal rolling average. However, starting with 1/10/2015 previous estimates drop out of the average. Specifically, the 1/1/2015 estimate by analyst 2 should no longer be included once the 1/10/2015 estimate by analyst 2 is now included.

The data I copied in is just an example. In reality, it goes across many years, firms, and analysts. I would want the average to reset for each firm-quarter. However, if I can get the logic for my example data, I believe I can extend it to the broader dataset.

I appreciate any help you have to offer.

novinosrin · Posted 08-01-2019 10:35 AM

Hi @Jacob3 Now you are talking my friend. Both your anaylse an username have values that are not in sequential numbers like your original and those are some large numbers , so of course array index presumption didn't work.

So one way is to simply raise the array subscript value which I did to 999999 and calculate avgs for non missings, so here it is. Let me know if it is getting close. Temp array is by far the fastest and arguably the best approach, but if the index values(of username and analyse) are awefully large and alphanumeric, we could jump to Hash approach. But try this and let me know

Also, Your latest reads all as numeric, but this conversion from char to num is no biggie as long as your analyse/username are not alphanumeric. If alphanumeric, Hash would be our sweetheart to date.



data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
cards;
1 073367 2017 1 20799 0.090 . 24643 . 
2 076852 2017 1 20807 0.380 . 24643 . 
3 114449 2017 1 20808 0.456 . 24643 . 
4 073867 2017 1 20808 -0.170 . 24643 . 
5 130621 2017 1 20823 0.370 . 24643 . 
6 125794 2017 1 20824 0.030 . 24643 . 
7 115965 2017 1 20828 0.040 . 24643 . 
8 114449 2017 1 20835 0.211 . 24643 . 
9 109116 2017 1 20836 0.300 . 24643 . 
10 125794 2017 1 20837 0.210 . 24643 . 
11 073367 2017 1 20837 -0.120 . 24643 . 
12 073367 2017 1 20837 -0.090 . 24643 . 
13 073867 2017 1 20838 0.190 . 24643 . 
14 . 2017 1 20842 . 0.20 24643 372497 
15 . 2017 1 20842 . 0.19 24643 379066 
16 . 2017 1 20842 . 0.20 24643 51442 
17 . 2017 1 20843 . 0.19 24643 368592 
18 . 2017 1 20843 . 0.20 24643 372497 
19 . 2017 1 20843 . 0.22 24643 68250 
20 109116 2017 2 20843 0.360 . 24643 . 
21 115965 2017 2 20844 0.300 . 24643 . 
22 073867 2017 2 20844 0.840 . 24643 . 
23 130621 2017 2 20846 0.540 . 24643 . 
24 125794 2017 2 20849 0.460 . 24643 . 
25 076852 2017 2 20850 0.230 . 24643 . 
;
run;

proc contents data=have;
run;


data want2;
array t (999999)_temporary_;
array j(999999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
if n(of t(*))>0 then avg1=mean(of t(*));
if n(of j(*))>0 then avg2=mean(of j(*));

output;
end;
call missing(of t(*),of j(*));

run;

PS

Always post the nearest fake sample, otherwise it's gonna be a ding dong. The pendulum or seesaw never stops

View solution in original post

novinosrin · Posted 07-31-2019 10:40 AM

I'd suggest post your real data(mocked of course) to challenge the community 🙂

For your sample

data have;
input DATE :mmddyy10.	FIRMID	ANALYSTID	ESTIMATE	;
format date mmddyy10.;
cards;
1/1/2015	1	1	0.5	0.5
1/1/2015	1	2	0.6	0.55
1/4/2015	1	3	0.7	0.6
1/10/2015	1	2	0.8	0.666666667
1/12/2015	1	3	0.9	0.733333333
1/12/2015	1	1	1	0.9
;

data want;
array t (9999)_temporary_;
do until(last.firmid);
set have;
by FIRMID;
t(analystid)=estimate;
avg=mean(of t(*));
output;
end;
call missing(of t(*));
run;

Jacob3 · Posted 07-31-2019 12:50 PM

Thanks for your reply novinosrin!

The real data is of the form:

DATE	Year	Quarter	FIRMID	ANALYSTID	ESTIMATE	Average(want)	ANALYSTID2	ESTIMATE2	Average2(want)
1/1/2015	2015	1	1	1	0.5	0.5	10	0.6	0.6
1/1/2015	2015	1	1	2	0.6	0.55	.	.	0.6
1/4/2015	2015	1	1	3	0.7	0.6	12	0.8	0.7
1/10/2015	2015	1	1	2	0.8	0.666666667	13	1	0.8
1/12/2015	2015	1	1	3	0.9	0.733333333	10	0.9	0.9
1/12/2015	2015	1	1	1	1	0.9	13	0.9	0.866666667
4/1/2015	2015	2	1	1	0.5	0.5	10	0.6	0.6
4/1/2015	2015	2	1	2	0.6	0.55	.	.	0.6
4/4/2015	2015	2	1	3	0.7	0.6	12	0.8	0.7
4/10/2015	2015	2	1	2	0.8	0.666666667	13	1	0.8
4/12/2015	2015	2	1	3	0.9	0.733333333	10	0.9	0.9
4/12/2015	2015	2	1	1	1	0.9	13	0.9	0.866666667
7/1/2015	2015	3	1	1	0.5	0.5	10	0.6	0.6
7/1/2015	2015	3	1	2	0.6	0.55	.	.	0.6
7/4/2015	2015	3	1	3	0.7	0.6	12	0.8	0.7
7/10/2015	2015	3	1	2	0.8	0.666666667	13	1	0.8
7/12/2015	2015	3	1	3	0.9	0.733333333	10	0.9	0.9
7/12/2015	2015	3	1	1	1	0.9	13	0.9	0.866666667
10/1/2015	2015	4	1	1	0.5	0.5	10	0.6	0.6
10/1/2015	2015	4	1	2	0.6	0.55	.	.	0.6
10/4/2015	2015	4	1	3	0.7	0.6	12	0.8	0.7
10/10/2015	2015	4	1	2	0.8	0.666666667	13	1	0.8
10/12/2015	2015	4	1	3	0.9	0.733333333	10	0.9	0.9
10/12/2015	2015	4	1	1	1	0.9	13	0.9	0.866666667
1/1/2016	2016	1	1	1	0.5	0.5	10	0.6	0.6
1/1/2016	2016	1	1	2	0.6	0.55	.	.	0.6
1/4/2016	2016	1	1	3	0.7	0.6	12	0.8	0.7
1/10/2016	2016	1	1	2	0.8	0.666666667	13	1	0.8
1/12/2016	2016	1	1	3	0.9	0.733333333	10	0.9	0.9
1/12/2016	2016	1	1	1	1	0.9	13	0.9	0.866666667
1/1/2015	2015	1	2	1	0.5	0.5	10	0.6	0.6
1/1/2015	2015	1	2	2	0.6	0.55	.	.	0.6
1/4/2015	2015	1	2	3	0.7	0.6	12	0.8	0.7
1/10/2015	2015	1	2	2	0.8	0.666666667	13	1	0.8
1/12/2015	2015	1	2	3	0.9	0.733333333	10	0.9	0.9
1/12/2015	2015	1	2	1	1	0.9	13	0.9	0.866666667

Where I want the average to be computed within year quarter and firm for two separate types of analysts (analystid & analystid2), while omitting past values of estimates from the moving average when the same analyst makes a new estimate.

novinosrin · Posted 07-31-2019 01:14 PM

Hello @Jacob3 Still the same logic, HTH

data have(drop=avg:) expected_want;
input DATE :mmddyy10. Year	Quarter	FIRMID	ANALYSTID	ESTIMATE avg1  ANALYSTID2	ESTIMATE2 avg2;
format date mmddyy10.;
cards;
1/1/2015	2015	1	1	1	0.5	0.5	10	0.6	0.6
1/1/2015	2015	1	1	2	0.6	0.55	.	.	0.6
1/4/2015	2015	1	1	3	0.7	0.6	12	0.8	0.7
1/10/2015	2015	1	1	2	0.8	0.666666667	13	1	0.8
1/12/2015	2015	1	1	3	0.9	0.733333333	10	0.9	0.9
1/12/2015	2015	1	1	1	1	0.9	13	0.9	0.866666667
4/1/2015	2015	2	1	1	0.5	0.5	10	0.6	0.6
4/1/2015	2015	2	1	2	0.6	0.55	.	.	0.6
4/4/2015	2015	2	1	3	0.7	0.6	12	0.8	0.7
4/10/2015	2015	2	1	2	0.8	0.666666667	13	1	0.8
4/12/2015	2015	2	1	3	0.9	0.733333333	10	0.9	0.9
4/12/2015	2015	2	1	1	1	0.9	13	0.9	0.866666667
7/1/2015	2015	3	1	1	0.5	0.5	10	0.6	0.6
7/1/2015	2015	3	1	2	0.6	0.55	.	.	0.6
7/4/2015	2015	3	1	3	0.7	0.6	12	0.8	0.7
7/10/2015	2015	3	1	2	0.8	0.666666667	13	1	0.8
7/12/2015	2015	3	1	3	0.9	0.733333333	10	0.9	0.9
7/12/2015	2015	3	1	1	1	0.9	13	0.9	0.866666667
10/1/2015	2015	4	1	1	0.5	0.5	10	0.6	0.6
10/1/2015	2015	4	1	2	0.6	0.55	.	.	0.6
10/4/2015	2015	4	1	3	0.7	0.6	12	0.8	0.7
10/10/2015	2015	4	1	2	0.8	0.666666667	13	1	0.8
10/12/2015	2015	4	1	3	0.9	0.733333333	10	0.9	0.9
10/12/2015	2015	4	1	1	1	0.9	13	0.9	0.866666667
1/1/2016	2016	1	1	1	0.5	0.5	10	0.6	0.6
1/1/2016	2016	1	1	2	0.6	0.55	.	.	0.6
1/4/2016	2016	1	1	3	0.7	0.6	12	0.8	0.7
1/10/2016	2016	1	1	2	0.8	0.666666667	13	1	0.8
1/12/2016	2016	1	1	3	0.9	0.733333333	10	0.9	0.9
1/12/2016	2016	1	1	1	1	0.9	13	0.9	0.866666667
1/1/2015	2015	1	2	1	0.5	0.5	10	0.6	0.6
1/1/2015	2015	1	2	2	0.6	0.55	.	.	0.6
1/4/2015	2015	1	2	3	0.7	0.6	12	0.8	0.7
1/10/2015	2015	1	2	2	0.8	0.666666667	13	1	0.8
1/12/2015	2015	1	2	3	0.9	0.733333333	10	0.9	0.9
1/12/2015	2015	1	2	1	1	0.9	13	0.9	0.866666667
;

data want;
array t (9999)_temporary_;
array j(9999)_temporary_;
do until(last.quarter);
set have;
by FIRMID year quarter;
if not missing(analystid) then t(analystid)=estimate;
if not missing(analystid2) then j(analystid2)=estimate2;
avg1=mean(of t(*));
avg2=mean(of j(*));
output;
end;
call missing(of t(*),of j(*));
run;

Jacob3 · Posted 07-31-2019 05:02 PM

Sorry to bother, but I keep getting "ERROR: Array subscript out of range". I can run the code 2 times back to back and it will give me different lines at which this error is occurring. Do you have any advice on how to remedy this?

novinosrin · Posted 07-31-2019 09:00 PM

No bother at all. If your sample is truly a good representative one, that should work as it does for the test have and want. I'll try to look into it tomorrow

Jacob3 · Posted 08-01-2019 09:00 AM

Thanks again.

The only things I can discern that are different between my sample and "have" is that firmid and analystid2 are character variables in my dataset.

novinosrin · Posted 08-01-2019 09:32 AM

Ah ok @Jacob3

Please do me a favor by thoroughly checking variable attributes and the values of a few records and message back plz

Jacob3 · Posted 08-01-2019 10:16 AM

I kept the first 25 observations from my dataset and read it into sas inline. The below does not work.

data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;

cards;
 
1 073367 2017 1 20799 0.090 . 24643 . 
2 076852 2017 1 20807 0.380 . 24643 . 
3 114449 2017 1 20808 0.456 . 24643 . 
4 073867 2017 1 20808 -0.170 . 24643 . 
5 130621 2017 1 20823 0.370 . 24643 . 
6 125794 2017 1 20824 0.030 . 24643 . 
7 115965 2017 1 20828 0.040 . 24643 . 
8 114449 2017 1 20835 0.211 . 24643 . 
9 109116 2017 1 20836 0.300 . 24643 . 
10 125794 2017 1 20837 0.210 . 24643 . 
11 073367 2017 1 20837 -0.120 . 24643 . 
12 073367 2017 1 20837 -0.090 . 24643 . 
13 073867 2017 1 20838 0.190 . 24643 . 
14 . 2017 1 20842 . 0.20 24643 372497 
15 . 2017 1 20842 . 0.19 24643 379066 
16 . 2017 1 20842 . 0.20 24643 51442 
17 . 2017 1 20843 . 0.19 24643 368592 
18 . 2017 1 20843 . 0.20 24643 372497 
19 . 2017 1 20843 . 0.22 24643 68250 
20 109116 2017 2 20843 0.360 . 24643 . 
21 115965 2017 2 20844 0.300 . 24643 . 
22 073867 2017 2 20844 0.840 . 24643 . 
23 130621 2017 2 20846 0.540 . 24643 . 
24 125794 2017 2 20849 0.460 . 24643 . 
25 076852 2017 2 20850 0.230 . 24643 . 
;
run;

proc contents data=have;
run;


data want2;
array t (9999)_temporary_;
array j(9999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
avg1=mean(of t(*));
avg2=mean(of j(*));

output;
end;
call missing(of t(*),of j(*));

run;

Below I am attaching my log file in case that helps. I am probably missing something silly.

71
72 data have;
73 input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
74
75 cards;

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.HAVE has 25 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds

102 ;
103 run;
104
105 proc contents data=have;
106 run;

NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.07 seconds
cpu time 0.01 seconds

107
108
109 data want2;
110 array t (9999)_temporary_;
111 array j(9999)_temporary_;
112 do until(last.quarter2);
113 set have;
114 by permno1 year2 quarter2;
115 if not missing(username1) then t(username1)=value2;
116 if not missing(analys) then j(analys)=value1;
117 avg1=mean(of t(*));
118 avg2=mean(of j(*));
119
120 output;
121 end;
122 call missing(of t(*),of j(*));
123
124 run;

ERROR: Array subscript out of range at line 116 column 29.
last.quarter2=0 obs=1 ANALYS=73367 year2=2017 quarter2=1 est_date3=20799 value1=0.09 value2=.
permno1=24643 username1=. FIRST.permno1=1 LAST.permno1=0 FIRST.year2=1 LAST.year2=0 FIRST.quarter2=1
avg1=. avg2=. _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT2 may be incomplete. When this step was stopped there were 0
observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds

novinosrin · Posted 08-01-2019 10:35 AM

Hi @Jacob3 Now you are talking my friend. Both your anaylse an username have values that are not in sequential numbers like your original and those are some large numbers , so of course array index presumption didn't work.

So one way is to simply raise the array subscript value which I did to 999999 and calculate avgs for non missings, so here it is. Let me know if it is getting close. Temp array is by far the fastest and arguably the best approach, but if the index values(of username and analyse) are awefully large and alphanumeric, we could jump to Hash approach. But try this and let me know

Also, Your latest reads all as numeric, but this conversion from char to num is no biggie as long as your analyse/username are not alphanumeric. If alphanumeric, Hash would be our sweetheart to date.



data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
cards;
1 073367 2017 1 20799 0.090 . 24643 . 
2 076852 2017 1 20807 0.380 . 24643 . 
3 114449 2017 1 20808 0.456 . 24643 . 
4 073867 2017 1 20808 -0.170 . 24643 . 
5 130621 2017 1 20823 0.370 . 24643 . 
6 125794 2017 1 20824 0.030 . 24643 . 
7 115965 2017 1 20828 0.040 . 24643 . 
8 114449 2017 1 20835 0.211 . 24643 . 
9 109116 2017 1 20836 0.300 . 24643 . 
10 125794 2017 1 20837 0.210 . 24643 . 
11 073367 2017 1 20837 -0.120 . 24643 . 
12 073367 2017 1 20837 -0.090 . 24643 . 
13 073867 2017 1 20838 0.190 . 24643 . 
14 . 2017 1 20842 . 0.20 24643 372497 
15 . 2017 1 20842 . 0.19 24643 379066 
16 . 2017 1 20842 . 0.20 24643 51442 
17 . 2017 1 20843 . 0.19 24643 368592 
18 . 2017 1 20843 . 0.20 24643 372497 
19 . 2017 1 20843 . 0.22 24643 68250 
20 109116 2017 2 20843 0.360 . 24643 . 
21 115965 2017 2 20844 0.300 . 24643 . 
22 073867 2017 2 20844 0.840 . 24643 . 
23 130621 2017 2 20846 0.540 . 24643 . 
24 125794 2017 2 20849 0.460 . 24643 . 
25 076852 2017 2 20850 0.230 . 24643 . 
;
run;

proc contents data=have;
run;


data want2;
array t (999999)_temporary_;
array j(999999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
if n(of t(*))>0 then avg1=mean(of t(*));
if n(of j(*))>0 then avg2=mean(of j(*));

output;
end;
call missing(of t(*),of j(*));

run;

PS

Always post the nearest fake sample, otherwise it's gonna be a ding dong. The pendulum or seesaw never stops

Jacob3 · Posted 08-01-2019 11:54 AM

novinosrin,

It took about an hour to run, but....mission accomplished. Thank you so much for your help!

Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Re: Rolling Average by firm and analyst

Registration is open

Registration is open

SAS Training: Just a Click Away