BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Jacob3
Calcite | Level 5

I have a dataset of analyst forecasts. I am wanting to produce a rolling average for each firm. Normally, this would be simple and I could accomplish using EXPAND or using an accumulator variable and dividing by the count. However, In my case an analyst can make more than one forecast in a period. If they do, I only want one of their forecasts to go into the rolling average at a time. I am copying below an example of the data.

 

DATEFIRMIDANALYSTIDESTIMATEAverage(want)
1/1/2015110.50.5
1/1/2015120.60.55
1/4/2015130.70.6
1/10/2015120.80.666666667
1/12/2015130.90.733333333
1/12/20151110.9

 

Up until (and including) 1/4/2015 it is a normal rolling average. However, starting with 1/10/2015 previous estimates drop out of the average. Specifically, the 1/1/2015 estimate by analyst 2 should no longer be included once the 1/10/2015 estimate by analyst 2 is now included.

 

The data I copied in is just an example. In reality, it goes across many years, firms, and analysts. I would want the average to reset for each firm-quarter. However, if I can get the logic for my example data, I believe I can extend it to the broader dataset.

 

I appreciate any help you have to offer.

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @Jacob3  Now you are talking my friend. Both your anaylse an username have values that are not in sequential numbers like your original and those are some large numbers , so of course array index presumption didn't work.

 

So one way is to simply raise the array subscript value which I did to 999999 and calculate avgs for non missings, so here it is. Let me know if it is getting close. Temp array is by far the fastest and arguably the best approach, but if the index values(of username and analyse) are awefully large and alphanumeric, we could jump to Hash approach. But try this and let me know

 

Also, Your latest reads all as numeric, but this conversion from char to num is no biggie as long as your analyse/username are not alphanumeric. If alphanumeric, Hash would be our sweetheart to date.

 



data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
cards;
1 073367 2017 1 20799 0.090 . 24643 . 
2 076852 2017 1 20807 0.380 . 24643 . 
3 114449 2017 1 20808 0.456 . 24643 . 
4 073867 2017 1 20808 -0.170 . 24643 . 
5 130621 2017 1 20823 0.370 . 24643 . 
6 125794 2017 1 20824 0.030 . 24643 . 
7 115965 2017 1 20828 0.040 . 24643 . 
8 114449 2017 1 20835 0.211 . 24643 . 
9 109116 2017 1 20836 0.300 . 24643 . 
10 125794 2017 1 20837 0.210 . 24643 . 
11 073367 2017 1 20837 -0.120 . 24643 . 
12 073367 2017 1 20837 -0.090 . 24643 . 
13 073867 2017 1 20838 0.190 . 24643 . 
14 . 2017 1 20842 . 0.20 24643 372497 
15 . 2017 1 20842 . 0.19 24643 379066 
16 . 2017 1 20842 . 0.20 24643 51442 
17 . 2017 1 20843 . 0.19 24643 368592 
18 . 2017 1 20843 . 0.20 24643 372497 
19 . 2017 1 20843 . 0.22 24643 68250 
20 109116 2017 2 20843 0.360 . 24643 . 
21 115965 2017 2 20844 0.300 . 24643 . 
22 073867 2017 2 20844 0.840 . 24643 . 
23 130621 2017 2 20846 0.540 . 24643 . 
24 125794 2017 2 20849 0.460 . 24643 . 
25 076852 2017 2 20850 0.230 . 24643 . 
;
run;

proc contents data=have;
run;


data want2;
array t (999999)_temporary_;
array j(999999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
if n(of t(*))>0 then avg1=mean(of t(*));
if n(of j(*))>0 then avg2=mean(of j(*));

output;
end;
call missing(of t(*),of j(*));

run;

PS

 

Always post the nearest fake sample, otherwise it's gonna be a ding dong. The pendulum or seesaw never stops

 

 

View solution in original post

10 REPLIES 10
novinosrin
Tourmaline | Level 20

I'd suggest post your real data(mocked of course) to challenge the community 🙂

 

For your sample

 

data have;
input DATE :mmddyy10.	FIRMID	ANALYSTID	ESTIMATE	;
format date mmddyy10.;
cards;
1/1/2015	1	1	0.5	0.5
1/1/2015	1	2	0.6	0.55
1/4/2015	1	3	0.7	0.6
1/10/2015	1	2	0.8	0.666666667
1/12/2015	1	3	0.9	0.733333333
1/12/2015	1	1	1	0.9
;

data want;
array t (9999)_temporary_;
do until(last.firmid);
set have;
by FIRMID;
t(analystid)=estimate;
avg=mean(of t(*));
output;
end;
call missing(of t(*));
run;
Jacob3
Calcite | Level 5

Thanks for your reply novinosrin!

 

The real data is of the form:

 

DATEYearQuarterFIRMIDANALYSTIDESTIMATEAverage(want)ANALYSTID2ESTIMATE2Average2(want)
1/1/201520151110.50.5100.60.6
1/1/201520151120.60.55..0.6
1/4/201520151130.70.6120.80.7
1/10/201520151120.80.6666666671310.8
1/12/201520151130.90.733333333100.90.9
1/12/2015201511110.9130.90.866666667
4/1/201520152110.50.5100.60.6
4/1/201520152120.60.55..0.6
4/4/201520152130.70.6120.80.7
4/10/201520152120.80.6666666671310.8
4/12/201520152130.90.733333333100.90.9
4/12/2015201521110.9130.90.866666667
7/1/201520153110.50.5100.60.6
7/1/201520153120.60.55..0.6
7/4/201520153130.70.6120.80.7
7/10/201520153120.80.6666666671310.8
7/12/201520153130.90.733333333100.90.9
7/12/2015201531110.9130.90.866666667
10/1/201520154110.50.5100.60.6
10/1/201520154120.60.55..0.6
10/4/201520154130.70.6120.80.7
10/10/201520154120.80.6666666671310.8
10/12/201520154130.90.733333333100.90.9
10/12/2015201541110.9130.90.866666667
1/1/201620161110.50.5100.60.6
1/1/201620161120.60.55..0.6
1/4/201620161130.70.6120.80.7
1/10/201620161120.80.6666666671310.8
1/12/201620161130.90.733333333100.90.9
1/12/2016201611110.9130.90.866666667
1/1/201520151210.50.5100.60.6
1/1/201520151220.60.55..0.6
1/4/201520151230.70.6120.80.7
1/10/201520151220.80.6666666671310.8
1/12/201520151230.90.733333333100.90.9
1/12/2015201512110.9130.90.866666667

 

Where I want the average to be computed within year quarter and firm for two separate types of analysts (analystid & analystid2), while omitting past values of estimates from the moving average when the same analyst makes a new estimate.

novinosrin
Tourmaline | Level 20

Hello @Jacob3  Still the same logic, HTH

 

data have(drop=avg:) expected_want;
input DATE :mmddyy10. Year	Quarter	FIRMID	ANALYSTID	ESTIMATE avg1  ANALYSTID2	ESTIMATE2 avg2;
format date mmddyy10.;
cards;
1/1/2015	2015	1	1	1	0.5	0.5	10	0.6	0.6
1/1/2015	2015	1	1	2	0.6	0.55	.	.	0.6
1/4/2015	2015	1	1	3	0.7	0.6	12	0.8	0.7
1/10/2015	2015	1	1	2	0.8	0.666666667	13	1	0.8
1/12/2015	2015	1	1	3	0.9	0.733333333	10	0.9	0.9
1/12/2015	2015	1	1	1	1	0.9	13	0.9	0.866666667
4/1/2015	2015	2	1	1	0.5	0.5	10	0.6	0.6
4/1/2015	2015	2	1	2	0.6	0.55	.	.	0.6
4/4/2015	2015	2	1	3	0.7	0.6	12	0.8	0.7
4/10/2015	2015	2	1	2	0.8	0.666666667	13	1	0.8
4/12/2015	2015	2	1	3	0.9	0.733333333	10	0.9	0.9
4/12/2015	2015	2	1	1	1	0.9	13	0.9	0.866666667
7/1/2015	2015	3	1	1	0.5	0.5	10	0.6	0.6
7/1/2015	2015	3	1	2	0.6	0.55	.	.	0.6
7/4/2015	2015	3	1	3	0.7	0.6	12	0.8	0.7
7/10/2015	2015	3	1	2	0.8	0.666666667	13	1	0.8
7/12/2015	2015	3	1	3	0.9	0.733333333	10	0.9	0.9
7/12/2015	2015	3	1	1	1	0.9	13	0.9	0.866666667
10/1/2015	2015	4	1	1	0.5	0.5	10	0.6	0.6
10/1/2015	2015	4	1	2	0.6	0.55	.	.	0.6
10/4/2015	2015	4	1	3	0.7	0.6	12	0.8	0.7
10/10/2015	2015	4	1	2	0.8	0.666666667	13	1	0.8
10/12/2015	2015	4	1	3	0.9	0.733333333	10	0.9	0.9
10/12/2015	2015	4	1	1	1	0.9	13	0.9	0.866666667
1/1/2016	2016	1	1	1	0.5	0.5	10	0.6	0.6
1/1/2016	2016	1	1	2	0.6	0.55	.	.	0.6
1/4/2016	2016	1	1	3	0.7	0.6	12	0.8	0.7
1/10/2016	2016	1	1	2	0.8	0.666666667	13	1	0.8
1/12/2016	2016	1	1	3	0.9	0.733333333	10	0.9	0.9
1/12/2016	2016	1	1	1	1	0.9	13	0.9	0.866666667
1/1/2015	2015	1	2	1	0.5	0.5	10	0.6	0.6
1/1/2015	2015	1	2	2	0.6	0.55	.	.	0.6
1/4/2015	2015	1	2	3	0.7	0.6	12	0.8	0.7
1/10/2015	2015	1	2	2	0.8	0.666666667	13	1	0.8
1/12/2015	2015	1	2	3	0.9	0.733333333	10	0.9	0.9
1/12/2015	2015	1	2	1	1	0.9	13	0.9	0.866666667
;

data want;
array t (9999)_temporary_;
array j(9999)_temporary_;
do until(last.quarter);
set have;
by FIRMID year quarter;
if not missing(analystid) then t(analystid)=estimate;
if not missing(analystid2) then j(analystid2)=estimate2;
avg1=mean(of t(*));
avg2=mean(of j(*));
output;
end;
call missing(of t(*),of j(*));
run;

 

Jacob3
Calcite | Level 5

Sorry to bother, but I keep getting "ERROR: Array subscript out of range". I can run the code 2 times back to back and it will give me different lines at which this error is occurring.  Do you have any advice on how to remedy this?

novinosrin
Tourmaline | Level 20

No bother at all. If your sample is truly a good representative one, that should work as it does for the test have and want. I'll try to look into it tomorrow

Jacob3
Calcite | Level 5

Thanks again.

 

The only things I can discern that are different between my sample and "have" is that firmid and analystid2 are character variables in my dataset.

novinosrin
Tourmaline | Level 20

Ah ok @Jacob3 

 

Please do me a favor by thoroughly checking variable attributes and the values of a few records and message back plz

Jacob3
Calcite | Level 5

I kept the first 25 observations from my dataset and read it into sas inline. The below does not work.

data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;

cards;
 
1 073367 2017 1 20799 0.090 . 24643 . 
2 076852 2017 1 20807 0.380 . 24643 . 
3 114449 2017 1 20808 0.456 . 24643 . 
4 073867 2017 1 20808 -0.170 . 24643 . 
5 130621 2017 1 20823 0.370 . 24643 . 
6 125794 2017 1 20824 0.030 . 24643 . 
7 115965 2017 1 20828 0.040 . 24643 . 
8 114449 2017 1 20835 0.211 . 24643 . 
9 109116 2017 1 20836 0.300 . 24643 . 
10 125794 2017 1 20837 0.210 . 24643 . 
11 073367 2017 1 20837 -0.120 . 24643 . 
12 073367 2017 1 20837 -0.090 . 24643 . 
13 073867 2017 1 20838 0.190 . 24643 . 
14 . 2017 1 20842 . 0.20 24643 372497 
15 . 2017 1 20842 . 0.19 24643 379066 
16 . 2017 1 20842 . 0.20 24643 51442 
17 . 2017 1 20843 . 0.19 24643 368592 
18 . 2017 1 20843 . 0.20 24643 372497 
19 . 2017 1 20843 . 0.22 24643 68250 
20 109116 2017 2 20843 0.360 . 24643 . 
21 115965 2017 2 20844 0.300 . 24643 . 
22 073867 2017 2 20844 0.840 . 24643 . 
23 130621 2017 2 20846 0.540 . 24643 . 
24 125794 2017 2 20849 0.460 . 24643 . 
25 076852 2017 2 20850 0.230 . 24643 . 
;
run;

proc contents data=have;
run;


data want2;
array t (9999)_temporary_;
array j(9999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
avg1=mean(of t(*));
avg2=mean(of j(*));

output;
end;
call missing(of t(*),of j(*));

run;

 

Below I am attaching my log file in case that helps. I am probably missing something silly.

 

71
72 data have;
73 input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
74
75 cards;

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.HAVE has 25 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds


102 ;
103 run;
104
105 proc contents data=have;
106 run;

NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.07 seconds
cpu time 0.01 seconds


107
108
109 data want2;
110 array t (9999)_temporary_;
111 array j(9999)_temporary_;
112 do until(last.quarter2);
113 set have;
114 by permno1 year2 quarter2;
115 if not missing(username1) then t(username1)=value2;
116 if not missing(analys) then j(analys)=value1;
117 avg1=mean(of t(*));
118 avg2=mean(of j(*));
119
120 output;
121 end;
122 call missing(of t(*),of j(*));
123
124 run;

ERROR: Array subscript out of range at line 116 column 29.
last.quarter2=0 obs=1 ANALYS=73367 year2=2017 quarter2=1 est_date3=20799 value1=0.09 value2=.
permno1=24643 username1=. FIRST.permno1=1 LAST.permno1=0 FIRST.year2=1 LAST.year2=0 FIRST.quarter2=1
avg1=. avg2=. _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT2 may be incomplete. When this step was stopped there were 0
observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds

 

 

novinosrin
Tourmaline | Level 20

Hi @Jacob3  Now you are talking my friend. Both your anaylse an username have values that are not in sequential numbers like your original and those are some large numbers , so of course array index presumption didn't work.

 

So one way is to simply raise the array subscript value which I did to 999999 and calculate avgs for non missings, so here it is. Let me know if it is getting close. Temp array is by far the fastest and arguably the best approach, but if the index values(of username and analyse) are awefully large and alphanumeric, we could jump to Hash approach. But try this and let me know

 

Also, Your latest reads all as numeric, but this conversion from char to num is no biggie as long as your analyse/username are not alphanumeric. If alphanumeric, Hash would be our sweetheart to date.

 



data have;
input obs ANALYS year2 quarter2 est_date3 value1 value2 permno1 username1;
cards;
1 073367 2017 1 20799 0.090 . 24643 . 
2 076852 2017 1 20807 0.380 . 24643 . 
3 114449 2017 1 20808 0.456 . 24643 . 
4 073867 2017 1 20808 -0.170 . 24643 . 
5 130621 2017 1 20823 0.370 . 24643 . 
6 125794 2017 1 20824 0.030 . 24643 . 
7 115965 2017 1 20828 0.040 . 24643 . 
8 114449 2017 1 20835 0.211 . 24643 . 
9 109116 2017 1 20836 0.300 . 24643 . 
10 125794 2017 1 20837 0.210 . 24643 . 
11 073367 2017 1 20837 -0.120 . 24643 . 
12 073367 2017 1 20837 -0.090 . 24643 . 
13 073867 2017 1 20838 0.190 . 24643 . 
14 . 2017 1 20842 . 0.20 24643 372497 
15 . 2017 1 20842 . 0.19 24643 379066 
16 . 2017 1 20842 . 0.20 24643 51442 
17 . 2017 1 20843 . 0.19 24643 368592 
18 . 2017 1 20843 . 0.20 24643 372497 
19 . 2017 1 20843 . 0.22 24643 68250 
20 109116 2017 2 20843 0.360 . 24643 . 
21 115965 2017 2 20844 0.300 . 24643 . 
22 073867 2017 2 20844 0.840 . 24643 . 
23 130621 2017 2 20846 0.540 . 24643 . 
24 125794 2017 2 20849 0.460 . 24643 . 
25 076852 2017 2 20850 0.230 . 24643 . 
;
run;

proc contents data=have;
run;


data want2;
array t (999999)_temporary_;
array j(999999)_temporary_;
do until(last.quarter2);
set have;
by permno1 year2 quarter2;
if not missing(username1) then t(username1)=value2;
if not missing(analys) then j(analys)=value1;
if n(of t(*))>0 then avg1=mean(of t(*));
if n(of j(*))>0 then avg2=mean(of j(*));

output;
end;
call missing(of t(*),of j(*));

run;

PS

 

Always post the nearest fake sample, otherwise it's gonna be a ding dong. The pendulum or seesaw never stops

 

 

Jacob3
Calcite | Level 5

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2338 views
  • 0 likes
  • 2 in conversation