I still don't know if I fully understand the problem you are trying to solve, but the following matches your test data results:
proc sql noprint;
create table want as
select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid
from return as a
left join
(select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn
from return group by myear,bportid) as b
on a.myear=b.myear
where a.portid ne bportid
order by firmid,myear,portid
;
quit;
data want (drop=next_measure prev_measure measure);
set want;
by firmid myear;
set want ( firstobs = 2 keep = return rename = (return = Next_Measure) )
want ( obs = 1 drop = _all_ );
Prev_Measure = ifn( first.myear, (.), lag(return) );
Next_Measure = ifn( last.myear, (.), Next_Measure );
if not (first.myear and last.myear) then do;
if first.myear then measure=next_measure;
else measure=prev_measure;
sumreturn=sumreturn-measure;
cntreturn=cntreturn-1;
end;
average=sumreturn/cntreturn;
run;
proc sort data=want;
by myear portid firmid;
run;
Thank you Arthur (art297) for spending your valuable time helping me out. Your updated version indeed generates the desirable output. However, it will take me a while to digest what exactly is going on in the code and to make sure that I can comfortably apply to my large data set. I am really sorry that I have not made my inquiry very clear. Maybe the story below may shed some light:
Imagine that you have two portfolios (portid 1 and 2) and company A is in both portid 1 and portid2. Portid 1 also contains Companies B and C. Portid 2 contains Companies D and E. In short,
Portid 1: A, B, C
Portid 2: A, D, E
I am interested in calculating an average value for each company by averaging returns of companies in other portfolios. For example,
average for C: (A+D+E)/3, all of which come from portid 2 (A is common)
average for E: (A+B+C)/3, all of which come from portid 1
average for A in portid 1= (D+E)/2
average for A in portid 2=(B+C)/2
Overall, I have two rules: a company's average return must exclude its own return (as in A) and must only include returns from other portfolios.
I hope my objective is clearer now.
As long as your real data is limited to two portfolios I think the code will work. In the case where there is only one portfolio, it simply takes the average. In the case where there are two portfolios, it gets the revenue to removed from the other portfolio.
Thank you for your further explanations. In this project, I have 50 portfolios, but sometimes it gets as high as 400. I should then be careful about the portfolio size.
If you post an example with 3 portfolios, and the results you want, we can test whether the methodology holds.
I just realized that my program also does not work when I have more than two portfolios. See the data set below. For example, for portid=6 and firmid=620, the average should be calculated averaging returns all of companies in portid 1 through portid 5 (where myear=200302), but all of the suggestions seem to calculate 4 or 5 different average for portid=6 and firmid=620.
DATA return;
INPUT portid firmid myear return;
DATALINES;
1 100 200301 2.11
1 120 200301 3.12
1 130 200301 2.13
1 140 200301 3.67
1 100 200302 5.23
1 120 200302 6.81
1 130 200302 2.55
1 140 200302 4.21
1 150 200302 3.21
2 300 200301 1.11
2 320 200301 0.12
2 330 200301 -2.13
2 340 200301 -1.67
2 300 200302 4.23
2 320 200302 3.81
2 330 200302 1.55
2 340 200302 2.21
2 350 200302 4.23
2 100 200302 5.23
3 56 200301 3.11
3 100 200301 2.11
3 763 200301 5.13
3 412 200301 4.67
3 124 200302 3.23
3 97 200302 3.81
3 195 200302 1.55
4 740 200302 0.21
4 250 200302 -1.21
4 812 200301 -1.11
4 823 200301 7.12
5 835 200301 -3.33
5 947 200301 -4.67
5 980 200302 4.23
6 620 200302 3.81
;
run;
:smileygrin: funny.
There was a mistake in my decision, but it still works for your problem.
Here step by step (so it'll be easier to check the correctness of all steps):
Just copypaste and run it.
DATA return;
INPUT portid firmid myear return;
DATALINES;
1 100 200301 2.11
1 120 200301 3.12
1 130 200301 2.13
1 140 200301 3.67
1 100 200302 5.23
1 120 200302 6.81
1 130 200302 2.55
1 140 200302 4.21
1 150 200302 3.21
2 300 200301 1.11
2 320 200301 0.12
2 330 200301 -2.13
2 340 200301 -1.67
2 300 200302 4.23
2 320 200302 3.81
2 330 200302 1.55
2 340 200302 2.21
2 350 200302 4.23
2 100 200302 5.23
3 56 200301 3.11
3 100 200301 2.11
3 763 200301 5.13
3 412 200301 4.67
3 124 200302 3.23
3 97 200302 3.81
3 195 200302 1.55
4 740 200302 0.21
4 250 200302 -1.21
4 812 200301 -1.11
4 823 200301 7.12
5 835 200301 -3.33
5 947 200301 -4.67
5 980 200302 4.23
6 620 200302 3.81
;
run;
proc sql;
create table portfolio_averages as
select
portid,
myear,
( sum(S_portid) - S_portid ) as SUM_ret_other_portfolios,
( sum(N_portid) - N_portid ) as N_other_portid
from
(
select
portid,
myear,
sum(return) as S_portid,
count(return) as N_portid
from
return
group by
portid, myear
)
group by
myear
order by
myear, portid;
quit;
proc sql;
create table Firms_averages as
select
firmid,
myear,
SUM(return) as SUM_firmid,
count(return) as N_firmid
from
return
group by
myear, firmid
order by
myear, firmid;
quit;
/* here we add the return to the sum, because we extract it (it's used in SUM_firmid, while it shouldn't be used at all) */
proc sql;
create table RETURN as
select
r.firmid,
r.portid,
r.myear,
r.return,
(p.SUM_ret_other_portfolios - f.SUM_firmid + r.return) / (p.N_other_portid - f.N_firmid + 1) as return_estim
from
return r
left join Firms_averages f on
r.firmid = f.firmid and r.myear = f.myear
left join portfolio_averages p on
p.myear = r.myear and p.portid = r.portid
order by
portid, myear, firmid;
quit;
If you wish so, as proposed by DanielSantos, you can integrate everything in one sql query.
Just substitute the ( sql code ) of a certain table instead of it's alias in a final query, but as I mentioned above - it will be hard to check the correctness, while maximum momentary hard-disk space consumption would be the same.
I checked it manually for several firmids, portfolios and years and still think - that this will solve your task.
OK, now your last sample is a good one.
I've updated my proposal (in blue) to suit the multiplicity of the data. It's actually just a matter of grouping the data. Again, as along as there is no dups by portid, firmid and myear, I think it will do what you need in a efficiently manner.
Logic is the same from my previous post, average the cross join (with the convenient clause) aggregates by portid/myear and firmid/myear subtracting the current return, which is not to be considered.
proc sql noprint;
create table want (drop = _:) as
select a.portid, a.firmid, a.myear, a.return,
sum(b.sumreturn)-c.sumreturn+a.return as _sum, sum(b.cntreturn)-c.cntreturn+1 as _cnt,
calculated _sum / calculated _cnt as avgreturn
from
return as a,
(select portid, myear, sum(return) as sumreturn, count(return) as cntreturn from return group by portid, myear) as b,
(select firmid, myear, sum(return) as sumreturn, count(return) as cntreturn from return group by firmid, myear) as c
where a.portid ne b.portid and a.myear eq b.myear and a.firmid eq c.firmid and b.myear eq c.myear
group by a.portid, a.firmid, a.myear, a.return;
quit;
Cheers from Portugal.
Daniel Santos @ www.cgd.pt
Cool. I tested it and result on a sample coincides with mine. And it's shorter!
The best decision here
Thank you all for your invaluable help. This is really a great community dedicated to sharing and fostering our knowledge.
Best,
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.