Solved: Re: Proc SQL with Large Data Sets - Page 2

art297 · Posted 10-30-2012 08:49 PM

I still don't know if I fully understand the problem you are trying to solve, but the following matches your test data results:

proc sql noprint;

create table want as

select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid

from return as a

left join

(select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by myear,bportid) as b

on a.myear=b.myear

where a.portid ne bportid

order by firmid,myear,portid

;

quit;

data want (drop=next_measure prev_measure measure);

set want;

by firmid myear;

set want ( firstobs = 2 keep = return rename = (return = Next_Measure) )

want ( obs = 1 drop = _all_ );

Prev_Measure = ifn( first.myear, (.), lag(return) );

Next_Measure = ifn( last.myear, (.), Next_Measure );

if not (first.myear and last.myear) then do;

if first.myear then measure=next_measure;

else measure=prev_measure;

sumreturn=sumreturn-measure;

cntreturn=cntreturn-1;

end;

average=sumreturn/cntreturn;

run;

proc sort data=want;

by myear portid firmid;

run;

art297 · Posted 10-30-2012 09:04 PM

You apparently updated your post at the same time as I sent you a response. Take a look at what I had sent.

finans_sas · Posted 10-30-2012 09:56 PM

Thank you Arthur (art297) for spending your valuable time helping me out. Your updated version indeed generates the desirable output. However, it will take me a while to digest what exactly is going on in the code and to make sure that I can comfortably apply to my large data set. I am really sorry that I have not made my inquiry very clear. Maybe the story below may shed some light:

Imagine that you have two portfolios (portid 1 and 2) and company A is in both portid 1 and portid2. Portid 1 also contains Companies B and C. Portid 2 contains Companies D and E. In short,

Portid 1: A, B, C

Portid 2: A, D, E

I am interested in calculating an average value for each company by averaging returns of companies in other portfolios. For example,

average for C: (A+D+E)/3, all of which come from portid 2 (A is common)

average for E: (A+B+C)/3, all of which come from portid 1

average for A in portid 1= (D+E)/2

average for A in portid 2=(B+C)/2

Overall, I have two rules: a company's average return must exclude its own return (as in A) and must only include returns from other portfolios.

I hope my objective is clearer now.

art297 · Posted 10-30-2012 10:30 PM

As long as your real data is limited to two portfolios I think the code will work. In the case where there is only one portfolio, it simply takes the average. In the case where there are two portfolios, it gets the revenue to removed from the other portfolio.

finans_sas · Posted 10-30-2012 11:10 PM

Thank you for your further explanations. In this project, I have 50 portfolios, but sometimes it gets as high as 400. I should then be careful about the portfolio size.

art297 · Posted 10-30-2012 11:14 PM

If you post an example with 3 portfolios, and the results you want, we can test whether the methodology holds.

finans_sas · Posted 10-31-2012 12:17 AM

I just realized that my program also does not work when I have more than two portfolios. See the data set below. For example, for portid=6 and firmid=620, the average should be calculated averaging returns all of companies in portid 1 through portid 5 (where myear=200302), but all of the suggestions seem to calculate 4 or 5 different average for portid=6 and firmid=620.

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 5.23

3 56 200301 3.11

3 100 200301 2.11

3 763 200301 5.13

3 412 200301 4.67

3 124 200302 3.23

3 97 200302 3.81

3 195 200302 1.55

4 740 200302 0.21

4 250 200302 -1.21

4 812 200301 -1.11

4 823 200301 7.12

5 835 200301 -3.33

5 947 200301 -4.67

5 980 200302 4.23

6 620 200302 3.81

;

run;

ghastly_kitten · Posted 10-31-2012 03:03 AM

:smileygrin: funny.

There was a mistake in my decision, but it still works for your problem.

Here step by step (so it'll be easier to check the correctness of all steps):

Just copypaste and run it.

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 5.23

3 56 200301 3.11

3 100 200301 2.11

3 763 200301 5.13

3 412 200301 4.67

3 124 200302 3.23

3 97 200302 3.81

3 195 200302 1.55

4 740 200302 0.21

4 250 200302 -1.21

4 812 200301 -1.11

4 823 200301 7.12

5 835 200301 -3.33

5 947 200301 -4.67

5 980 200302 4.23

6 620 200302 3.81

;

run;

proc sql;

create table portfolio_averages as

select

portid,

myear,

( sum(S_portid) - S_portid ) as SUM_ret_other_portfolios,

( sum(N_portid) - N_portid ) as N_other_portid

from

(

select

portid,

myear,

sum(return) as S_portid,

count(return) as N_portid

from

return

group by

portid, myear

)

group by

myear

order by

myear, portid;

quit;

proc sql;

create table Firms_averages as

select

firmid,

myear,

SUM(return) as SUM_firmid,

count(return) as N_firmid

from

return

group by

myear, firmid

order by

myear, firmid;

quit;

/* here we add the return to the sum, because we extract it (it's used in SUM_firmid, while it shouldn't be used at all) */

proc sql;

create table RETURN as

select

r.firmid,

r.portid,

r.myear,

r.return,

(p.SUM_ret_other_portfolios - f.SUM_firmid + r.return) / (p.N_other_portid - f.N_firmid + 1) as return_estim

from

return r

left join Firms_averages f on

r.firmid = f.firmid and r.myear = f.myear

left join portfolio_averages p on

p.myear = r.myear and p.portid = r.portid

order by

portid, myear, firmid;

quit;

If you wish so, as proposed by DanielSantos, you can integrate everything in one sql query.

Just substitute the ( sql code ) of a certain table instead of it's alias in a final query, but as I mentioned above - it will be hard to check the correctness, while maximum momentary hard-disk space consumption would be the same.

I checked it manually for several firmids, portfolios and years and still think - that this will solve your task.

DanielSantos · Posted 10-31-2012 07:43 AM

OK, now your last sample is a good one.

I've updated my proposal (in blue) to suit the multiplicity of the data. It's actually just a matter of grouping the data. Again, as along as there is no dups by portid, firmid and myear, I think it will do what you need in a efficiently manner.

Logic is the same from my previous post, average the cross join (with the convenient clause) aggregates by portid/myear and firmid/myear subtracting the current return, which is not to be considered.

proc sql noprint;

create table want (drop = _:) as

select a.portid, a.firmid, a.myear, a.return,

sum(b.sumreturn)-c.sumreturn+a.return as _sum, sum(b.cntreturn)-c.cntreturn+1 as _cnt,

calculated _sum / calculated _cnt as avgreturn

from

return as a,

(select portid, myear, sum(return) as sumreturn, count(return) as cntreturn from return group by portid, myear) as b,

(select firmid, myear, sum(return) as sumreturn, count(return) as cntreturn from return group by firmid, myear) as c

where a.portid ne b.portid and a.myear eq b.myear and a.firmid eq c.firmid and b.myear eq c.myear

group by a.portid, a.firmid, a.myear, a.return;

quit;

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

ghastly_kitten · Posted 10-31-2012 08:15 AM

Cool. I tested it and result on a sample coincides with mine. And it's shorter!

The best decision here

finans_sas · Posted 11-03-2012 11:07 AM

Thank you all for your invaluable help. This is really a great community dedicated to sharing and fostering our knowledge.

Best,

Registration is open

SAS Training: Just a Click Away