topic Re: Proc SQL with Large Data Sets in SAS Procedures

Proc SQL with Large Data Sets

finans_sas — Mon, 29 Oct 2012 19:02:16 GMT

Dear All,

Below is a portion of my data set. The code below does what I need to accomplish (which greatly benefited from your help in the past): merge these two data sets (return and need), but make sure that myear is the same, but portid is not the same. The actual sizes of the data sets are 961,896 rows X 12 columns for return, and 15,600 rows X 3,085 columns for need. When I use the actual data, the program runs for over 24 hours and then says it has to stop due to lack of hard disk space (after creating a temporary file of about 300 gb). Is there a waPry to run this PROC SQL part more efficiently (or avoid it)?

Thank you

**************************************************************************************************;

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 3.66

;

run;

**************************************************************************************************************;

proc transpose data=return out=need (drop=_name_);

var return;

id firmid;

by portid myear;

run;

************************************************************************************************************;

proc sql;

create table want as

select *

from return a, need b

where a.myear=b.myear and a.portid ne b.portid;

quit;

**********************************************************************************************************;

Re: Proc SQL with Large Data Sets

Haikuo — Mon, 29 Oct 2012 19:52:03 GMT

Hi,

Yes, Cartesian product will exhaust your resources. I haven't done much SQL join inside SAS, I usually do it via pass-through, which handles large datasets pretty well, but it would not help with your problem. So I will suggest you using Hash() approach. First make sure you have enough memory(computer RAM) to accommodate the entire 'need', then you may proceed with the following code:

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 3.66

;

run;

data want;

if _n_=1 then do;

if 0 then set need(rename=portid=np);

declare hash h(dataset:'need(rename=portid=np)', multidata:'y');

h.definekey('myear');

h.definedata(all:'y');

h.definedone();

end;

set return;

rc=h.find();

do rc=0 by 0 while (rc=0);

if portid ne np then output;

rc=h.find_next();

end;

drop np rc;

run;

Haikuo

Re: Proc SQL with Large Data Sets

ghastly_kitten — Tue, 30 Oct 2012 07:11:39 GMT

Dear finans_sas,

could you please explain the target of your manipulation?

Whether you will use hash or not, the total design of calculation looks quite ineffective and strange.

What's the target?

You have the return of the firm (firmid), but what do you want to calculate by joining the transposed table for a certain year period and portid with it's source with different portid?

Maybe we will be able to propose a better design for you research?

Re: Proc SQL with Large Data Sets

finans_sas — Tue, 30 Oct 2012 09:18:00 GMT

Thank you, Haikuo and ghastly_kitten for your help. Below is what I want to accomplish;

My objective is to calculate average returns for firmid in each portid-myear pair by using returns from the same myear, but from different firmid and portid. For example,

the average return for:

firmid=100 in portid=2 and myear=200302 (observation # 19)=(6.81 + 2.55 + 4.21 + 3.21)/4=

firmid=120 in portid=1 and myear=200301 (observation #2)=(1.11 + 0.12 -2.13 -1.67)/4=

I would really appreciate your help with this problem.

Re: Proc SQL with Large Data Sets

ghastly_kitten — Tue, 30 Oct 2012 09:46:47 GMT

Okay...But I'll need some information.

Let's assume you have a table with corteges <pid, fid, year, ret>.

The objective is to calculate

ret2 for each cortege by this mnemonic formula:

<pid', fid', y', ret'> - the cortege for which we calculate our ret2.

ret2 = avg(ret) over ( pid <> pid' , y == y', fid <> fid')

Here it will be usefull if you provide some euristics about the data.

Like "Portid could be only 1 to 5". That will help to find a better way of solving the task.

And moreover, if you'll have portid=3 in some row, will ret2 be calculated as an AVG through different portid's or it will be seperate for each value of portid<>portid' ?

Re: Proc SQL with Large Data Sets

ghastly_kitten — Tue, 30 Oct 2012 10:34:16 GMT

Here's the scheme of calculation which could be useful.

The key idea is that calculating averages is easier than producing cartesian join.

Example:

firmid=100 in portid=2 and myear=200302 (observation # 19)=(6.81 + 2.55 + 4.21 + 3.21)/4=

here you can first calculate average for ret for all firmids.

Let us assume, that we have firmid = 100 for portid = 1 and myear = 200302 with some ret value R.

A = (6.81 + 2.55 + 4.21 + 3.21 + R)/5

N = 5 (amount of values used)

Now, as we now that R shouldn't be used for calculating real average (let's denote it B) we can do such a thing

B = (A * N - R) / (N - 1);

Now the real task is harder, since we can have more that 2 portids (but at the same time we assume that there is no entries with different ret for the same firmid portid myear), and we need to calc averages over different portids. But it's quite solveable in the same manner.

So my proposal is to use this scheme:

1. Calculate averages and count (or sum and count) for each group of

proc sql;

create table Averages as

select

portid as pid,

myear,

avg(ret) as A_pid,

count(ret) as N_pid

from

return

group by

portid, myear

order by

portid, myear;

quit;

/* Now! if we need to have an average over different portid which is not equal to source (for example for firmid=100 and portid=1 there are different firms for portid = 2,3,4,5,... ; but we need the result without grouping by this portid's, then we need averages over all portid's within the same year */

proc sql;

create table Averages as

select

pid,

myear,

A_pid,

N_pid,

sum( A_pid * N_pid ) / sum (N_pid) as total_average /* that is just total average of all ret's; you don't need it, just to point the meaning */,

(sum( A_pid * N_pid ) - A_pid * N_pid ) / ( sum (N_pid) - N_pid ) as AVG_RET_for_other_pids_same_year,

sum (N_pid) - N_pid as NUM_of_terms_in_avg

from

Averages

group by

myear

order by

pid, myear;

quit;

/* yes, the grouping is only within myear and it will work, and quite fast */

2. Now, to get ret2 for each firmid we need to make "an exclusion list" for each firmid.

As by calculating the target ret2 we don't need values of firmid at all, we calculate SUM (this will help to simplify formulas) and amount:

proc sql;

create table Firmid_averages as

select

firmid as fid,

myear,

SUM(ret) as R_fid,

count(*) as N_fid

from

return

group by

firmid, myear;

quit;

3. Finally, get ret2 (which you can LEFT join to source table after that, in order to compare ret and ret2)

proc sql;

create table RETURN2 as

select

f.fid as firmid,

p.pid as potid,

f.myear,

(p.AVG_RET_for_other_pids_same_year * p.NUM_of_terms_in_avg - f.R_fid) / (p.NUM_of_terms_in_avg - f.N_fid) as RET2

from

Firmid_averages f left join

Averages p on

f.myear = p.myear

order by

portid, myear, firmid;

quit;

4. Left join this to your source if you need to see the difference between ret and ret2.

... well, the code is a bit entangled. Actually, everything could be slightly simplified if you will use sums instead of averages. But I was afraid that the idea will not be clear from the simple code. So... I used this.

Re: Proc SQL with Large Data Sets

DanielSantos — Tue, 30 Oct 2012 12:18:16 GMT

Hello.

If I got this right, it's just a matter of summing/counting the data by porid, myear and merge back with the original dataset with the correct join clause.

For this kind of data volume (under 1M) a simple two level SQL join will do the trick.

proc sql noprint;

create table want as

select a.portid, a.firmid, a.myear, a.return,(b.sumreturn/b.cntreturn) as avgreturn

from

return as a,

(select portid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by portid, myear) as b

where a.portid <> b.portid and a.myear = b.myear;

quit;

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Re: Proc SQL with Large Data Sets

ghastly_kitten — Tue, 30 Oct 2012 12:54:12 GMT

DanielSantos,

If I got the task right, this is not that easy

For each average return you need to exclude the firms with the same id from calculating averages.

Re: Proc SQL with Large Data Sets

bentleyj1 — Tue, 30 Oct 2012 14:51:09 GMT

Dataset 'need' has 3,085 variables? I hope that's a typo. If it's not, it should be no surprise you're having a performance issue.

Get rid of the thousands of variables you don't need. Either use a DATA step with a KEEP statement to specify only the (maybe) couple dozen variables you require before the join/merge or instead of SELECT * in your query Select a list of the variables you require.

Re: Proc SQL with Large Data Sets

DanielSantos — Tue, 30 Oct 2012 15:45:37 GMT

OK ghastly_kitten, you're right I missed this requirement, as long as there is no duplicates I think it's just a matter of throwing firmid in a cross join and adjusting the calculation to exclude those records, like this:

proc sql noprint;

create table want as

select a.portid, a.firmid,

(b.sumreturn-c.sumreturn+return)/(b.cntreturn-c.cntreturn+1) as avgreturn

from

return as a,

(select portid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by portid, myear) as b,

(select firmid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by firmid, myear) as c

where a.portid ne b.portid and a.myear eq b.myear and a.firmid eq c.firmid and b.myear eq c.myear;

quit;

Which will produce this, which I believe is right:

portid	firmid	myear	return	avgreturn
1	120	200301	3.12	-0.6425000
1	130	200301	2.13	-0.6425000
1	100	200301	2.11	-0.6425000
1	140	200301	3.67	-0.6425000
1	120	200302	6.81	3.2816667
1	130	200302	2.55	3.2816667
1	150	200302	3.21	3.2816667
1	140	200302	4.21	3.2816667
1	100	200302	5.23	3.2060000
2	330	200301	-2.13	2.7575000
2	300	200301	1.11	2.7575000
2	340	200301	-1.67	2.7575000
2	320	200301	0.12	2.7575000
2	300	200302	4.23	-0.6425000
2	320	200302	3.81	4.4020000
2	350	200302	4.23	4.4020000
2	340	200302	2.21	4.4020000
2	330	200302	1.55	4.4020000
2	100	200302	3.66	4.1950000

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Re: Proc SQL with Large Data Sets

art297 — Tue, 30 Oct 2012 20:50:53 GMT

I probably don't understand the problem correct but, in the chance that I do, I think a simplified version of Daniel's code together with a small datastep can solve the problem. Let me/us know.

proc sql noprint;

create table want as

select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid

from return as a

left join

(select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by myear,bportid) as b

on a.myear=b.myear

where a.portid eq bportid

order by firmid,myear,portid

;

quit;

data want;

set want;

by firmid myear;

if not (first.myear and last.myear) then do;

sumreturn=sumreturn-return;

cntreturn=cntreturn-1;

end;

average=sumreturn/cntreturn;

run;

Re: Proc SQL with Large Data Sets

finans_sas — Tue, 30 Oct 2012 22:23:34 GMT

I sincerely thank each of you for sparing your valuable time to help me out. I tried Hash( ) approach, but my hard disk ran out of space. I am still trying to implement ghastly_kitten' suggestion. Daniel's code is very efficient (takes about 1 minute instead of 24+ hours) and it works correctly in the small data set, but for some reasons, in my real data set, it is creating more than necessary observations (961,896 is the number of rows in "return", but Daniel's code is creating 47,132,904 observations). Arthur's modification solves that problem, but the average for a given firmid still contains that firm's returns. Below is Daniel's code (modified by Arthur and slightly changed by me). The only remaining task is to exclude a stock's own return in the computation of its corresponding average return (a stock's average return should be calculated by averaging all different stocks in different portfolios. In other words, for portid=1 and firmid=100, I cannot have any stocks from portid=1 and I must also exclude firmid=100 since the empirical design allow a firmid to exist in more than one portid) The final result should match Daniel's output. I will be happy to provide more detail if you would like. I am also trying to figure out how to impose the condition that a firm's return should be excluded from its average (taking me some time to figure out how Daniel did and how I can incorporate that into Arthur's code), but I would really appreciate your help in figuring this out.

proc sql noprint;

create table want as

select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid

from return as a

left join

(select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by myear,bportid) as b

on a.myear=b.myear

where a.portid ne bportid *this is where I made a slight correction;

order by firmid,myear,portid

;

quit;

data want;

set want;

by firmid myear;

if not (first.myear and last.myear) then do;

sumreturn=sumreturn-return;

cntreturn=cntreturn-1;

end;

average=sumreturn/cntreturn;

run;

proc sort data=want; by portid myear firmid; run;

Re: Proc SQL with Large Data Sets

Reeza — Tue, 30 Oct 2012 22:27:52 GMT

This is one of those cases where a macro and a loop may be more efficient than the BY method for calculating something, simply because of space and time.

1. Create your list of ID's and Years

2. Loop through doing each as required using a where clause to subset your data.

3. Calculate the stats and append to table, dropping intermediary tables

4. Output the end table.

Re: Proc SQL with Large Data Sets

art297 — Tue, 30 Oct 2012 23:02:52 GMT

I thought the latest suggested code did what you wanted. What would help is an example dataset with the averages that you want to appear.

Re: Proc SQL with Large Data Sets

finans_sas — Tue, 30 Oct 2012 23:33:25 GMT

Below is what you have requested (based on the code I have). I hope this helps. The program works great, but takes a long time to process.

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 3.66

;

run;

proc transpose data=return out=need (drop=_name_);

var return;

id firmid;

by portid myear;

run;

data need; set need;

rename myear=myear2;

rename portid=portid2;

run;

proc sql;

create table want as

select *

from return, need

where return.myear=need.myear2 and return.portid ne need.portid2;

quit;

data want (drop=i _:);;

set want;

array returns(*) _100--_350;

do i=1 to dim(returns);

if input(substr(vname(returns(i)),2),8.) eq firmid then

call missing(returns(i));

end;

average=mean(of returns(*));

run;

proc sort data=want; by portid myear firmid; run;

Re: Proc SQL with Large Data Sets

art297 — Wed, 31 Oct 2012 00:49:27 GMT

I still don't know if I fully understand the problem you are trying to solve, but the following matches your test data results:

proc sql noprint;

create table want as

select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid

from return as a

left join

(select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by myear,bportid) as b

on a.myear=b.myear

where a.portid ne bportid

order by firmid,myear,portid

;

quit;

data want (drop=next_measure prev_measure measure);

set want;

by firmid myear;

set want ( firstobs = 2 keep = return rename = (return = Next_Measure) )

want ( obs = 1 drop = _all_ );

Prev_Measure = ifn( first.myear, (.), lag(return) );

Next_Measure = ifn( last.myear, (.), Next_Measure );

if not (first.myear and last.myear) then do;

if first.myear then measure=next_measure;

else measure=prev_measure;

sumreturn=sumreturn-measure;

cntreturn=cntreturn-1;

end;

average=sumreturn/cntreturn;

run;

proc sort data=want;

by myear portid firmid;

run;

Re: Proc SQL with Large Data Sets

art297 — Wed, 31 Oct 2012 01:04:02 GMT

You apparently updated your post at the same time as I sent you a response. Take a look at what I had sent.

Re: Proc SQL with Large Data Sets

finans_sas — Wed, 31 Oct 2012 01:56:45 GMT

Thank you Arthur (art297) for spending your valuable time helping me out. Your updated version indeed generates the desirable output. However, it will take me a while to digest what exactly is going on in the code and to make sure that I can comfortably apply to my large data set. I am really sorry that I have not made my inquiry very clear. Maybe the story below may shed some light:

Imagine that you have two portfolios (portid 1 and 2) and company A is in both portid 1 and portid2. Portid 1 also contains Companies B and C. Portid 2 contains Companies D and E. In short,

Portid 1: A, B, C

Portid 2: A, D, E

I am interested in calculating an average value for each company by averaging returns of companies in other portfolios. For example,

average for C: (A+D+E)/3, all of which come from portid 2 (A is common)

average for E: (A+B+C)/3, all of which come from portid 1

average for A in portid 1= (D+E)/2

average for A in portid 2=(B+C)/2

Overall, I have two rules: a company's average return must exclude its own return (as in A) and must only include returns from other portfolios.

I hope my objective is clearer now.

Re: Proc SQL with Large Data Sets

art297 — Wed, 31 Oct 2012 02:30:11 GMT

As long as your real data is limited to two portfolios I think the code will work. In the case where there is only one portfolio, it simply takes the average. In the case where there are two portfolios, it gets the revenue to removed from the other portfolio.

Re: Proc SQL with Large Data Sets

finans_sas — Wed, 31 Oct 2012 03:10:22 GMT

Thank you for your further explanations. In this project, I have 50 portfolios, but sometimes it gets as high as 400. I should then be careful about the portfolio size.