Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Proc SQL with Large Data Sets

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

10-29-2012 03:02 PM

Dear All,

Below is a portion of my data set. The code below does what I need to accomplish (which greatly benefited from your help in the past): merge these two data sets (return and need), but make sure that myear is the same, but portid is not the same. The actual sizes of the data sets are 961,896 rows X 12 columns for return, and 15,600 rows X 3,085 columns for need. When I use the actual data, the program runs for over 24 hours and then says it has to stop due to lack of hard disk space (after creating a temporary file of about 300 gb). Is there a waPry to run this PROC SQL part more efficiently (or avoid it)?

Thank you

**************************************************************************************************;

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 3.66

;

run;

**************************************************************************************************************;

proc transpose data=return out=need (drop=_name_);

var return;

id firmid;

by portid myear;

run;

************************************************************************************************************;

proc sql;

create table want as

select *

from return a, need b

where a.myear=b.myear and a.portid ne b.portid;

quit;

**********************************************************************************************************;

Accepted Solutions

Solution

10-31-2012
07:43 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-31-2012 07:43 AM

OK, now your last sample is a good one.

I've updated my proposal (in blue) to suit the multiplicity of the data. It's actually just a matter of grouping the data. Again, as along as there is no dups by portid, firmid and myear, I think it will do what you need in a efficiently manner.

Logic is the same from my previous post, average the cross join (with the convenient clause) aggregates by portid/myear and firmid/myear subtracting the current return, which is not to be considered.

**proc** **sql** noprint;

create table want (**drop = _:**) as

select a.portid, a.firmid, a.myear, a.return,

** sum(b.sumreturn)-c.sumreturn+a.return as _sum, sum(b.cntreturn)-c.cntreturn+1 as _cnt,**

** calculated _sum / calculated _cnt as avgreturn**

from

return as a,

(select portid, myear, sum(return) as sumreturn, count(return) as cntreturn from return group by portid, myear) as b,

(select firmid, myear, sum(return) as sumreturn, count(return) as cntreturn from return group by firmid, myear) as c

where a.portid ne b.portid and a.myear eq b.myear and a.firmid eq c.firmid and b.myear eq c.myear

**group by a.portid, a.firmid, a.myear, a.return;**

**quit**;

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-29-2012 03:52 PM

Hi,

Yes, Cartesian product will exhaust your resources. I haven't done much SQL join inside SAS, I usually do it via pass-through, which handles large datasets pretty well, but it would not help with your problem. So I will suggest you using Hash() approach. First make sure you have enough memory(computer RAM) to accommodate the entire 'need', then you may proceed with the following code:

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 3.66

;

run;

data want;

if _n_=1 then do;

if 0 then set need(rename=portid=np);

declare hash h(dataset:'need(rename=portid=np)', multidata:'y');

h.definekey('myear');

h.definedata(all:'y');

h.definedone();

end;

set return;

rc=h.find();

do rc=0 by 0 while (rc=0);

if portid ne np then output;

rc=h.find_next();

end;

drop np rc;

run;

Haikuo

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 03:11 AM

Dear finans_sas,

could you please explain the target of your manipulation?

Whether you will use hash or not, the total design of calculation looks quite ineffective and strange.

What's the target?

You have the **return** of the firm (firmid), but what do you want to calculate by joining the transposed table for a certain year period and **portid** with it's source with different **portid?**

Maybe we will be able to propose a better design for you research?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 05:18 AM

Thank you, Haikuo and ghastly_kitten for your help. Below is what I want to accomplish;

My objective is to calculate average returns for firmid in each portid-myear pair by using returns from the same myear, but from different firmid and portid. For example,

the average return for:

firmid=100 in portid=2 and myear=200302 (observation # 19)=(6.81 + 2.55 + 4.21 + 3.21)/4=

firmid=120 in portid=1 and myear=200301 (observation #2)=(1.11 + 0.12 -2.13 -1.67)/4=

I would really appreciate your help with this problem.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 05:46 AM

Okay...But I'll need some information.

Let's assume you have a table with corteges <pid, fid, year, ret>.

The objective is to calculate

ret2 for each cortege by this mnemonic formula:

<pid', fid', y', ret'> - the cortege for which we calculate our ret2.

ret2 = avg(ret) over ( pid <> pid' , y == y', fid <> fid')

Here it will be usefull if you provide some euristics about the data.

Like "Portid could be only 1 to 5". That will help to find a better way of solving the task.

And moreover, if you'll have portid=3 in some row, will ret2 be calculated as an AVG through different portid's or it will be seperate for each value of portid<>portid' ?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 06:34 AM

Here's the scheme of calculation which could be useful.

The key idea is that calculating averages is easier than producing cartesian join.

Example:

firmid=100 in portid=2 and myear=200302 (observation # 19)=(6.81 + 2.55 + 4.21 + 3.21)/4=

here you can first calculate average for ret for all firmids.

Let us assume, that we have firmid = 100 for portid = 1 and myear = 200302 with some ret value **R.**

A = (6.81 + 2.55 + 4.21 + 3.21 **+ R**)/5

N = 5 (amount of values used)

Now, as we now that **R **shouldn't be used for calculating real average (let's denote it **B**) we can do such a thing

**B = (**A * N - R) / (N - 1);

Now the real task is harder, since we can have more that 2 portids (but at the same time we assume that there is no entries with different **ret **for the same **firmid portid myear**), and we need to calc averages over different portids. But it's quite solveable in the same manner.

**So my proposal is to use this scheme:**

1. Calculate averages and count (or sum and count) for each group of

proc sql;

create table Averages as

select

portid as pid,

myear,

avg(ret) as A_pid,

count(ret) as N_pid

from

return

group by

portid, myear

order by

portid, myear;

quit;

/* Now! if we need to have an average over different portid which is not equal to source (for example for firmid=100 and portid=1 there are different firms for portid = 2,3,4,5,... ; but we need the result without grouping by this portid's, then we need averages over all portid's within the same year */

proc sql;

create table Averages as

select

pid,

myear,

A_pid,

N_pid,

sum( A_pid * N_pid ) / sum (N_pid) as total_average /* that is just total average of all ret's; you don't need it, just to point the meaning */,

(sum( A_pid * N_pid ) - A_pid * N_pid ) / ( sum (N_pid) - N_pid ) as AVG_RET_for_other_pids_same_year,

sum (N_pid) - N_pid as NUM_of_terms_in_avg

from

Averages

group by

myear

order by

pid, myear;

quit;

/* yes, the grouping is only within **myear **and it will work, and quite fast */

2. Now, to get ret2 for each firmid we need to make "an exclusion list" for each firmid.

As by calculating the target ret2 we don't need values of firmid at all, we calculate SUM (this will help to simplify formulas) and amount:

proc sql;

create table Firmid_averages as

select

firmid as fid,

myear,

SUM(ret) as R_fid,

count(*) as N_fid

from

return

group by

firmid, myear;

quit;

3. Finally, get ret2 (which you can **LEFT** join to source table after that, in order to compare ret and ret2)

proc sql;

create table RETURN2 as

select

f.fid as firmid,

p.pid as potid,

f.myear,

(p.AVG_RET_for_other_pids_same_year * p.NUM_of_terms_in_avg - f.R_fid) / (p.NUM_of_terms_in_avg - f.N_fid) as RET2

from

Firmid_averages f left join

Averages p on

f.myear = p.myear

order by

portid, myear, firmid;

quit;

4. Left join this to your source if you need to see the difference between ret and ret2.

... well, the code is a bit entangled. Actually, everything could be slightly simplified if you will use sums instead of averages. But I was afraid that the idea will not be clear from the simple code. So... I used this.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 08:18 AM

Hello.

If I got this right, it's just a matter of summing/counting the data by porid, myear and merge back with the original dataset with the correct join clause.

For this kind of data volume (under 1M) a simple two level SQL join will do the trick.

**proc** **sql** noprint;

create table want as

select a.portid, a.firmid, a.myear, a.return,(b.sumreturn/b.cntreturn) as avgreturn

from

return as a,

(select portid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by portid, myear) as b

where a.portid <> b.portid and a.myear = b.myear;

**quit**;

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to DanielSantos

10-30-2012 08:54 AM

DanielSantos,

If **I **got the task right, this is not that easy

For each average return you need to exclude the firms with the same id from calculating averages.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ghastly_kitten

10-30-2012 11:45 AM

OK ghastly_kitten, you're right I missed this requirement, as long as there is no duplicates I think it's just a matter of throwing firmid in a cross join and adjusting the calculation to exclude those records, like this:

**proc** **sql** noprint;

create table want as

select a.portid, a.firmid,

(b.sumreturn-c.sumreturn+return)/(b.cntreturn-c.cntreturn+**1**) as avgreturn

from

return as a,

(select portid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by portid, myear) as b,

(select firmid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by firmid, myear) as c

where a.portid ne b.portid and a.myear eq b.myear and a.firmid eq c.firmid and b.myear eq c.myear;

**quit**;

Which will produce this, which I believe is right:

portid | firmid | myear | return | avgreturn |

1 | 120 | 200301 | 3.12 | -0.6425000 |

1 | 130 | 200301 | 2.13 | -0.6425000 |

1 | 100 | 200301 | 2.11 | -0.6425000 |

1 | 140 | 200301 | 3.67 | -0.6425000 |

1 | 120 | 200302 | 6.81 | 3.2816667 |

1 | 130 | 200302 | 2.55 | 3.2816667 |

1 | 150 | 200302 | 3.21 | 3.2816667 |

1 | 140 | 200302 | 4.21 | 3.2816667 |

1 | 100 | 200302 | 5.23 | 3.2060000 |

2 | 330 | 200301 | -2.13 | 2.7575000 |

2 | 300 | 200301 | 1.11 | 2.7575000 |

2 | 340 | 200301 | -1.67 | 2.7575000 |

2 | 320 | 200301 | 0.12 | 2.7575000 |

2 | 300 | 200302 | 4.23 | -0.6425000 |

2 | 320 | 200302 | 3.81 | 4.4020000 |

2 | 350 | 200302 | 4.23 | 4.4020000 |

2 | 340 | 200302 | 2.21 | 4.4020000 |

2 | 330 | 200302 | 1.55 | 4.4020000 |

2 | 100 | 200302 | 3.66 | 4.1950000 |

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 10:51 AM

Dataset 'need' has 3,085 variables? I hope that's a typo. If it's not, it should be no surprise you're having a performance issue.

Get rid of the thousands of variables you don't need. Either use a DATA step with a KEEP statement to specify only the (maybe) couple dozen variables you require before the join/merge or instead of SELECT * in your query Select a list of the variables you require.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 04:50 PM

I probably don't understand the problem correct but, in the chance that I do, I think a simplified version of Daniel's code together with a small datastep can solve the problem. Let me/us know.

proc sql noprint;

create table want as

select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid

from return as a

left join

(select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by myear,bportid) as b

on a.myear=b.myear

where a.portid eq bportid

order by firmid,myear,portid

;

quit;

data want;

set want;

by firmid myear;

if not (first.myear and last.myear) then do;

sumreturn=sumreturn-return;

cntreturn=cntreturn-1;

end;

average=sumreturn/cntreturn;

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to art297

10-30-2012 06:23 PM

I sincerely thank each of you for sparing your valuable time to help me out. I tried Hash( ) approach, but my hard disk ran out of space. I am still trying to implement ghastly_kitten' suggestion. Daniel's code is very efficient (takes about 1 minute instead of 24+ hours) and it works correctly in the small data set, but for some reasons, in my real data set, it is creating more than necessary observations (961,896 is the number of rows in "return", but Daniel's code is creating 47,132,904 observations). Arthur's modification solves that problem, but the average for a given firmid still contains that firm's returns. Below is Daniel's code (modified by Arthur and slightly changed by me). The only remaining task is to exclude a stock's own return in the computation of its corresponding average return (a stock's average return should be calculated by averaging all different stocks in different portfolios. In other words, for portid=1 and firmid=100, I cannot have any stocks from portid=1 and I must also exclude firmid=100 since the empirical design allow a firmid to exist in more than one portid) The final result should match Daniel's output. I will be happy to provide more detail if you would like. I am also trying to figure out how to impose the condition that a firm's return should be excluded from its average (taking me some time to figure out how Daniel did and how I can incorporate that into Arthur's code), but I would really appreciate your help in figuring this out.

proc sql noprint;

create table want as

select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid

from return as a

left join

(select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn

from return group by myear,bportid) as b

on a.myear=b.myear

where a.portid ne bportid *this is where I made a slight correction;

order by firmid,myear,portid

;

quit;

data want;

set want;

by firmid myear;

if not (first.myear and last.myear) then do;

sumreturn=sumreturn-return;

cntreturn=cntreturn-1;

end;

average=sumreturn/cntreturn;

run;

proc sort data=want; by portid myear firmid; run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 06:27 PM

This is one of those cases where a macro and a loop may be more efficient than the BY method for calculating something, simply because of space and time.

1. Create your list of ID's and Years

2. Loop through doing each as required using a where clause to subset your data.

3. Calculate the stats and append to table, dropping intermediary tables

4. Output the end table.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to finans_sas

10-30-2012 07:02 PM

I thought the latest suggested code did what you wanted. What would help is an example dataset with the averages that you want to appear.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to art297

10-30-2012 07:33 PM

Below is what you have requested (based on the code I have). I hope this helps. The program works great, but takes a long time to process.

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 3.66

;

run;

proc transpose data=return out=need (drop=_name_);

var return;

id firmid;

by portid myear;

run;

data need; set need;

rename myear=myear2;

rename portid=portid2;

run;

proc sql;

create table want as

select *

from return, need

where return.myear=need.myear2 and return.portid ne need.portid2;

quit;

data want (drop=i _;;

set want;

array returns(*) _100--_350;

do i=1 to dim(returns);

if input(substr(vname(returns(i)),2),8.) eq firmid then

call missing(returns(i));

end;

average=mean(of returns(*));

run;

proc sort data=want; by portid myear firmid; run;