Help using Base SAS procedures

Proc SQL with Large Data Sets

Accepted Solution Solved
Reply
Contributor
Posts: 57
Accepted Solution

Proc SQL with Large Data Sets

Dear All,

Below is a portion of my data set. The code below does what I need to accomplish (which greatly benefited from your help in the past): merge these two data sets (return and need), but make sure that myear is the same, but portid is not the same. The actual sizes of the data sets are 961,896 rows X 12 columns for return, and 15,600 rows X 3,085 columns for need. When I use the actual data, the program runs for over 24 hours and then says it has to stop due to lack of hard disk space (after creating a temporary file of about 300 gb). Is there a waPry to run this PROC SQL part more efficiently (or avoid it)?

Thank you

**************************************************************************************************;

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301  2.11

1 120 200301  3.12

1 130 200301  2.13

1 140 200301  3.67

1 100 200302  5.23

1 120 200302  6.81

1 130 200302  2.55

1 140 200302  4.21

1 150 200302  3.21

2 300 200301  1.11

2 320 200301  0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302  4.23

2 320 200302  3.81

2 330 200302  1.55

2 340 200302  2.21

2 350 200302  4.23

2 100 200302  3.66

;

run;

**************************************************************************************************************;

proc transpose data=return out=need (drop=_name_);

  var return;

  id firmid;

  by portid myear;

run;

************************************************************************************************************;

proc sql;

  create table want as

  select *

  from return a, need b

    where a.myear=b.myear and a.portid ne b.portid;

quit;

**********************************************************************************************************;


Accepted Solutions
Solution
‎10-31-2012 07:43 AM
Super Contributor
Posts: 474

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

OK, now your last sample is a good one.

I've updated my proposal (in blue) to suit the multiplicity of the data. It's actually just a matter of grouping the data. Again, as along as there is no dups by portid, firmid and myear, I think it will do what you need in a efficiently manner.

Logic is the same from my previous post, average the cross join (with the convenient clause) aggregates by portid/myear and firmid/myear subtracting the current return, which is not to be considered.

proc sql noprint;

create table want (drop = _:) as

select a.portid, a.firmid, a.myear, a.return,

       sum(b.sumreturn)-c.sumreturn+a.return as _sum, sum(b.cntreturn)-c.cntreturn+1 as _cnt,

       calculated _sum / calculated _cnt as avgreturn

       from

       return as a,

       (select portid, myear, sum(return) as sumreturn, count(return) as cntreturn from return group by portid, myear) as b,

       (select firmid, myear, sum(return) as sumreturn, count(return) as cntreturn from return group by firmid, myear) as c

where a.portid ne b.portid and a.myear eq b.myear and a.firmid eq c.firmid and b.myear eq c.myear

group by a.portid, a.firmid,  a.myear, a.return;

quit;


Cheers from Portugal.

Daniel Santos @ www.cgd.pt


View solution in original post


All Replies
Respected Advisor
Posts: 3,156

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

Hi,

Yes, Cartesian product will exhaust your resources. I haven't done much SQL join inside SAS, I usually do it via pass-through, which handles large datasets pretty well, but it would not help with your problem. So I will suggest you using Hash() approach. First make sure you have enough memory(computer RAM) to accommodate the entire 'need', then you may proceed with the following code:

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301 2.11

1 120 200301 3.12

1 130 200301 2.13

1 140 200301 3.67

1 100 200302 5.23

1 120 200302 6.81

1 130 200302 2.55

1 140 200302 4.21

1 150 200302 3.21

2 300 200301 1.11

2 320 200301 0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302 4.23

2 320 200302 3.81

2 330 200302 1.55

2 340 200302 2.21

2 350 200302 4.23

2 100 200302 3.66

;

run;

data want;

  if _n_=1 then do;

  if 0 then set need(rename=portid=np);

  declare hash h(dataset:'need(rename=portid=np)', multidata:'y');

h.definekey('myear');

h.definedata(all:'y');

h.definedone();

  end;

  set return;

  rc=h.find();

do rc=0 by 0 while (rc=0);

if portid ne np then output;

rc=h.find_next();

end;

drop np rc;

run;

Haikuo

Contributor
Posts: 41

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

Dear finans_sas,

could you please explain the target of your manipulation?

Whether you will use hash or not, the total design of calculation looks quite ineffective and strange.

What's the target?

You have the return of the firm (firmid), but what do you want to calculate by joining the transposed table for a certain year period and portid with it's source with different portid?


Maybe we will be able to propose a better design for you research?

Contributor
Posts: 57

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

Thank you, Haikuo and ghastly_kitten for your help. Below is what I want to accomplish;

My objective is to calculate average returns for firmid in each portid-myear pair by using returns from the same myear, but from different firmid and portid. For example,

the average return for:

firmid=100 in portid=2 and myear=200302 (observation # 19)=(6.81 + 2.55 + 4.21 + 3.21)/4=

firmid=120 in portid=1 and myear=200301 (observation #2)=(1.11 + 0.12 -2.13 -1.67)/4=

I would really appreciate your help with this problem.

Contributor
Posts: 41

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

Okay...But I'll need some information.

Let's assume you have a table with corteges <pid, fid, year, ret>.

The objective is to calculate

ret2 for each cortege by this mnemonic formula:

<pid', fid', y', ret'> - the cortege for which we calculate our ret2.

ret2 = avg(ret) over ( pid <> pid' , y == y', fid <> fid')

Here it will be usefull if you provide some euristics about the data.

Like "Portid could be only 1 to 5". That will help to find a better way of solving the task.

And moreover, if you'll have portid=3 in some row, will ret2 be calculated as an AVG through different portid's or it will be seperate for each value of portid<>portid' ?

Contributor
Posts: 41

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

Here's the scheme of calculation which could be useful.

The key idea is that calculating averages is easier than producing cartesian join.

Example:

firmid=100 in portid=2 and myear=200302 (observation # 19)=(6.81 + 2.55 + 4.21 + 3.21)/4=


here you can first calculate average for ret for all firmids.

Let us assume, that we have firmid = 100 for portid = 1 and myear = 200302 with some ret value R.

A = (6.81 + 2.55 + 4.21 + 3.21 + R)/5

N = 5 (amount of values used)

Now, as we now that R shouldn't be used for calculating real average (let's denote it B) we can do such a thing

B = (A * N - R) / (N - 1);


Now the real task is harder, since we can have more that 2 portids (but at the same time we assume that there is no entries with different ret for the same firmid portid myear), and we need to calc averages over different portids. But it's quite solveable in the same manner.


So my proposal is to use this scheme:

1. Calculate averages and count (or sum and count) for each group of

     proc sql;

          create table Averages as

          select

               portid as pid,

               myear,

               avg(ret) as A_pid,

               count(ret) as N_pid

          from

               return

          group by

               portid, myear

          order by

               portid, myear;

quit;

/* Now! if we need to have an average over different portid which is not equal to source (for example for firmid=100 and portid=1 there are different firms for portid = 2,3,4,5,... ; but we need the result without grouping by this portid's, then we need averages over all portid's within the same year */

     proc sql;

          create table Averages as

          select

               pid,

               myear,

               A_pid,

               N_pid,

               sum( A_pid * N_pid ) / sum (N_pid)  as total_average    /* that is just total average of all ret's; you don't need it, just to point the meaning */,

               (sum( A_pid * N_pid )  - A_pid * N_pid ) / ( sum (N_pid) - N_pid ) as AVG_RET_for_other_pids_same_year,

               sum (N_pid) - N_pid as NUM_of_terms_in_avg

          from

               Averages

          group by

               myear

          order by

               pid, myear;

     quit;

/* yes, the grouping is only within myear and it will work, and quite fast */

2. Now, to get ret2 for each firmid we need to make "an exclusion list" for each firmid.

As by calculating the target ret2 we don't need values of firmid at all, we calculate SUM (this will help to simplify formulas) and amount:

proc sql;

     create table Firmid_averages as

     select

          firmid as fid,

          myear,

          SUM(ret) as R_fid,

          count(*) as N_fid

     from

          return

     group by

          firmid, myear;

quit;

3. Finally, get ret2 (which you can LEFT join to source table after that, in order to compare ret and ret2)

proc sql;

     create table RETURN2 as

     select

          f.fid as firmid,

          p.pid as potid,

          f.myear,

          (p.AVG_RET_for_other_pids_same_year * p.NUM_of_terms_in_avg - f.R_fid) / (p.NUM_of_terms_in_avg - f.N_fid) as RET2

     from

          Firmid_averages f left join

          Averages p on

               f.myear = p.myear

      order by

             portid, myear, firmid;

quit;

4. Left join this to your source if you need to see the difference between ret and ret2.

... well, the code is a bit entangled. Actually, everything could be slightly simplified if you will use sums instead of averages. But I was afraid that the idea will not be clear from the simple code. So... I used this.

Super Contributor
Posts: 474

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

Hello.

If I got this right, it's just a matter of summing/counting the data by porid, myear and merge back with the original dataset with the correct join clause.

For this kind of data volume (under 1M) a simple two level SQL join will do the trick.

proc sql noprint;

create table want as

select a.portid, a.firmid, a.myear, a.return,(b.sumreturn/b.cntreturn) as avgreturn

       from

       return as a,

       (select portid, myear, sum(return) as sumreturn, count(return) as cntreturn

               from return group by portid, myear) as b

where a.portid <> b.portid and a.myear = b.myear;

quit;


Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Contributor
Posts: 41

Re: Proc SQL with Large Data Sets

Posted in reply to DanielSantos

DanielSantos,

If I got the task right, this is not that easy Smiley Happy

For each average return you need to exclude the firms with the same id from calculating averages.

Super Contributor
Posts: 474

Re: Proc SQL with Large Data Sets

Posted in reply to ghastly_kitten

OK ghastly_kitten, you're right I missed this requirement, as long as there is no duplicates I think it's just a matter of throwing firmid in a cross join and adjusting the calculation to exclude those records, like this:

proc sql noprint;

create table want as

select a.portid, a.firmid,

       (b.sumreturn-c.sumreturn+return)/(b.cntreturn-c.cntreturn+1) as avgreturn

       from

       return as a,

       (select portid, myear, sum(return) as sumreturn, count(return) as cntreturn

               from return group by portid, myear) as b,

       (select firmid, myear, sum(return) as sumreturn, count(return) as cntreturn

               from return group by firmid, myear) as c

where a.portid ne b.portid and a.myear eq b.myear and a.firmid eq c.firmid and b.myear eq c.myear;

quit;

Which will produce this, which I believe is right:

portidfirmidmyearreturnavgreturn
11202003013.12-0.6425000
11302003012.13-0.6425000
11002003012.11-0.6425000
11402003013.67-0.6425000
11202003026.813.2816667
11302003022.553.2816667
11502003023.213.2816667
11402003024.213.2816667
11002003025.233.2060000
2330200301-2.132.7575000
23002003011.112.7575000
2340200301-1.672.7575000
23202003010.122.7575000
23002003024.23-0.6425000
23202003023.814.4020000
23502003024.234.4020000
23402003022.214.4020000
23302003021.554.4020000
21002003023.664.1950000


Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Contributor
Posts: 69

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

Dataset 'need' has 3,085 variables?  I hope that's a typo.  If it's not, it should be no surprise you're having a performance issue.

Get rid of the thousands of variables you don't need.  Either use a DATA step with a KEEP statement to specify only the (maybe) couple dozen variables you require before the join/merge or instead of SELECT * in your query Select a list of the variables you require.

PROC Star
Posts: 7,492

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

I probably don't understand the problem correct but, in the chance that I do, I think a simplified version of Daniel's code together with a small datastep can solve the problem.  Let me/us know.

proc sql noprint;

  create table want as

    select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid

      from return as a

        left join

          (select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn

               from return group by myear,bportid) as b

             on a.myear=b.myear

               where a.portid eq bportid

                 order by firmid,myear,portid

  ;

quit;

data want;

  set want;

  by firmid myear;

  if not (first.myear and last.myear) then do;

    sumreturn=sumreturn-return;

    cntreturn=cntreturn-1;

  end;

  average=sumreturn/cntreturn;

run;

Contributor
Posts: 57

Re: Proc SQL with Large Data Sets

I sincerely thank each of you for sparing your valuable time to help me out. I tried Hash( ) approach, but my hard disk ran out of space. I am still trying to implement ghastly_kitten' suggestion. Daniel's code is very efficient (takes about 1 minute instead of 24+ hours) and it works correctly in the small data set, but for some reasons, in my real data set, it is creating more than necessary observations (961,896 is the number of rows in "return", but Daniel's code is creating 47,132,904 observations). Arthur's modification solves that problem, but the average for a given firmid still contains that firm's returns. Below is Daniel's code (modified by Arthur and slightly changed by me). The only remaining task is to exclude a stock's own return in the computation of its corresponding average return (a stock's average return should be calculated by averaging all different stocks in different portfolios. In other words, for portid=1 and firmid=100, I cannot have any stocks from portid=1 and I must also exclude firmid=100 since the empirical design allow a firmid to exist in more than one portid) The final result should match Daniel's output. I will be happy to provide more detail if you would like. I am also trying to figure out how to impose the condition that a firm's return should be excluded from its average (taking me some time to figure out how Daniel did and how I can incorporate that into Arthur's code), but I would really appreciate your help in figuring this out.

proc sql noprint;

  create table want as

    select distinct a.portid,a.firmid,a.myear,a.return,b.sumreturn, b.cntreturn,bportid

      from return as a

        left join

          (select portid as bportid, myear, sum(return) as sumreturn, count(return) as cntreturn

               from return group by myear,bportid) as b

             on a.myear=b.myear

               where a.portid ne bportid *this is where I made a slight correction;

                 order by firmid,myear,portid

  ;

quit;

data want;

  set want;

  by firmid myear;

  if not (first.myear and last.myear) then do;

    sumreturn=sumreturn-return;

    cntreturn=cntreturn-1;

  end;

average=sumreturn/cntreturn;

run;

proc sort data=want; by portid myear firmid; run;

Super User
Posts: 19,872

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

This is one of those cases where a macro and a loop may be more efficient than the BY method for calculating something, simply because of space and time.

1. Create your list of ID's and Years

2. Loop through doing each as required using a where clause to subset your data.

3. Calculate the stats and append to table, dropping intermediary tables

4. Output the end table.

PROC Star
Posts: 7,492

Re: Proc SQL with Large Data Sets

Posted in reply to finans_sas

I thought the latest suggested code did what you wanted.  What would help is an example dataset with the averages that you want to appear.

Contributor
Posts: 57

Re: Proc SQL with Large Data Sets

Below is what you have requested (based on the code I have). I hope this helps. The program works great, but takes a long time to process.

DATA return;

INPUT portid firmid myear return;

DATALINES;

1 100 200301  2.11

1 120 200301  3.12

1 130 200301  2.13

1 140 200301  3.67

1 100 200302  5.23

1 120 200302  6.81

1 130 200302  2.55

1 140 200302  4.21

1 150 200302  3.21

2 300 200301  1.11

2 320 200301  0.12

2 330 200301 -2.13

2 340 200301 -1.67

2 300 200302  4.23

2 320 200302  3.81

2 330 200302  1.55

2 340 200302  2.21

2 350 200302  4.23

2 100 200302  3.66

;

run;

proc transpose data=return out=need (drop=_name_);

  var return;

  id firmid;

  by portid myear;

run;

data need; set need;

rename myear=myear2;

rename portid=portid2;

run;

proc sql;

create table want as

  select *

  from return, need

  where return.myear=need.myear2 and return.portid ne need.portid2;

quit;

data want (drop=i _Smiley Happy;;

  set want;

  array returns(*) _100--_350;

  do i=1 to dim(returns);

    if input(substr(vname(returns(i)),2),8.) eq firmid then

      call missing(returns(i));

  end;

  average=mean(of returns(*));

run;

proc sort data=want;  by portid myear firmid; run;

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 25 replies
  • 757 views
  • 14 likes
  • 7 in conversation