Hi All,
I'm running the code below on two data sets, but it's taking extremely long to process, and I can't tell what in my code is making it so slow. One data set has about 9,700 firms; the other has a list of firms connected to each of those 9,700 firms. Because of the slowness I tried running it for just firms 1-50, but even that had not finished after a day. The code gives no errors. Any advice on how to correct the slowness? Thanks in advance for the help. Regards, ejaa.
/***** getting the count of lpermno values in help *****/
proc sql noprint; select count(lpermno) into: A from ase.help ; quit ;
%put &A ;
/******** assign a sequence number to each observation **********/
Data ase.help ;
Set ase.help;
Retain seq 0 ;
Seq = seq + 1 ;
run ;
/*******Transpose help to get long format*********************/
proc transpose data=ase.help out=ase.help_trans;
by lpermno seq;
run;
Data ase.help_trans (Rename = (COL1 = companyid)) ;
Set ase.help_trans ;
Where COL1 ne . ;
Run ;
%let C1 = 1 ;
%let c2 = 50 ;
/***********Separating each set of firms attached to lpermno*************/
%macro rpt;
%Do R = &C1 %to &C2;
data ase.lp&R. (rename = (Seq = seq&R.)) ;
Set ase.help_trans ;
Where seq = &R ;
Run ;
proc sort data = ase.lp&R.; by companyid ; run ;
%end ;
%mend rpt ;
%rpt ;
/***********Merge separated sets to help1******************/
Proc sort data = ase.help1 ; By companyid ; run ;
%macro rpt1;
%Do R = &C1 %to &C2;
data ase.com_1 ;
Merge ase.lp1 - ase.lp&R. ;
By companyid ;
run ;
%end ;
%mend rpt1 ;
%rpt1 ;
Proc sort data = ase.com_1 ; By companyid ; run ;
data ase.com ;
Merge ase.help1(in= a drop = lpermno ) ase.com_1 ;
By companyid ;
If a ;
run ;
/**************Select distinct lpermno values from help********************/
%macro rpt2;
%Do R = &C1 %to &C2;
Proc sort data = ase.lp&R. nodupkey out = ase.f&R.; By lpermno ; run ;
%end ;
%mend rpt2 ;
%rpt2 ;
/***************************merge lpermno from help to help1 *****************************/
Proc sort data = ase.help1 ; By lpermno ; run ;
%macro rpt3;
%Do R = &C1 %to &C2;
Data ase.Com_long1 ;
merge ase.f1 - ase.f&R. ;
by lpermno ;
Run ;
%end ;
%mend rpt3 ;
%rpt3 ;
Data ase.Com_long ;
merge ase.help1 (in = a) ase.Com_long1 ;
by lpermno ;
If a ;
Run ;
/*************Getting different sets of firms out of com***********************/
%macro rpt4;
%Do R = &C1 %to &C2;
data ase.sum&R. ;
Set ase.com ;
where seq&R. = &R ;
Run ;
Proc sort data = ase.sum&R. ; By date ; run ;
%end ;
%mend rpt4 ;
%rpt4 ;
/*******getting mean volume by date in all the sets*************************/
%macro rpt5;
%Do R = &C1 %to &C2;
proc summary data = ase.sum&R. ;
by date;
var vol;
output out=ase.sum_&R. mean=beta_vol ;
run;
%end;
%mend rpt5 ;
%rpt5 ;
/********************separating lpermno into different sets from Com_long*********************************/
%macro rpt6;
%Do R = &C1 %to &C2;
data ase.price&R. ;
Set ase.Com_long ;
where seq&R. = &R ;
Run ;
%end ;
%mend rpt6 ;
%rpt6 ;
/***********************Merging matching lpermno and attached firms by date*********************************/
%macro rpt7 ;
%Do R = &C1 %to &C2;
data ase.final&R.(drop = _TYPE_ _FREQ_ _NAME_ ) ;
Merge ase.price&R. (in = a) ase.sum_&R. (in = b) ;
By date ;
If a and b ;
run ;
%end ;
%mend rpt7 ;
%rpt7 ;
/************************Getting regression of each set**********************************/
%macro rpt8;
%Do R = &C1 %to &C2;
proc reg data=ase.final&R. outest = ase.result&R. ;
By lpermno ;
model vol = beta_vol stk ;
run;
%end ;
%mend rpt8 ;
%rpt8 ;
/**********getting results together****************************/
%macro rpt9;
%Do R = &C1 %to &C2;
Data ase.reg1 ;
Set ase.result1 - ase.result&R. ;
run;
%end ;
%mend rpt9 ;
%rpt9 ;
/**********************END********************************/
Accepted Solutions
I couldn't agree more with @jklaverstijn, if you have all these columns.
Try this, which turns the table around. The join takes about the same time as the hash table on my example data.
Modify it to suit your data.
data VERT;                  /* one row per (firm, connected firm) pair */
  set HELP;
  array A[2] D1-D2;
  do I=1 to 2;
    D=A[I];
    if D then output;       /* skip missing links */
  end;
  keep COMPANYID LPERMNO D;
run;
proc sql;
  create table W as
  select source.COMPANYID, source.LPERMNO, source.DATE, source.VOLUME,
         mean(betavol.VOLUME) as BETAVOL
  from HELP1 source
  left join VERT links
    on source.COMPANYID = links.COMPANYID
  inner join HELP1 betavol
    on  betavol.COMPANYID = links.D
    and betavol.DATE      = source.DATE
  group by 1,2,3,4
  order by 1,2,3,4
  ;
quit;
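A note on the design: VERT turns the wide D1-D2 columns into one row per link, so a single join matches each firm to all of its connected firms, and the GROUP BY collapses the connected firms' volumes into the per-date mean. From W, the per-firm regressions could then run in one BY-group step instead of one PROC REG per firm. A minimal sketch (RESULTS is a hypothetical name, and it assumes STK_RETURNS is also carried through the select list above):
proc sort data=W;
  by LPERMNO DATE;
run;
proc reg data=W outest=RESULTS noprint;
  by LPERMNO;                          /* one regression per firm */
  model VOLUME = BETAVOL STK_RETURNS;  /* assumes STK_RETURNS was added to W */
run;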
If you can't get the program to finish, my advice would be to run it step by step until you find the step that doesn't finish. My guess is it'll be the PROC REG, but if you run it as I suggest, at least you'll know where to concentrate your efforts.
Examine your log and look at the "real time" lines after each step. The ones that show a high ratio of real time to CPU time are especially suspect and are candidates for improvement. Also explain your data in more detail: are they SAS data sets or tables in a DBMS? Do you use indexes? Maybe sorting will help? Performance is an elaborate field of expertise, and lots of info is required before anyone can help. Share a sample of your data and, above all, your logs in this forum. The first commandment of performance analysis is "Thou Shalt Read Log Files". Use the FULLSTIMER and MSGLEVEL=I options for more detailed info, and in the case of macros, use at least the MPRINT, SYMBOLGEN and MLOGIC options as well.
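For example, those options can be switched on like this:
options fullstimer msglevel=i;    /* detailed resource usage and informational notes in the log */
options mprint symbolgen mlogic;  /* trace macro-generated code and macro variable resolution */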
Hope this helps,
- Jan.
Just spot-checking the program, the design looks like it will take about 25 times as long as it needs to. Let me give you just a couple of examples, so you can rethink the process.
Example 1: %RPT
This step reads the data set ASE.HELP_TRANS 50 times, once for each iteration of the %DO loop. Better planning would read the data set just once, and would create all 50 data sets with one pass through the data. Just a brief example to give you the idea:
proc sort data=ase.help_trans;
  by companyid;
run;
data ase.lp1  (rename=(seq=seq1))
     /* ... lp2 through lp49 ... */
     ase.lp50 (rename=(seq=seq50));
  set ase.help_trans;
  if seq=1 then output ase.lp1;
  /* ... else if seq=2 through 49 ... */
  else if seq=50 then output ase.lp50;
run;
The use of macro language would get a little more complex than what you have now. Is that worth it, to cut out 95% of the time?
By sorting first, you can eliminate the need for 50 sorts (albeit on smaller data sets) later.
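A sketch of what that macro logic might look like (a hypothetical %SPLIT macro that generates the full 50-way data step in one pass, assuming ASE.HELP_TRANS as above):
%macro split(n);
  data
  %do R = 1 %to &n;
    ase.lp&R. (rename=(seq=seq&R.))
  %end;
  ;
    set ase.help_trans;
    %do R = 1 %to &n;
      if seq = &R then output ase.lp&R.;
    %end;
  run;
%mend split;
%split(50);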
Example 2: %RPT1
It's not clear whether this step contains a typo, because it re-creates the same data set 50 times: each iteration replaces ASE.COM_1 with a new version. If that's the intent, there is no need for the first 49 iterations. If it's a typo, the program is still inefficient. Assuming the output should actually be named ASE.COM_&R, the process should be simplified so that the interior of the loop reads:
data ase.com_&R ;
  merge ase.com_%eval(&R-1) ase.lp&R. ;
  by companyid ;
run ;
The first iteration would need a little extra attention, so that you don't merge a data set with itself. But once you get to the 50th iteration, there is no need to merge 50 data sets. Just merge 2 data sets (one of which is the merged result from the 49th iteration).
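A hedged sketch of how the loop could handle that (a hypothetical %ACCUMULATE macro; the first iteration just seeds the running result):
%macro accumulate(n);
  data ase.com_1;
    set ase.lp1;
  run;
  %do R = 2 %to &n;
    data ase.com_&R.;
      merge ase.com_%eval(&R-1) ase.lp&R.;
      by companyid;
    run;
  %end;
%mend accumulate;
%accumulate(50);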
In summary, these are just a couple of examples. The program was designed looking at only small pieces of the process and using macro language to repeat them 50 times. A better approach would look at the larger process, and design something that requires much less work on the computer's part.
How many rows are in your data sets?
The data set with the 9,700 firms has 19,725,636 rows, while the other has 7,927 rows. Thanks.
What @Astounding said.
You seem to be doing the same thing over and over again on tiny bits of data.
Please post a small data example of where you start from and what you want to obtain, and it can probably be derived in a few steps.
Hi ChrisNZ,
Thanks for the direction.
Please find attached sample data. Thanks in advance for the help. Regards
Please don't ask us to do all your work.
Please post a small data example of where you start from and what you want to obtain. Typically 100 observations is sufficient.
Thanks ChrisNZ.
Attached is a small sample of my dataset showing the nature of the data. In the end I want to obtain regression output for each firm. Thanks.
Feel free to not bother supplying the basic information that's necessary to answer your question, and to not get answers.
Here's an example of how it's done: https://communities.sas.com/t5/General-SAS-Programming/Creating-time-intervals-based-on-two-date-var...
Hi ChrisNZ.
Thanks once again.
Following your direction, this is what my dataset looks like:
Data set “help” lists the firms connected to each firm. d1-d2 are the companyids of the firms connected to, for instance, the firm with lpermno 10001.
CompanyID lpermno d1 d2
10675 10001 8433 1723398
620983 10002 576969 8433
959 10025 576969 1723398
9137 10028 2070824 18195
Data set “help1” contains daily volume and stk_returns for the 9,700 firms.
CompanyID lpermno date volume stk_returns
10675 10001 01/03/2000 142 0.40
10675 10001 01/04/2000 245 0.45
10675 10001 01/05/2000 285 0.42
10675 10001 01/07/2000 181 0.43
10675 10001 01/10/2000 149 0.44
620983 10002 01/03/2000 200 0.30
620983 10002 01/04/2000 280 0.35
620983 10002 01/05/2000 230 0.32
620983 10002 01/07/2000 190 0.33
620983 10002 01/10/2000 240 0.34
8433 10003 01/03/2000 300 0.10
8433 10003 01/04/2000 380 0.15
8433 10003 01/05/2000 330 0.12
8433 10003 01/07/2000 390 0.13
8433 10003 01/10/2000 340 0.14
576969 10004 01/03/2000 400 0.80
576969 10004 01/04/2000 480 0.85
576969 10004 01/05/2000 430 0.82
576969 10004 01/07/2000 490 0.83
576969 10004 01/10/2000 440 0.84
1723398 10000 01/03/2000 500 0.60
1723398 10000 01/04/2000 580 0.65
1723398 10000 01/05/2000 530 0.62
1723398 10000 01/07/2000 590 0.63
1723398 10000 01/10/2000 500 0.64
Final Table for firm with id 10001.
CompanyID lpermno date volume stk_returns beta_volume (obtained from firms connected to firm 10001)
10675 10001 01/03/2000 142 0.40 21
10675 10001 01/04/2000 245 0.45 30
10675 10001 01/05/2000 285 0.42 40
10675 10001 01/07/2000 181 0.43 84
10675 10001 01/10/2000 149 0.44 41
From data set help, I first create a portfolio of the firms connected to each firm, using data set help1, to obtain a portfolio of their volume called beta_volume (the mean of volume for each day). I then create a dataset for each firm to obtain the "final table" shown above. In the end, I run a regression for each firm where I regress volume on beta_volume and stk_returns. This is the work the code I posted earlier was doing. Thanks very much.
regards
ejaa.
Please provide a sample small enough that we can see how the value 21 (beta_volume) shown in your desired output is calculated. We can't see where it comes from here.
The beta_volume is the daily average (mean) volume of the firms connected to a specific firm. So, for instance, in the example given, firm 10001 is connected to two firms, 8433 and 1723398. Hence I obtain the average of the daily volume of these two firms and link it to firm 10001 by date to produce the final table used to run the regression.
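For example, using the sample data above: on 01/03/2000, firms 8433 and 1723398 have volumes 300 and 500, so beta_volume for firm 10001 on that date would be (300 + 500) / 2 = 400.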
In my earlier code, this is what I used to obtain the mean for the firms connected to a specific firm:
%macro rpt5;
%Do R = &C1 %to &C2;
proc summary data = ase.sum&R. ;
by date;
var vol;
output out=ase.sum_&R. mean=beta_vol ;
run;
%end;
%mend rpt5 ;
%rpt5 ;
Thanks very much.