Solved: Re: Code running slow in SAS - Page 2

EJAA · Posted 10-31-2017 11:52 PM

Thanks ChrisNZ for the great efforts.

Please when we look into dataset "help" we see that firm 10001 is having firms with in colum d1 and d2 with companyid 8433 & 1723398. Thus these two firms are connected to firm 10001. On the beta volume sorry abt that, it should be as shown below. Thus the average daily volume of the two firms connected to firm 10001. Thanks once again.regards

Final Table for firm with id 10001.

CompanyID   lpermno           date              volume           stk_returns beta_volume (obtained from firms connected to firm 10001)
10675      10001          01/03/2000 142             0.40 400 (thus 300+500/2)
10675        10001           01/04/2000    245                 0.45           480
10675       10001           01/05/2000    285                 0.42   430
10675        10001     01/07/2000   181                    0.43   490

10675 10001 01/10/2000 149 0.44 420

ChrisNZ · Posted 11-01-2017 12:51 AM

We are getting there.

Please tell me how this performs:


data HELP;
 input COMPANYID   LPERMNO  D1  D2  ;                       
cards;
10675    10001   8433     1723398   
run;
 
data HELP1;
input COMPANYID   LPERMNO   DATE : mmddyy10.  VOLUME  STK_RETURNS ;
cards;
10675    10001     01/03/2000    142     0.40
10675    10001     01/04/2000    245     0.45         
10675    10001     01/05/2000    285     0.42
10675    10001     01/07/2000    181     0.43      
10675    10001     01/10/2000    149     0.44
8433     10003     01/03/2000    300     0.10
8433     10003     01/04/2000    380     0.15         
8433     10003     01/05/2000    330     0.12
8433     10003     01/07/2000    390     0.13      
8433     10003     01/10/2000    340     0.14
1723398  10000     01/03/2000    500     0.60
1723398  10000     01/04/2000    580     0.65         
1723398  10000     01/05/2000    530     0.62
1723398  10000     01/07/2000    590     0.63      
1723398  10000     01/10/2000    500     0.64
run;

data WANT;
  set HELP1;
  if _N_=1 then do;
    dcl hash H(dataset:'HELP');
    H.definekey('LPERMNO');
    H.definedata('D1','D2');
    H.definedone();
    dcl hash H1(dataset:'HELP1(rename=(VOLUME=BETA))');
    H1.definekey('COMPANYID','DATE');
    H1.definedata('BETA');
    H1.definedone();
    call missing(D1,D2,BETA);
  end;
  call missing(D1,D2,BETA,BETA1,BETA2);
  rc=H.find();
  rc=H1.find(key:D1,key:DATE );
  BETA1=BETA;
  rc=H1.find(key:D2,key:DATE );
  BETA2=BETA;
  AVG=(BETA1+BETA2)/2;
run;

I left the intermediate calculations in the final table.

COMPANYID	LPERMNO	DATE	VOLUME	STK_RETURNS	D1	D2	BETA	BETA1	BETA2	AVG
10675	10001	14612	142	0.40	8433	1723398	500	300	500	400
10675	10001	14613	245	0.45	8433	1723398	580	380	580	480
10675	10001	14614	285	0.42	8433	1723398	530	330	530	430
10675	10001	14616	181	0.43	8433	1723398	590	390	590	490
10675	10001	14619	149	0.44	8433	1723398	500	340	500	420

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 11-02-2017 12:39 AM

@EJAA So?

Using 8k and 20m rows respectively, this takes 1min 20s on my machine.

data HELP (keep=COMPANYID LPERMNO D1 D2)
     HELP1(keep=COMPANYID LPERMNO DATE VOLUME);
  do COMPANYID= 1 to 8000;
    LPERMNO=COMPANYID+100000;
    D1     =int(ranuni(0)*8000);
    D2     =int(ranuni(0)*8000);
    output HELP;
    do DATE='01jan2010'd to '01jan2017'd;
      VOLUME=int(ranuni(0)*999);
      output HELP1;
    end;
  end;
  format DATE date9.;
run;


data WANT;
  if _N_=1 then do;
    dcl hash H(dataset:'HELP');
    H.definekey('LPERMNO');
    H.definedata('D1','D2');
    H.definedone();
    dcl hash H1(dataset:'HELP1(rename=(VOLUME=BETA))');
    H1.definekey('COMPANYID','DATE');
    H1.definedata('BETA');
    H1.definedone();
  end;
  set HELP1;
  call missing(D1,D2,BETA,BETA1);
  RC    = H.find();
  RC    = H1.find(key:D1,key:DATE );
  BETA1 = BETA;
  RC    = H1.find(key:D2,key:DATE );
  AVG   = (BETA+BETA1)/2;
  keep COMPANYID LPERMNO DATE VOLUME AVG;
run;

NOTE: There were 8000 observations read from the data set WORK.HELP.
NOTE: There were 20464000 observations read from the data set WORK.HELP1.
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
5116 at 45:15
NOTE: There were 20464000 observations read from the data set WORK.HELP1.
NOTE: The data set WORK.WANT has 20464000 observations and 10 variables.
NOTE: Compressing data set WORK.WANT decreased size by 27.29 percent.
Compressed is 9101 pages; un-compressed would require 12517 pages.
NOTE: DATA statement used (Total process time):
real time 1:19.48
user cpu time 1:16.51
system cpu time 2.65 seconds
memory 1313268.82k
OS Memory 1341000.00k
Step Count 532 Switch Count 267

High-Performance SAS Coding - Third Edition

EJAA · Posted 11-02-2017 02:57 AM

Hi ChrisNZ

I sincerely apologies for the late update. I had few issues yesterday and was off hence me not updating you with reference to the codes you sent. Thanks very much for the understanding and really appreciate the great efforts.

Please just as you opinionated earlier on, we are getting there.

I have run the code and it worked perfectly on the sample above and is very fast of which i tried to follow the codes so i can modify the variables to enable me apply the codes to the bigger data i have but didnt work.

Please as i could only indicate the nature of the data when posting the question, I couldnt show much data but in my sample data HELP, the column runs from d1-d313 with most have value (companyid). Thus a specific firm could be linked to more than one firm. For instance a specific firm could be linked to just one while others two while others 20 etc. So firm with id 10001 is linked to more than two firms in the actual but couldnt show all as you requested i only post few sample for assessment and direction.Hence i have something like this below for data help

CompanyID   lpermno           d1                 d2 d3 d4 d5 d6 d7 d8 d9 d10
10675      10001          8433        1723398     2070824   18195
620983      10002          576969    8433
959            10025     576969      1723398
9137          10028     2070824 18195

8433 12540 576969

Also please in the data help1, aside this variables, i have other variables price, market_returns in addition to

CompanyID lpermno date volume stk_returns.

I have tried several times trying to follow your codes but still getting it wrong.

Please we are almost there and i seek for further direction to enable me apply on the main data as the codes work on the small sample data posted earlier. Thanks very much ChrisNZ for this great assistance.regards ejaa.

jklaverstijn · Posted 11-02-2017 09:06 AM

Hi Ejaa,

More a comment on the datamodel: whenever I see a table that has columns like your d1-d313 a get the feeling that a transposition would make any processing a lot easier and your dataset less sparse. Eg, what happens if there is need for a d314? Compare these approaches:

CompanyID lpermno   d1   d2   
0675      10001     8433 1723398

with

CompanyID  lpermno   d
0675       10001     8433
0675       10001     1723398

See this paper for a more indepth discussion.

It would be an easy transformation and could greatly help speed up or at least simplify your code.

Regards,

- Jan

EJAA · Posted 11-03-2017 09:51 AM

Hi jklaverstijn

Thanks for the paper and directions. Really helped and studied a lot. regards.

ChrisNZ · Posted 11-02-2017 05:28 PM

I couldn't agree more with @jklaverstijn if you have all these columns.

Try this which turns the table around. The join takes about the same time as the hash table on my example.

Modify it to suit your data.


data VERT;
  set HELP;
  array A[2] D1-D2;
  do I=1 to 2;
    D=A[I];
    if D then output;
  end;
  keep COMPANYID LPERMNO D;
run;  

proc sql ;
  create table W as 
  select source.COMPANYID, source.LPERMNO, source.DATE, source.VOLUME, mean(betavol.VOLUME) as BETAVOL 
  from HELP1     source
         left join 
       VERT      links
         on source.COMPANYID=links.COMPANYID
         inner join 
       HELP1     betavol
         on  betavol.COMPANYID=links.D
         and betavol.DATE     =source.DATE
  group by 1,2,3,4
  order by 1,2,3,4
;
quit;

High-Performance SAS Coding - Third Edition

EJAA · Posted 11-03-2017 09:49 AM

Hi ChrisNZ,

I followed your codes and directions closely and glad to notify you that it did work perfectly. The problem is now resolved and it worked really faster.Thanks very much ChrisNZ and all who helped solve the query.

regards ejaa.

Classroom Training Available!