Thanks ChrisNZ for the great efforts.
Please, when we look into dataset "help", we see that firm 10001 has the firms with companyid 8433 and 1723398 in columns d1 and d2, so these two firms are connected to firm 10001. On the beta volume, sorry about that, it should be as shown below: the average daily volume of the two firms connected to firm 10001. Thanks once again. Regards
Final table for the firm with id 10001:
CompanyID | lpermno | date | volume | stk_returns | beta_volume (avg. daily volume of the firms connected to firm 10001) |
---|---|---|---|---|---|
10675 | 10001 | 01/03/2000 | 142 | 0.40 | 400 (= (300+500)/2) |
10675 | 10001 | 01/04/2000 | 245 | 0.45 | 480 |
10675 | 10001 | 01/05/2000 | 285 | 0.42 | 430 |
10675 | 10001 | 01/07/2000 | 181 | 0.43 | 490 |
10675 | 10001 | 01/10/2000 | 149 | 0.44 | 420 |
We are getting there.
Please tell me how this performs:
data HELP;
input COMPANYID LPERMNO D1 D2 ;
cards;
10675 10001 8433 1723398
run;
data HELP1;
input COMPANYID LPERMNO DATE : mmddyy10. VOLUME STK_RETURNS ;
cards;
10675 10001 01/03/2000 142 0.40
10675 10001 01/04/2000 245 0.45
10675 10001 01/05/2000 285 0.42
10675 10001 01/07/2000 181 0.43
10675 10001 01/10/2000 149 0.44
8433 10003 01/03/2000 300 0.10
8433 10003 01/04/2000 380 0.15
8433 10003 01/05/2000 330 0.12
8433 10003 01/07/2000 390 0.13
8433 10003 01/10/2000 340 0.14
1723398 10000 01/03/2000 500 0.60
1723398 10000 01/04/2000 580 0.65
1723398 10000 01/05/2000 530 0.62
1723398 10000 01/07/2000 590 0.63
1723398 10000 01/10/2000 500 0.64
run;
data WANT;
set HELP1;
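* On the first observation, load the firm links (HELP) and the daily volumes (HELP1, with VOLUME renamed to BETA) into hash tables;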
if _N_=1 then do;
dcl hash H(dataset:'HELP');
H.definekey('LPERMNO');
H.definedata('D1','D2');
H.definedone();
dcl hash H1(dataset:'HELP1(rename=(VOLUME=BETA))');
H1.definekey('COMPANYID','DATE');
H1.definedata('BETA');
H1.definedone();
call missing(D1,D2,BETA);
end;
call missing(D1,D2,BETA,BETA1,BETA2);
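* Look up the two linked firms D1 and D2 for this LPERMNO, then fetch each linked firm's volume for the current date and average them;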
rc=H.find();
rc=H1.find(key:D1,key:DATE );
BETA1=BETA;
rc=H1.find(key:D2,key:DATE );
BETA2=BETA;
AVG=(BETA1+BETA2)/2;
run;
I left the intermediate calculations in the final table.
COMPANYID | LPERMNO | DATE | VOLUME | STK_RETURNS | D1 | D2 | BETA | BETA1 | BETA2 | rc | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|
10675 | 10001 | 14612 | 142 | 0.40 | 8433 | 1723398 | 500 | 300 | 500 | 0 | 400 |
10675 | 10001 | 14613 | 245 | 0.45 | 8433 | 1723398 | 580 | 380 | 580 | 0 | 480 |
10675 | 10001 | 14614 | 285 | 0.42 | 8433 | 1723398 | 530 | 330 | 530 | 0 | 430 |
10675 | 10001 | 14616 | 181 | 0.43 | 8433 | 1723398 | 590 | 390 | 590 | 0 | 490 |
10675 | 10001 | 14619 | 149 | 0.44 | 8433 | 1723398 | 500 | 340 | 500 | 0 | 420 |
@EJAA So?
Using 8k and 20m rows respectively, this takes 1min 20s on my machine.
data HELP (keep=COMPANYID LPERMNO D1 D2)
HELP1(keep=COMPANYID LPERMNO DATE VOLUME);
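* Simulate 8,000 firms, each linked to two random firms, with daily volumes from 2010 to 2017;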
do COMPANYID= 1 to 8000;
LPERMNO=COMPANYID+100000;
D1 =int(ranuni(0)*8000);
D2 =int(ranuni(0)*8000);
output HELP;
do DATE='01jan2010'd to '01jan2017'd;
VOLUME=int(ranuni(0)*999);
output HELP1;
end;
end;
format DATE date9.;
run;
data WANT;
if _N_=1 then do;
dcl hash H(dataset:'HELP');
H.definekey('LPERMNO');
H.definedata('D1','D2');
H.definedone();
dcl hash H1(dataset:'HELP1(rename=(VOLUME=BETA))');
H1.definekey('COMPANYID','DATE');
H1.definedata('BETA');
H1.definedone();
end;
set HELP1;
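* Same lookups as above: linked firms from H, then each linked firm's volume for the current date from H1;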
call missing(D1,D2,BETA,BETA1);
RC = H.find();
RC = H1.find(key:D1,key:DATE );
BETA1 = BETA;
RC = H1.find(key:D2,key:DATE );
AVG = (BETA+BETA1)/2;
keep COMPANYID LPERMNO DATE VOLUME AVG;
run;
NOTE: There were 8000 observations read from the data set WORK.HELP.
NOTE: There were 20464000 observations read from the data set WORK.HELP1.
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
5116 at 45:15
NOTE: There were 20464000 observations read from the data set WORK.HELP1.
NOTE: The data set WORK.WANT has 20464000 observations and 10 variables.
NOTE: Compressing data set WORK.WANT decreased size by 27.29 percent.
Compressed is 9101 pages; un-compressed would require 12517 pages.
NOTE: DATA statement used (Total process time):
real time 1:19.48
user cpu time 1:16.51
system cpu time 2.65 seconds
memory 1313268.82k
OS Memory 1341000.00k
Step Count 532 Switch Count 267
Hi ChrisNZ
I sincerely apologise for the late update. I had a few issues yesterday and was off, hence not updating you on the code you sent. Thanks very much for the understanding; I really appreciate the great effort.
Just as you said earlier, we are getting there.
I have run the code and it worked perfectly on the sample above, and it is very fast. I tried to follow the code so I could modify the variables and apply it to the bigger data I have, but it didn't work.
As I could only indicate the nature of the data when posting the question, I couldn't show much of it. In my real data HELP, the columns run from d1 to d313, most of which hold a value (a companyid). Thus a specific firm can be linked to more than one firm: one firm may be linked to just one other firm, another to two, another to 20, and so on. So the firm with id 10001 is actually linked to more than two firms, but I couldn't show them all, since you asked that I only post a small sample for assessment and direction. Hence I have something like this below for data help:
CompanyID | lpermno | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10 |
---|---|---|---|---|---|---|---|---|---|---|---|
10675 | 10001 | 8433 | 1723398 | 2070824 | 18195 | | | | | | |
620983 | 10002 | 576969 | 8433 | | | | | | | | |
959 | 10025 | 576969 | 1723398 | | | | | | | | |
9137 | 10028 | 2070824 | 18195 | | | | | | | | |
8433 | 12540 | 576969 | | | | | | | | | |
Also, in data help1, besides CompanyID, lpermno, date, volume and stk_returns, I have other variables: price and market_returns.
I have tried several times to follow your code but am still getting it wrong.
We are almost there, and I am asking for further direction so I can apply this to the main data, since the code works on the small sample data posted earlier. Thanks very much ChrisNZ for this great assistance. Regards, ejaa.
Hi Ejaa,
More of a comment on the data model: whenever I see a table that has columns like your d1-d313, I get the feeling that a transposition would make any processing a lot easier and your dataset less sparse. E.g., what happens if there is a need for a d314? Compare these approaches:
CompanyID lpermno d1 d2
10675 10001 8433 1723398
with
CompanyID lpermno d
10675 10001 8433
10675 10001 1723398
See this paper for a more in-depth discussion.
It would be an easy transformation and could greatly help speed up or at least simplify your code.
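For illustration, here is a minimal sketch of that reshaping with PROC TRANSPOSE, assuming the link columns really run D1-D313 and HELP has one row per firm (adjust the names and range to your data):
proc sort data=HELP out=HELP_SORTED;
by COMPANYID LPERMNO;
run;
proc transpose data=HELP_SORTED out=VERT_ALL(rename=(COL1=D));
by COMPANYID LPERMNO;
var D1-D313;
run;
data VERT;
set VERT_ALL;
if not missing(D); /* keep only the populated links */
keep COMPANYID LPERMNO D;
run;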
Regards,
- Jan
Hi jklaverstijn
Thanks for the paper and the directions. It really helped and I learned a lot. Regards.
I couldn't agree more with @jklaverstijn if you have all these columns.
Try this, which turns the table around. The join takes about the same time as the hash lookup on my example data.
Modify it to suit your data.
data VERT;
set HELP;
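* Output one row per (firm, linked firm) pair, skipping empty link columns;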
array A[2] D1-D2;
do I=1 to 2;
D=A[I];
if D then output;
end;
keep COMPANYID LPERMNO D;
run;
proc sql ;
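/* For each source firm and date, average the volumes of all linked firms that traded on that date */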
create table W as
select source.COMPANYID, source.LPERMNO, source.DATE, source.VOLUME, mean(betavol.VOLUME) as BETAVOL
from HELP1 source
left join
VERT links
on source.COMPANYID=links.COMPANYID
inner join
HELP1 betavol
on betavol.COMPANYID=links.D
and betavol.DATE =source.DATE
group by 1,2,3,4
order by 1,2,3,4
;
quit;
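If your real HELP table has the wider layout described above, only the array in the VERT step needs widening. A sketch, assuming the link columns run D1-D313:
data VERT;
set HELP;
array A[313] D1-D313; /* assumed range of link columns */
do I=1 to dim(A);
D=A[I];
if not missing(D) then output; /* keep only populated links */
end;
keep COMPANYID LPERMNO D;
run;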
Hi ChrisNZ,
I followed your code and directions closely and am glad to report that it worked perfectly. The problem is now resolved and it runs really fast. Thanks very much ChrisNZ and all who helped solve the query.
regards ejaa.