Hi All,
I'm running the code below on two data sets, but it's taking extremely long to process, and I can't tell what in my code is making it so slow. One data set has about 9,700 firms; the other has a list of firms connected to each of those 9,700 firms. Because of the slowness I tried running it for just firms 1-50, but even that had not finished after a day. The code gives no errors. Any advice on how to correct the slowness? Thanks in advance for the help. Regards, ejaa.
/***** getting the count of lpermno values in help *****/
proc sql noprint; select count(lpermno) into: A from ase.help ; quit ;
%put &A ;
/******** assign a sequence number to each observation **********/
Data ase.help ;
Set ase.help;
Retain seq 0 ;
Seq = seq + 1 ;
run ;
/*******Transpose help to get long format*********************/
proc transpose data=ase.help out=ase.help_trans;
by lpermno seq;
run;
Data ase.help_trans (Rename = (COL1 = companyid)) ;
Set ase.help_trans ;
Where COL1 ne . ;
Run ;
%let C1 = 1 ;
%let c2 = 50 ;
/***********Separating each set of firms attached to lpermno*************/
%macro rpt;
%Do R = &C1 %to &C2;
data ase.lp&R. (rename = (Seq = seq&R.)) ;
Set ase.help_trans ;
Where seq = &R ;
Run ;
proc sort data = ase.lp&R.; by companyid ; run ;
%end ;
%mend rpt ;
%rpt ;
/***********Merge separated sets to help1******************/
Proc sort data = ase.help1 ; By companyid ; run ;
%macro rpt1;
%Do R = &C1 %to &C2;
data ase.com_1 ;
Merge ase.lp1 - ase.lp&R. ;
By companyid ;
run ;
%end ;
%mend rpt1 ;
%rpt1 ;
Proc sort data = ase.com_1 ; By companyid ; run ;
data ase.com ;
Merge ase.help1(in= a drop = lpermno ) ase.com_1 ;
By companyid ;
If a ;
run ;
/**************Select distinct lpermno values from help********************/
%macro rpt2;
%Do R = &C1 %to &C2;
Proc sort data = ase.lp&R. nodupkey out = ase.f&R.; By lpermno ; run ;
%end ;
%mend rpt2 ;
%rpt2 ;
/***************************merge lpermno from help to help1 *****************************/
Proc sort data = ase.help1 ; By lpermno ; run ;
%macro rpt3;
%Do R = &C1 %to &C2;
Data ase.Com_long1 ;
merge ase.f1 - ase.f&R. ;
by lpermno ;
Run ;
%end ;
%mend rpt3 ;
%rpt3 ;
Data ase.Com_long ;
merge ase.help1 (in = a) ase.Com_long1 ;
by lpermno ;
If a ;
Run ;
/*************Getting different sets of firms out of com***********************/
%macro rpt4;
%Do R = &C1 %to &C2;
data ase.sum&R. ;
Set ase.com ;
where seq&R. = &R ;
Run ;
Proc sort data = ase.sum&R. ; By date ; run ;
%end ;
%mend rpt4 ;
%rpt4 ;
/*******getting mean volume by date in all the sets*************************/
%macro rpt5;
%Do R = &C1 %to &C2;
proc summary data = ase.sum&R. ;
by date;
var vol;
output out=ase.sum_&R. mean=beta_vol ;
run;
%end;
%mend rpt5 ;
%rpt5 ;
/********************separating lpermno into different sets from Com_long*********************************/
%macro rpt6;
%Do R = &C1 %to &C2;
data ase.price&R. ;
Set ase.Com_long ;
where seq&R. = &R ;
Run ;
%end ;
%mend rpt6 ;
%rpt6 ;
/***********************Merging matching lpermno and attached firms by date*********************************/
%macro rpt7 ;
%Do R = &C1 %to &C2;
data ase.final&R.(drop = _TYPE_ _FREQ_ _NAME_ ) ;
Merge ase.price&R. (in = a) ase.sum_&R. (in = b) ;
By date ;
If a and b ;
run ;
%end ;
%mend rpt7 ;
%rpt7 ;
/************************Getting regression of each set**********************************/
%macro rpt8;
%Do R = &C1 %to &C2;
proc reg data=ase.final&R. outest = ase.result&R. ;
By lpermno ;
model vol = beta_vol stk ;
run;
%end ;
%mend rpt8 ;
%rpt8 ;
/**********getting results together****************************/
%macro rpt9;
%Do R = &C1 %to &C2;
Data ase.reg1 ;
Set ase.result1 - ase.result&R. ;
run;
%end ;
%mend rpt9 ;
%rpt9 ;
/**********************END********************************/
Accepted Solutions
I couldn't agree more with @jklaverstijn, if you have all these columns.
Try this, which turns the table around. The join takes about the same time as the hash table on my example data.
Modify it to suit your data.
data VERT;                  /* one row per (firm, connected firm) pair */
  set HELP;
  array A[2] D1-D2;
  do I=1 to 2;
    D=A[I];
    if D then output;       /* skip missing links */
  end;
  keep COMPANYID LPERMNO D;
run;
proc sql;
  create table W as
  select source.COMPANYID, source.LPERMNO, source.DATE, source.VOLUME,
         mean(betavol.VOLUME) as BETAVOL
  from HELP1 source
  left join VERT links
    on source.COMPANYID = links.COMPANYID
  inner join HELP1 betavol
    on  betavol.COMPANYID = links.D
    and betavol.DATE      = source.DATE
  group by 1,2,3,4
  order by 1,2,3,4
  ;
quit;
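A note on the design: VERT turns the wide D1-D2 columns into one row per link, so a single join matches each firm to all of its connected firms, and the GROUP BY collapses the connected firms' volumes into the per-date mean. From W, the per-firm regressions could then run in one BY-group step instead of one PROC REG per firm. A minimal sketch (RESULTS is a hypothetical name, and it assumes STK_RETURNS is also carried through the select list above):
proc sort data=W;
  by LPERMNO DATE;
run;
proc reg data=W outest=RESULTS noprint;
  by LPERMNO;                          /* one regression per firm */
  model VOLUME = BETAVOL STK_RETURNS;  /* assumes STK_RETURNS was added to W */
run;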
If you can't get the program to finish, my advice would be to run it step by step until you find the step that doesn't finish. My guess is it'll be the PROC REG, but if you run it as I suggest, at least you'll know where to concentrate your efforts.
Examine your log and look at the "real time" lines after each step. The ones that show a high ratio of real time to CPU time are especially suspect and are candidates for improvement. Also explain your data in more detail: are they SAS data sets or tables in a DBMS? Do you use indexes? Maybe sorting will help? Performance is an elaborate field of expertise, and lots of info is required before anyone can help. Share a sample of your data and, above all, your logs in this forum. The first commandment of performance analysis is "Thou Shalt Read Log Files". Use the FULLSTIMER and MSGLEVEL=I options for more detailed info, and in the case of macros, use at least the MPRINT, SYMBOLGEN and MLOGIC options as well.
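For example, those options can be switched on like this:
options fullstimer msglevel=i;    /* detailed resource usage and informational notes in the log */
options mprint symbolgen mlogic;  /* trace macro-generated code and macro variable resolution */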
Hope this helps,
- Jan.
Just spot-checking the program, the design looks like it will take about 25 times as long as it needs to. Let me give you just a couple of examples, so you can rethink the process.
Example 1: %RPT
This step reads the data set ASE.HELP_TRANS 50 times, once for each iteration of the %DO loop. Better planning would read the data set just once, and would create all 50 data sets with one pass through the data. Just a brief example to give you the idea:
proc sort data=ase.help_trans;
  by companyid;
run;
data ase.lp1  (rename=(seq=seq1))
     /* ... lp2 through lp49 ... */
     ase.lp50 (rename=(seq=seq50));
  set ase.help_trans;
  if seq=1 then output ase.lp1;
  /* ... else if seq=2 through 49 ... */
  else if seq=50 then output ase.lp50;
run;
The use of macro language would get a little more complex than what you have now. Is that worth it, to cut out 95% of the time?
By sorting first, you can eliminate the need for 50 sorts (albeit on smaller data sets) later.
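A sketch of what that macro logic might look like (a hypothetical %SPLIT macro that generates the full 50-way data step in one pass, assuming ASE.HELP_TRANS as above):
%macro split(n);
  data
  %do R = 1 %to &n;
    ase.lp&R. (rename=(seq=seq&R.))
  %end;
  ;
    set ase.help_trans;
    %do R = 1 %to &n;
      if seq = &R then output ase.lp&R.;
    %end;
  run;
%mend split;
%split(50);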
Example 2: %RPT1
It's not clear whether this step contains a typo, because it re-creates the same data set 50 times: each iteration replaces ASE.COM_1 with a new version. If that's the intent, there is no need for the first 49 iterations. If it's a typo, the program is still inefficient. Assuming the output should actually be named ASE.COM_&R, the process should be simplified so that the interior of the loop reads:
data ase.com_&R ;
  merge ase.com_%eval(&R-1) ase.lp&R. ;
  by companyid ;
run ;
The first iteration would need a little extra attention, so that you don't merge a data set with itself. But once you get to the 50th iteration, there is no need to merge 50 data sets. Just merge 2 data sets (one of which is the merged result from the 49th iteration).
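A hedged sketch of how the loop could handle that (a hypothetical %ACCUMULATE macro; the first iteration just seeds the running result):
%macro accumulate(n);
  data ase.com_1;
    set ase.lp1;
  run;
  %do R = 2 %to &n;
    data ase.com_&R.;
      merge ase.com_%eval(&R-1) ase.lp&R.;
      by companyid;
    run;
  %end;
%mend accumulate;
%accumulate(50);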
In summary, these are just a couple of examples. The program was designed looking at only small pieces of the process and using macro language to repeat them 50 times. A better approach would look at the larger process, and design something that requires much less work on the computer's part.
How many rows are in your data sets?
The data set with the 9,700 firms has 19,725,636 rows, while the other has 7,927 rows. Thanks.
What @Astounding said.
You seem to be doing the same thing over and over again on tiny bits of data.
Please post a small data example of where you start from and what you want to obtain, and it can probably be derived in a few steps.
Hi ChrisNZ,
Thanks for the direction.
Please find attached sample data. Thanks in advance for the help. Regards
Please don't ask us to do all your work.
Please post a small data example of where you start from and what you want to obtain. Typically 100 observations is sufficient.
Thanks ChrisNZ.
Attached is a small sample of my dataset showing the nature of the data. In the end I want to obtain regression output for each firm. Thanks.
Feel free to not bother supplying the basic information that's necessary to answer your question, and to not get answers.
Here's an example of how it's done: https://communities.sas.com/t5/General-SAS-Programming/Creating-time-intervals-based-on-two-date-var...
Hi ChrisNZ.
Thanks once again.
Following your direction, this is what my dataset looks like:
Data set “help” lists the firms connected to each firm. d1-d2 are the companyids of the firms connected to, for instance, the firm with lpermno 10001.
CompanyID lpermno d1 d2
10675 10001 8433 1723398
620983 10002 576969 8433
959 10025 576969 1723398
9137 10028 2070824 18195
Data set “help1” contains daily volume and stk_returns for the 9,700 firms.
CompanyID lpermno date volume stk_returns
10675 10001 01/03/2000 142 0.40
10675 10001 01/04/2000 245 0.45
10675 10001 01/05/2000 285 0.42
10675 10001 01/07/2000 181 0.43
10675 10001 01/10/2000 149 0.44
620983 10002 01/03/2000 200 0.30
620983 10002 01/04/2000 280 0.35
620983 10002 01/05/2000 230 0.32
620983 10002 01/07/2000 190 0.33
620983 10002 01/10/2000 240 0.34
8433 10003 01/03/2000 300 0.10
8433 10003 01/04/2000 380 0.15
8433 10003 01/05/2000 330 0.12
8433 10003 01/07/2000 390 0.13
8433 10003 01/10/2000 340 0.14
576969 10004 01/03/2000 400 0.80
576969 10004 01/04/2000 480 0.85
576969 10004 01/05/2000 430 0.82
576969 10004 01/07/2000 490 0.83
576969 10004 01/10/2000 440 0.84
1723398 10000 01/03/2000 500 0.60
1723398 10000 01/04/2000 580 0.65
1723398 10000 01/05/2000 530 0.62
1723398 10000 01/07/2000 590 0.63
1723398 10000 01/10/2000 500 0.64
Final Table for firm with id 10001.
CompanyID lpermno date volume stk_returns beta_volume (obtained from firms connected to firm 10001)
10675 10001 01/03/2000 142 0.40 21
10675 10001 01/04/2000 245 0.45 30
10675 10001 01/05/2000 285 0.42 40
10675 10001 01/07/2000 181 0.43 84
10675 10001 01/10/2000 149 0.44 41
From data set help, I first create a portfolio of the firms connected to each firm, using data set help1, to obtain a portfolio of their volume called beta_volume (the mean of volume for each day). I then create a dataset for each firm to obtain the "final table" shown above. In the end, I run a regression for each firm where I regress volume on beta_volume and stk_returns. This is the work the code I posted earlier was doing. Thanks very much.
regards
ejaa.
Please provide a sample small enough that we can see how the value 21 (beta_volume) shown in your desired output is calculated. We can't see where it comes from here.
The beta_volume is the daily average (mean) volume of the firms connected to a specific firm. So, for instance, in the example given, firm 10001 is connected to two firms, 8433 and 1723398. Hence I obtain the average of the daily volume of these two firms and link it to firm 10001 by date to produce the final table used to run the regression.
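For example, using the sample data above: on 01/03/2000, firms 8433 and 1723398 have volumes 300 and 500, so beta_volume for firm 10001 on that date would be (300 + 500) / 2 = 400.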
In my earlier code, this is what I used to obtain the mean for the firms connected to a specific firm:
%macro rpt5;
%Do R = &C1 %to &C2;
proc summary data = ase.sum&R. ;
by date;
var vol;
output out=ase.sum_&R. mean=beta_vol ;
run;
%end;
%mend rpt5 ;
%rpt5 ;
Thanks very much.