Hello everybody,
I chose SAS for the technical analysis in my thesis.
I wrote the code shown below. I would like to rewrite it so that it is simple and well structured, the way a semi-professional programmer would write it. However, I do not have enough programming knowledge.
Here is my code:
***********************************
*STEP 1: ROUNDING TIME;
***********************************
;
data Sampledata87_RT;
set Sampledata87;
TRD_EVENT_TIME = INPUT(TRD_EVENT_TM,time16.);
TRD_EVENT_ROUNDED = ROUND(TRD_EVENT_TIME,'00:30't);
TRD_EVENT_ROUFOR = PUT(TRD_EVENT_ROUNDED,hhmm.);
***********************************
*STEP 2: CALCULATING INTRADAY VOLUME;
***********************************
;
CountedVOLUME = TRD_PR*TRD_TUROVR;
***********************************
*STEP 3: CALCULATING NORMALIZED VOLUME;
***********************************
;
*Denominator
/*Sort by TRD_STCK_CD and temporal variables.*/;
proc sort data=Sampledata87_RT out=Sampledata87_SumVol;
by TRD_EVENT_DT;
run;
/*Sum VOLUME until the last of each TRD_STCK_CD is reached.*/
data Sampledata87_SumVolSo;
set Sampledata87_SumVol;
by TRD_EVENT_DT
TRD_STCK_CD notsorted;
format TRD_STCK_CD $5.;
informat TRD_STCK_CD $5.;
retain tmp_volume_sum;
tmp_volume_sum + CountedVOLUME;
if last.TRD_STCK_CD then do;
DailyVolume = tmp_volume_sum;
call missing(tmp_volume_sum);
end;
drop tmp_:;
run;
*The numerator
/*Sum VOLUME until the last of each TRD_STCK_CD is reached.*/;
data Sampledata87_SumVolSo;
set Sampledata87_SumVolSo;
by TRD_EVENT_DT
TRD_STCK_CD
TRD_EVENT_ROUFOR notsorted;
retain tmp_intradayvolume_sum;
tmp_intradayvolume_sum + CountedVOLUME;
if last.TRD_EVENT_ROUFOR then do;
IntradayVolume = tmp_intradayvolume_sum;
call missing(tmp_intradayvolume_sum);
end;
drop tmp_:;
run;
* Another way for calculating daily volume based on data set;
/*
proc sql noprint;
create table sums as
select TRD_STCK_CD, TRD_EVENT_DT, sum(CountedVOLUME) as volume_sum
from Sampledata87_SumVolSo
group by TRD_STCK_CD, TRD_EVENT_DT;
create index TRD_STCK_CD on sums;
quit;
data Sampledata87_SumVolSo02;
set Sampledata87_SumVolSo;
by TRD_EVENT_DT
TRD_STCK_CD notsorted;
volume_sum = .;
if last.TRD_STCK_CD then
set sums key=TRD_STCK_CD;
run;
*/;
*Approach 1: Calculating Daily Volume by Data set;
*Division for calculating adjusted volume in approach 1;
proc sort data=sampledata87_sumvolso out=sampledata87_sumvolso;
by TRD_STCK_CD TRD_EVENT_DT;
run;
data sampledata87_adjvol;
do until(last.TRD_STCK_CD);
do until(last.TRD_EVENT_DT);
set sampledata87_sumvolso;
by TRD_STCK_CD TRD_EVENT_DT;
if first.TRD_STCK_CD then
n=0;
if first.TRD_EVENT_DT then
n+1;
if n>1 then
do;
if not missing(IntradayVolume) then
adjusted_volume=divide(IntradayVolume,temp);
else call missing(adjusted_volume);
end;
if last.TRD_EVENT_DT then
temp=dailyvolume;
output;
end;
end;
drop temp n;
run;
proc sort data = sampledata87_adjvol;
by TRD_EVENT_DT TRD_STCK_CD;
run;
*Approach 2: Calculating daily volume by merging tables;
*Changing name & format of table 2 for coordination;
data sampledata87_02;
set sampledata87_02;
Options VALIDVARNAME=ANY;
rename
instrument = TRD_STCK_CD
Trade_Date = TRD_EVENT_DT;
run;
*Merging tables;
proc sort data=Sampledata87_sumvolso; by TRD_EVENT_DT TRD_STCK_CD; run;
proc sort data=Sampledata87_02; by TRD_EVENT_DT TRD_STCK_CD; run;
data Sampledata87_02_Mer;
merge Sampledata87_sumvolso Sampledata87_02;
by TRD_EVENT_DT TRD_STCK_CD;
keep TRD_EVENT_DT TRD_EVENT_TM TRD_STCK_CD TRD_EVENT_ROUNDED TRD_EVENT_ROUFOR CountedVOLUME Volume IntradayVolume;
run;
*Division for calculating normalized volume in approach 2;
proc sort data=Sampledata87_02_Mer out=Sampledata87_02_Mer;
by TRD_STCK_CD TRD_EVENT_DT;
run;
data Sampledata87_02_Mer;
do until(last.TRD_STCK_CD);
do until(last.TRD_EVENT_DT);
set Sampledata87_02_Mer;
by TRD_STCK_CD TRD_EVENT_DT;
if first.TRD_STCK_CD then
n=0;
if first.TRD_EVENT_DT then
n+1;
if n>1 then
do;
if not missing(IntradayVolume) then
adjusted_volume=divide(IntradayVolume,temp);
else call missing(adjusted_volume);
end;
if last.TRD_EVENT_DT then
temp=volume;
output;
end;
end;
drop temp n;
run;
proc sort data = Sampledata87_02_Mer;
by TRD_EVENT_DT TRD_STCK_CD;
run;
***********************************
*STEP 4: REGRESSING DUMMY VARIABLES ON NORMALIZED VOLUME VARIABLE USING THE AUTOMATICALLY GENERATED DUMMY VARIABLE METHOD
***********************************
;
* Regression with dummy variables in approach 1;
* Regressing dummy variables on normalized volume variable using calculated volume;
proc genmod data=Sampledata87_adjvol;
class TRD_EVENT_ROUFOR / param=effect;
model adjusted_volume = TRD_EVENT_ROUFOR / noscale;
ods select ParameterEstimates;
run;
* Regression with dummy variables in approach 2;
* Regressing dummy variables on normalized volume variable using merged table;
proc genmod data=Sampledata87_02_mer;
class TRD_EVENT_ROUFOR / param=effect;
model adjusted_volume = TRD_EVENT_ROUFOR / noscale;
ods select ParameterEstimates;
run;
Please help me to think this out.
Thank you in advance for your help.
In terms of efficiency, you need to avoid doing things twice. For example, you copy a dataset just to rename a couple of variables and then you later sort it. You could either use PROC DATASETS to modify the original dataset, which avoids having to read and write the data just to rename the variables, or you could add the RENAME= dataset option to the input of your PROC SORT.
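A minimal sketch of both options, assuming sampledata87_02 sits in the WORK library. The output name sampledata87_02_sorted is made up for illustration. These are alternatives: run one or the other, not both, because after the first one the old variable names no longer exist.

```sas
/* Option A: rename in place. PROC DATASETS only edits the dataset     */
/* header, so the data itself is not read and rewritten.               */
/* (WORK library is an assumption.)                                    */
proc datasets library=work nolist;
  modify sampledata87_02;
  rename instrument = TRD_STCK_CD
         Trade_Date = TRD_EVENT_DT;
quit;

/* Option B: rename while sorting. The RENAME= option on the input     */
/* dataset is applied as the data is read, so the BY statement uses    */
/* the new names. (Output dataset name is hypothetical.)               */
proc sort data=sampledata87_02(rename=(instrument = TRD_STCK_CD
                                       Trade_Date = TRD_EVENT_DT))
          out=sampledata87_02_sorted;
  by TRD_EVENT_DT TRD_STCK_CD;
run;
```

Either way, the separate data step that only renames variables is no longer needed.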
Also avoid re-sorting datasets. Sorting can take a really long time, especially for large datasets.
For example, you sort and merge by TRD_STCK_CD TRD_EVENT_DT and then later re-sort by TRD_EVENT_DT TRD_STCK_CD. If you can process the data in the same order both times, you can avoid having to re-sort it.
In general, if your program runs, you could turn on the FULLSTIMER option, run the code, and then look for the steps that take the longest and concentrate on improving those first. There is not much sense in working hard to speed up something that only takes a second.
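For reference, turning the option on and off looks like this. The middle comment stands in for your own program.

```sas
options fullstimer;    /* write real time, CPU time, and memory use to the log for each step */

/* ... run the steps you want to time here ... */

options nofullstimer;  /* switch the detailed statistics off again */
```

After the run, the log shows a block of timing statistics under each DATA step and PROC, which makes the slow steps easy to spot.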
Not sure what kind of help you are asking for. Is your code documented as well as would be expected from a professional programmer? Yes!
Does it do what you want? Only you can answer that question!
Are there some things you could correct or simplify? There almost always are, even with production code from professional programmers! Some of them fall under coding preferences (that is, things that don't change the way a program runs, but which some of us expect to see in code). For example:
(1) You don't always end a data step with a run; statement. I always like to see such boundaries when reviewing code.
(2) While you use the implied sum statement, e.g.,
tmp_volume_sum + CountedVOLUME;
you also include a retain tmp_volume_sum statement. It's not needed, as the form you used automatically retains the variable.
(3) You have a couple of data steps where you don't take advantage of SAS's normal method of processing, e.g.:
data sampledata87_adjvol;
do until(last.TRD_STCK_CD);
do until(last.TRD_EVENT_DT);
set sampledata87_sumvolso;
by TRD_STCK_CD TRD_EVENT_DT;
Without seeing your data and testing whether your approach does anything differently, my guess is that something like the following does the same thing:
data sampledata87_adjvol;
set sampledata87_sumvolso;
by TRD_STCK_CD TRD_EVENT_DT;
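Spelled out a little further, a plain data step along those lines might look like the sketch below. It has not been run against the real data, and it assumes the input is already sorted by TRD_STCK_CD TRD_EVENT_DT, as in the original program. The one thing the nested loops did implicitly was keep temp alive from row to row, so here that needs an explicit retain.

```sas
/* Sketch only: intended to mirror the nested DO UNTIL version. */
data sampledata87_adjvol;
  set sampledata87_sumvolso;
  by TRD_STCK_CD TRD_EVENT_DT;

  retain temp;                        /* previous day's DailyVolume, kept across rows */

  if first.TRD_STCK_CD then do;       /* new stock: restart the day counter */
    n = 0;
    call missing(temp);
  end;

  if first.TRD_EVENT_DT then n + 1;   /* sum statement: n is retained automatically */

  /* From the second day onward, divide by the previous day's total.   */
  /* adjusted_volume is reset to missing on every row, so no ELSE      */
  /* branch is needed.                                                 */
  if n > 1 and not missing(IntradayVolume) then
    adjusted_volume = divide(IntradayVolume, temp);

  if last.TRD_EVENT_DT then temp = DailyVolume;  /* remember today's total for tomorrow */

  drop temp n;
run;
```

It would be worth comparing the result with the original output (for example with PROC COMPARE) before switching over.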
Art, CEO, AnalystFinder.com
How big is your data, and how long do your processes currently take?
Which parts are inefficient?
@aminkarimid wrote:
Thanks Reeza;
I don't know which parts are inefficient.
Please give me tips for rewriting my code, such as combining or omitting steps.
I'm going to strongly agree with @ballardw here. It's more important to fully understand your code, what it does, and how to change it than for it to be efficient. Since you're new to SAS and analytics, I would suggest making sure you understand what every single line of your code does. It may seem like overkill, but commenting each line is a good exercise. When you do this, you usually see where steps are redundant, because you're tracing the process. The other important thing is documentation, especially if you did any data manipulation outside of SAS.
You could likely reduce your use of proc sort. However, without seeing your data, one can't be sure. For one, the sort before you run the proc genmods at the end of your code doesn't seem to be needed.
Art, CEO, AnalystFinder.com
My $0.02
"Efficiency" is a slippery beast. You may need to define which behaviors between 1) run time; 2) disk space, network bandwidth or other constraint; 3) code writing and 4) code maintenance are more important.
Some code that is very efficient in run time may require lots of disk space or be somewhat difficult to understand (requiring much more time to maintain or change).
Simple code may take more time to run but is easier to maintain.