A thought, though I don't know this for certain: does the BY statement implicitly assign values and perform IF-style checks on every observation, so that leaving it out would improve performance on large data sets?
The only two uses of the BY statement I can see here are that it 1) asserts that the input is sorted by GVIIDKEY and 2) records the sorted-by-GVIIDKEY property in the metadata for DOLLAR_VOLUME. If one can guarantee that the data set is already sorted before the data step runs, would it be better (faster) to drop the BY statement and convey the sort property explicitly via the SORTEDBY= data set option on DOLLAR_VOLUME?
data dollar_volume (sortedby=gviidkey);
  set work.filter_;
  if raw_return ne . then raw_return_ = abs(raw_return);
  /* by gviidkey; */

  /* Trading volume > shares outstanding => set trading volume to missing */
  if cshtrd > cshoc then cshtrd = .;

  /* Trading volume * unadjusted price (converted to USD) < $100 => set dollar volume to missing */
  /* This block is still under suspicion */
  if cshtrd*prccd_abs_ < 100 and n(cshtrd, prccd_abs_) = 2 then dollar_vol = .;
  else dollar_vol = cshtrd*prccd_abs_;

  label
    dollar_vol  = "daily dollar volume"
    raw_return_ = "abs of raw_return"
  ;
run;
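If it helps, here is a rough sketch of how one might check both points, assuming WORK.FILTER_ is supposed to be in GVIIDKEY order. The DATA _NULL_ step is my own addition, not part of the original code: it reads the input with a BY statement only so that SAS raises an error if the order is broken. PROC CONTENTS should then show the sort information recorded on DOLLAR_VOLUME. As far as I know, a sort declared through SORTEDBY= is reported as not validated, because SAS takes the assertion on trust.

```sas
/* Hypothetical pre-check (not in the original step): read the input with a
   BY statement and write nothing. The step stops with an error if
   WORK.FILTER_ is not actually in GVIIDKEY order. */
data _null_;
  set work.filter_;
  by gviidkey;
run;

/* Inspect the sort information stored in the output data set's metadata. */
proc contents data=dollar_volume;
run;
```

The pre-check costs one extra pass over the input, so it only makes sense as a one-off sanity check rather than as part of every run.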