I want to drop stocks in a month with less than five non-zero returns. For example stock 1 has 7 daily returns in March 2018 but only three non-zero returns. Therefore, I want to delete daily returns of stock 1 for March 2018. I have millions of observations, so I would appreciate an efficient code. Please find the SAS data below.
Thanks a lot for your help.
Best,
Cheema
data have;
infile cards expandtabs truncover;
input stock date : yymmdd10. ret ;
format date yymmdd10.;
cards;
1 2018-03-01 0
1 2018-03-02 0.1
1 2018-03-03 0
1 2018-03-04 0.03
1 2018-03-05 0
1 2018-03-06 0.02
1 2018-03-07 0
2 2018-03-01 0.02
2 2018-03-02 0.03
2 2018-03-03 0.04
2 2018-03-04 0.05
2 2018-03-05 0.06
2 2018-03-06 0.07
2 2018-03-07 0.08
2 2018-03-08 0
1 2018-04-01 0.02
1 2018-04-02 0.03
1 2018-04-03 0.04
1 2018-04-04 0.05
1 2018-04-05 0.06
1 2018-04-06 0.07
1 2018-04-07 0.08
2 2018-04-01 0.02
2 2018-04-02 0.03
2 2018-04-03 0.04
2 2018-04-04 0.05
2 2018-04-05 0.06
2 2018-04-06 0.07
2 2018-04-07 0.08
2 2018-04-08 0
run;
I believe below code should perform reasonably well. I had to set the threshold count to 4 (and not 5) to return some rows using the sample data you've posted.
data have;
infile cards expandtabs truncover;
input stock date : yymmdd10. ret ;
format date yymmdd10.;
cards;
1 2018-03-01 0
1 2018-03-02 0.1
1 2018-03-03 0
1 2018-03-04 0.03
1 2018-03-05 0
1 2018-03-06 0.02
1 2018-03-07 0
2 2018-03-01 0.02
2 2018-03-02 0.03
2 2018-03-03 0.04
2 2018-03-04 0.05
2 2018-03-05 0.06
2 2018-03-06 0.07
2 2018-03-07 0.08
2 2018-03-08 0
1 2018-04-01 0.02
1 2018-04-02 0.03
1 2018-04-03 0.04
1 2018-04-04 0.05
1 2018-04-05 0.06
1 2018-04-06 0.07
1 2018-04-07 0.08
2 2018-04-01 0.02
2 2018-04-02 0.03
2 2018-04-03 0.04
2 2018-04-04 0.05
2 2018-04-05 0.06
2 2018-04-06 0.07
2 2018-04-07 0.08
2 2018-04-08 0
;
run;
%let treshold_cnt=4;
data want(drop=_:);
if _n_=1 then
do;
length _cnt 3;
_cnt=1;
dcl hash h1(suminc: '_cnt', multidata:'n');
h1.defineKey('stock','date');
h1.defineDone();
do i=1 to nobs;
set have nobs=nobs point=i;
date=intnx('month',date,0,'b');
if ret=0 then h1.ref();
end;
end;
set have;
h1.sum(key:stock, key:intnx('month',date,0,'b'), sum:_zero_ret_cnt);
if _zero_ret_cnt<&treshold_cnt then delete;
run;
I believe below code should perform reasonably well. I had to set the threshold count to 4 (and not 5) to return some rows using the sample data you've posted.
data have;
infile cards expandtabs truncover;
input stock date : yymmdd10. ret ;
format date yymmdd10.;
cards;
1 2018-03-01 0
1 2018-03-02 0.1
1 2018-03-03 0
1 2018-03-04 0.03
1 2018-03-05 0
1 2018-03-06 0.02
1 2018-03-07 0
2 2018-03-01 0.02
2 2018-03-02 0.03
2 2018-03-03 0.04
2 2018-03-04 0.05
2 2018-03-05 0.06
2 2018-03-06 0.07
2 2018-03-07 0.08
2 2018-03-08 0
1 2018-04-01 0.02
1 2018-04-02 0.03
1 2018-04-03 0.04
1 2018-04-04 0.05
1 2018-04-05 0.06
1 2018-04-06 0.07
1 2018-04-07 0.08
2 2018-04-01 0.02
2 2018-04-02 0.03
2 2018-04-03 0.04
2 2018-04-04 0.05
2 2018-04-05 0.06
2 2018-04-06 0.07
2 2018-04-07 0.08
2 2018-04-08 0
;
run;
%let treshold_cnt=4;
data want(drop=_:);
if _n_=1 then
do;
length _cnt 3;
_cnt=1;
dcl hash h1(suminc: '_cnt', multidata:'n');
h1.defineKey('stock','date');
h1.defineDone();
do i=1 to nobs;
set have nobs=nobs point=i;
date=intnx('month',date,0,'b');
if ret=0 then h1.ref();
end;
end;
set have;
h1.sum(key:stock, key:intnx('month',date,0,'b'), sum:_zero_ret_cnt);
if _zero_ret_cnt<&treshold_cnt then delete;
run;
data have;
infile cards expandtabs truncover;
input stock date : yymmdd10. ret ;
month=month(date);
format date yymmdd10.;
cards;
1 2018-03-01 0
1 2018-03-02 0.1
1 2018-03-03 0
1 2018-03-04 0.03
1 2018-03-05 0
1 2018-03-06 0.02
1 2018-03-07 0
2 2018-03-01 0.02
2 2018-03-02 0.03
2 2018-03-03 0.04
2 2018-03-04 0.05
2 2018-03-05 0.06
2 2018-03-06 0.07
2 2018-03-07 0.08
2 2018-03-08 0
1 2018-04-01 0.02
1 2018-04-02 0.03
1 2018-04-03 0.04
1 2018-04-04 0.05
1 2018-04-05 0.06
1 2018-04-06 0.07
1 2018-04-07 0.08
2 2018-04-01 0.02
2 2018-04-02 0.03
2 2018-04-03 0.04
2 2018-04-04 0.05
2 2018-04-05 0.06
2 2018-04-06 0.07
2 2018-04-07 0.08
2 2018-04-08 0
;
run;
proc sql;
create table want as
select *
from have
group by stock,month
having sum(ret ne 0)>=5 ;
quit;
You've asked for "millions of observations, so I would appreciate an efficient code".
I've made the assumption your source data is a SAS table. If so then the data step hash approach will outperform the SQL. Should your source data be in a database then the SQL is the option of choice.
If the SQL performs still good enough then I'd go for the SQL as besides of runtimes it's also always about code maintenance - and the SQL is certainly easier to understand and maintain.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.