I have a data set "have". In the data set, there are three variables, permno(firm), mon(month) and ret(return). What I want to calculate is that for each firm each month, I would like to calculate a new variable (sumret3), which is the sum of the past three returns. I wrote the following codes.
proc sql;
create table want as
select distinct a.*, sum(b.ret) as sumret3
from have as a left join have as b
on a.permno=b.permno and (a.mon-3<=b.mon<a.mon)
group by a.permno, a.mon;
/*having count(*) =3;*/
quit;
Now the problem I have is that for each firm, some months may not have full three months' returns. Using the code above will give a new value for sumret3 even there is only one past return (Now the sum will be only one value). If there is only two previous returns for that permno, it will give the sum of two returns. I only want to get values for sumret3 if there are three previous returns present. If less than three returns, then give sumret3 a missing value.
I used a contraint /*having count(*) =3;*/. This works, but this will give less observations. For example, the final data set "want" should have 1 million observations. Without using the contraint, "want" has 1 million observations. With the constraint, "want" only has 600,000 observations. I believe this is due to using distinct option in the third line, which will only keep the first observation with missing data.
How to deal with this problem? Also, is there any other efficient methods to implement my idea?
Thanks.
Can you not just left join the data per above back to the original data to get full observations?
proc sql;
create table want as
select ORIGINAL_DATA.*,
ADD_DATA.*
from INITIAL_DATA ORGINAL DATA
left join (select distinct a.*, sum(b.ret) as sumret3
from have as a left join have as b
on a.permno=b.permno and (a.mon-3<=b.mon<a.mon)
group by a.permno, a.mon;
having count(*) =3)
on ...
quit;
It may not be the best way to do it though. You could also try 3 left joins to the main table:
proc sql;
create table WANT as
select BASE.PERMNO,
BASE.MON,
A.RET + B.RET + C.RET as SUMRET
from ORIGINAL_DATA BASE
left join ORIGINAL_DATA A
on BASE.PERMNO=A.PERMNO
and A.MON=BASE.MON-1
left join ORIGINAL_DATA B
on BASE.PERMNO=A.PERMNO
and A.MON=BASE.MON-2
...
quit;
Or maybe even datastep:
proc transpose data=original_data out=inter prefix=mon;
by permno;
var ret;
id mon; /* Assuming month is numeric */
run;
proc sql;
select count(*)-1
into :NUM_OBS
from SASHELP.VCOLUMNS
where LIBNAME="WORK"
and MEMNAME="INTER";
quit;
data want;
set inter;
array mon{&NUM_OBS.};
array sumrets{&NUM_OBS.-2};
do i=3 to &NUM_OBS.;
sumrets{i-}=sum(mon{i-2},mon{i-1},mon{i});
end;
run;
Many ways to approach the problem, just a matter of testing them and seeing which is quicker/fits into your coding easier.
You probably need to pre.process your data, to find the latest 3 returns.This could include order your data by month, counting observations (months) and select max 3 observations. Then go to the SQL summary step.
Review what you want to show with those figures.
Absolute returns on quarterly results? Then what you have noticed is a remark for information consumers.
A relative performance comparable to others? Use a mean. Decide on complete/incomplete month's quarters as a note.
A relative growth evolving in time to overall results. impute the overall mean as missing. You can manipulate the figures and pictures to achieve things as being liked by the targeted goals.
You'd better post some sample data . No data to test.
proc sql;
create table want as
select distinct a.*,(select sum(ret) from have where a.permno=permno and (a.mon-3<=mon<a.mon) ) as sumret3
from have as a ;
quit;
Xia Keshan
Thanks for the help. Here is some data
permno date ret mon
15560 | 19251231 | . | 23112 |
15560 | 19260130 | -0.009009 | 23113 |
15560 | 19260227 | -0.054545 | 23114 |
15560 | 19260331 | -0.298077 | 23115 |
15560 | 19260430 | 0.068493 | 23116 |
15560 | 19260528 | 0.000000 | 23117 |
15560 | 19260630 | -0.089744 | 23118 |
15560 | 19260731 | -0.070423 | 23119 |
15560 | 19260831 | 0.075758 | 23120 |
15560 | 19260930 | -0.070423 | 23121 |
15560 | 19261030 | -0.075758 | 23122 |
15560 | 19261130 | 0.114754 | 23123 |
15560 | 19261231 | 0.000000 | 23124 |
15560 | 19270131 | 0.058824 | 23125 |
15560 | 19270228 | -0.027778 | 23126 |
15560 | 19270331 | 0.014286 | 23127 |
15560 | 19270430 | -0.140845 | 23128 |
15560 | 19270531 | 0.295082 | 23129 |
15560 | 19270630 | -0.012658 | 23130 |
15560 | 19270730 | -0.025641 | 23131 |
I only use 20 observations for one firm(permno) as an example. Here I standardized mon as a number, it will not affect the use of it. Since I want to calculate the sum of ret ONLY IF three observations are available, sum(ret) should have values start from the 4th column. You code didn't show the desired results. Moreover, I am not sure why your code took incredibly long time to execute. Thanks for the help.
" I am not sure why your code took incredibly long time to execute."
That is because I am using sub-query of SQL.
OK. Since you posted some data to make the question more clear.
Assuming these month are consecutive . i.e. no missing month between two row .
data have; input permno date ret mon ; cards; 15560 19251231 . 23112 15560 19260130 -0.009009 23113 15560 19260227 -0.054545 23114 15560 19260331 -0.298077 23115 15560 19260430 0.068493 23116 15560 19260528 0.000000 23117 15560 19260630 -0.089744 23118 15560 19260731 -0.070423 23119 15560 19260831 0.075758 23120 15560 19260930 -0.070423 23121 15560 19261030 -0.075758 23122 15560 19261130 0.114754 23123 15560 19261231 0.000000 23124 15560 19270131 0.058824 23125 15560 19270228 -0.027778 23126 15560 19270331 0.014286 23127 15560 19270430 -0.140845 23128 15560 19270531 0.295082 23129 15560 19270630 -0.012658 23130 15560 19270730 -0.025641 23131 15561 19270430 -0.140845 23128 15561 19270531 0.295082 23129 15561 19270630 -0.012658 23130 15561 19270730 -0.025641 23131 ; run; proc sort data=have ;by permno mon;run; data want; set have; by permno; if first.permno then n=0; n+1; sumret3=sum(lag(ret),lag2(ret),lag3(ret)); if n lt 4 then sumret3=.; run;
Xia Keshan
Message was edited by: xia keshan
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.