Hi, greetings,
First of all, I sincerely thank all of you for saving my life many times (I wish I could pay you for help).
The difficulty I am currently facing is the following:
I am dealing with a large dataset of daily stock returns.
When a company experiences an event on a certain day, I want to extract three stock returns: on that day, the previous day, and the next day. So, if we designate the event day as day zero, the event window can be formalized as (-1, +1). I then calculate the mean of the three returns.
Based on the large daily data, I identified such event days across numerous companies by creating a dummy t (=1 on event days, missing otherwise). Then, using the following code,
data window;
  set daily_stock;
  by cusip;
  if t > . then do p=max(_n_-1, 1, p+1) to min(_n_+1, nobs);
    set daily_stock point=p nobs=nobs;
    output;
  end;
run;
I obtained the results below.
CUSIP | DATE | RET | t | Year | Month | newvar
00030710 | 2/24/2017 | -0.021 | . | 2017 | 2 | 1
00030710 | 2/27/2017 | 0.001 | 1 | 2017 | 2 | 1
00030710 | 2/28/2017 | 0.208 | . | 2017 | 2 | 1
00030710 | 5/2/2017 | 0.097 | . | 2017 | 5 | 2
00030710 | 5/3/2017 | -0.055 | 1 | 2017 | 5 | 2
00030710 | 5/4/2017 | 0.077 | . | 2017 | 5 | 2
00030710 | 8/1/2017 | 0.032 | . | 2017 | 8 | 3
00030710 | 8/2/2017 | -0.063 | 1 | 2017 | 8 | 3
00030710 | 8/3/2017 | 0.203 | . | 2017 | 8 | 3
00030710 | 10/31/2017 | -0.034 | . | 2017 | 10 | 4
00030710 | 11/1/2017 | -0.047 | 1 | 2017 | 11 | 4
00030710 | 11/2/2017 | 0.327 | . | 2017 | 11 | 4
00030710 | 2/20/2018 | -0.023 | . | 2018 | 2 | 5
00030710 | 2/21/2018 | 0.031 | 1 | 2018 | 2 | 5
00030710 | 2/22/2018 | 0.072 | . | 2018 | 2 | 5
00077R10 | 9/22/1998 | 0.014 | . | 1998 | 9 | 1
00077R10 | 9/23/1998 | 0.108 | 1 | 1998 | 9 | 1
00077R10 | 9/24/1998 | -0.308 | . | 1998 | 9 | 1
00077R10 | 12/2/1998 | -0.019 | . | 1998 | 12 | 2
00077R10 | 12/3/1998 | -0.125 | 1 | 1998 | 12 | 2
00077R10 | 12/4/1998 | 0.035 | . | 1998 | 12 | 2
00077R10 | 3/3/1999 | 0.010 | . | 1999 | 3 | 3
00077R10 | 3/4/1999 | 0.023 | 1 | 1999 | 3 | 3
00077R10 | 3/5/1999 | 0.088 | . | 1999 | 3 | 3
00077R10 | 4/29/1999 | 0.074 | . | 1999 | 4 | 3
00077R10 | 4/30/1999 | 0.018 | 1 | 1999 | 4 | 3
00077R10 | 5/3/1999 | 0.429 | . | 1999 | 5 | 3
From here, I need to create groups of three consecutive observations with "t" in the middle. In other words, I need to identify "t", the row above "t", and the row below "t" as one group.
At first, this task looked very simple. I created year and month variables and thought grouping by year-month would accomplish it. However, there are cases in which a three-day window spans more than one month. So I created "newvar", which increases by 1 only when the month jumps by 2 or more. But then again, there are three-day windows lying side by side in consecutive months (see CUSIP 00077R10 in 1999 over months 3, 4, and 5).
So, please give me a hand solving this. I want to be very rigorous and precise, since I have to submit the data and code together with my manuscript to a journal.
Many thanks in advance!!
Sincerely,
KS -,
Your code works but is likely slow, since it depends on direct access (instead of sequential) to support the SET with POINT=.
Assuming each event row (t=1) always has one observation before it and one after it for the same CUSIP:
This means you will always have complete triplets available. You can then use the code below both to filter down to the triplets and to assign NEWVAR, which (I presume) is meant to be a sequential event id that increases with each new triplet:
data want (drop=nxt_t);
  merge have
        have (firstobs=2 keep=t rename=(t=nxt_t));
  if (nxt_t=1 or t=1 or lag(t)=1);
  if nxt_t=1 then newvar+1;
run;
What do you actually mean by a "three-day window"?
Do you just mean the three observations in a row, even if they are many days apart?
Or do you mean three calendar dates? Most stock markets are closed on weekends and holidays. Do you know which dates to exclude?
If you just want to use the actual observations, try a look-back / look-ahead technique. Looking back is easy using the LAG() function. Looking ahead takes a little finagling. Here is a method using an extra SET statement.
data have ;
input CUSIP $ DATE :mmddyy. RET t ;
format date yymmdd10.;
cards;
00030710 2/24/2017 -0.021 . 2017 2 1
00030710 2/27/2017 0.001 1 2017 2 1
00030710 2/28/2017 0.208 . 2017 2 1
00030710 5/2/2017 0.097 . 2017 5 2
00030710 5/3/2017 -0.055 1 2017 5 2
00030710 5/4/2017 0.077 . 2017 5 2
00030710 8/1/2017 0.032 . 2017 8 3
00030710 8/2/2017 -0.063 1 2017 8 3
00030710 8/3/2017 0.203 . 2017 8 3
00030710 10/31/2017 -0.034 . 2017 10 4
00030710 11/1/2017 -0.047 1 2017 11 4
00030710 11/2/2017 0.327 . 2017 11 4
00030710 2/20/2018 -0.023 . 2018 2 5
00030710 2/21/2018 0.031 1 2018 2 5
00030710 2/22/2018 0.072 . 2018 2 5
00077R10 9/22/1998 0.014 . 1998 9 1
00077R10 9/23/1998 0.108 1 1998 9 1
00077R10 9/24/1998 -0.308 . 1998 9 1
00077R10 12/2/1998 -0.019 . 1998 12 2
00077R10 12/3/1998 -0.125 1 1998 12 2
00077R10 12/4/1998 0.035 . 1998 12 2
00077R10 3/3/1999 0.010 . 1999 3 3
00077R10 3/4/1999 0.023 1 1999 3 3
00077R10 3/5/1999 0.088 . 1999 3 3
00077R10 4/29/1999 0.074 . 1999 4 3
00077R10 4/30/1999 0.018 1 1999 4 3
00077R10 5/3/1999 0.429 . 1999 5 3
;
data want;
  set have;
  by cusip date;
  prev_ret = lag(ret);
  if first.cusip then prev_ret = .;
  set have (firstobs=2 keep=RET rename=(ret=next_ret))
      have (obs=1 drop=_all_);
  if last.cusip then next_ret = .;
  n_ret = n(prev_ret, ret, next_ret);
  mean_ret = mean(prev_ret, ret, next_ret);
run;
If you did want to use actual dates, then you could expand your existing data to fill in the gaps in dates by carrying the current value forward into the missing dates. Then apply the same look-ahead method to the resulting dataset.
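A possible sketch of that expansion (my own illustration using the HAVE dataset from above, not code posted in the thread): it reuses the firstobs=2 lookahead to find the next recorded date for the same CUSIP, then outputs filler rows for the skipped calendar days, carrying RET forward. I set t to missing on the filler rows on the assumption that inserted days are not event days.

```sas
data expanded (drop=next_date d);
  set have;
  by cusip;
  output;                                  /* the real observation */
  set have (firstobs=2 keep=date rename=(date=next_date))
      have (obs=1 drop=_all_);             /* keeps the step from stopping early */
  if not last.cusip then
    do d = date + 1 to next_date - 1;      /* the skipped calendar days */
      date = d;                            /* RET is carried forward    */
      t = .;                               /* assumed: filler days are not events */
      output;
    end;
run;
```

The look-ahead step shown earlier can then be applied to the expanded dataset.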
I have already extracted the closest two days surrounding the event day. The event day has t=1, and the previous and next days are the observations (with t missing) one row above and one row below it. So, t=1 is important.
The code I posted calculates the mean(RET) of (up to) three observations.
It did nothing to the T variable, so you can use T for the next steps in your analysis.
For example, you might delete the observations where T is missing.
Thank you so much, mkeintz!
Can you explain the line specified here?
have (firstobs=2 keep=t rename=(t=nxt_t));
I'd like to understand and learn it.
Supposing I am going to do the same thing, this time, for the window (-2,+2) surrounding t=1, should I simply replace the numbers like this?
data want (drop=nxt_t);
  merge have
        have (firstobs=4 keep=t rename=(t=nxt_t));
  if (nxt_t=1 or t=1 or lag2(t)=1);
  if nxt_t=1 then newvar+1;
run;
firstobs=4 instead of firstobs=2, and lag2(t)=1 instead of lag(t)=1, like I have done above?
What is your goal?
Say you had 10 observations. Here is how those observations are going to contribute.
MAIN   LAG2()   FIRSTOBS=4
  1      .          4
  2      .          5
  3      1          6
  4      2          7
  5      3          8
  6      4          9
  7      5         10
  8      6         10
  9      7         10
 10      8         10
If you actually want to compare 5 observations on "days" -2,-1,0,+1,+2 then you need something different.
Can you figure that out?
Thank you, Tom.
Based on mkeintz's code, I figured it out.
data Temp2;
  merge Temp1
        Temp1 (firstobs=2 keep=t rename=(t=nxt_t));
run;

data Temp3 (drop=nxt_t nxt2_t);
  merge Temp2
        Temp2 (firstobs=2 keep=nxt_t rename=(nxt_t=nxt2_t));
  if (nxt2_t=1 or nxt_t=1 or t=1 or lag(t)=1 or lag2(t)=1);
  if nxt2_t=1 then newvar+1;
run;
Although it is a primitive extension of his code, it works well.
Don't worry Tom.
POINT=? You must be thinking of some other code.
Note that many shops will have options set that will trigger an error if you try to use MERGE without a BY statement.
I am not sure how that finds the mean of anything. I guess the idea is to then run the results through PROC MEANS or some other step to do the actual analysis?
POINT=? You must be thinking of some other code.
I was thinking of the OP's code:
data window;
  set daily_stock;
  by cusip;
  if t > . then do p=max(_n_-1, 1, p+1) to min(_n_+1, nobs);
    set daily_stock point=p nobs=nobs;
    output;
  end;
run;
which created a data set that (I presume) the OP figured would need a second data step to generate event ids. A good opportunity to speed up the original code by doing both tasks in a single step, without the POINT=.
Note that many shops will have options set that will trigger an error if you try to use MERGE without a BY statement.
I am aware that SAS can be told to warn users about a MERGE without BY; I didn't know that could be upgraded to an ERROR. Regardless, those shops are taking away a robust technique for doing lookaheads.
This is especially true in the presence of BY groups. If you actually have BY groups, then a merge statement accompanied by BY would fail to preserve the lookahead enabled by firstobs=2. Once the second by group starts, the lookahead disappears. So my technique for lookahead in the presence of by-groups is to tie the BY statement to a SET and use MERGE without BY.
data want;
  set have (keep=id);
  by id;
  merge have
        have (firstobs=2 keep=a b c rename=(a=nxta b=nxtb c=nxtc));
  /* other code here that makes use of nxta, nxtb, nxtc, first.id and last.id */
run;
I am not sure how that finds the mean of anything. I guess the idea is to then run the results through PROC MEANS or some other step to do the actual analysis?
Yup.
So you are back to the method I posted.
You can now use the BY statement to create the FIRST. and LAST. variables so you know when the LAG and "LEAD" variables are valid.
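Putting those two pieces together, a minimal sketch (using cusip and ret from the example data; this combined step is my own illustration, not code posted in the thread): the SET with BY drives the FIRST./LAST. flags, the MERGE without BY supplies the lookahead, and the flags blank out the lag/lead values that crossed a group boundary.

```sas
data want;
  set have (keep=cusip);                  /* drives FIRST./LAST. flags  */
  by cusip;
  merge have
        have (firstobs=2 keep=ret rename=(ret=next_ret));
  prev_ret = lag(ret);
  if first.cusip then prev_ret = .;       /* LAG crossed a group boundary    */
  if last.cusip  then next_ret = .;       /* "lead" crossed a group boundary */
run;
```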
Thank you mkeintz!
Based on your code, I figured out how to run five-day windows.
data Temp2;
  merge Temp1
        Temp1 (firstobs=2 keep=t rename=(t=nxt_t));
run;

data Temp3 (drop=nxt_t nxt2_t);
  merge Temp2
        Temp2 (firstobs=2 keep=nxt_t rename=(nxt_t=nxt2_t));
  if (nxt2_t=1 or nxt_t=1 or t=1 or lag(t)=1 or lag2(t)=1);
  if nxt2_t=1 then newvar+1;
run;
Have a nice evening!
KS -,
data have;
  input CUSIP $ DATE :mmddyy. RET t;
  format date yymmdd10.;
cards;
00030710 2/24/2017 -0.021 . 2017 2 1
00030710 2/27/2017 0.001 1 2017 2 1
00030710 2/28/2017 0.208 . 2017 2 1
00030710 5/2/2017 0.097 . 2017 5 2
00030710 5/3/2017 -0.055 1 2017 5 2
00030710 5/4/2017 0.077 . 2017 5 2
00030710 8/1/2017 0.032 . 2017 8 3
00030710 8/2/2017 -0.063 1 2017 8 3
00030710 8/3/2017 0.203 . 2017 8 3
00030710 10/31/2017 -0.034 . 2017 10 4
00030710 11/1/2017 -0.047 1 2017 11 4
00030710 11/2/2017 0.327 . 2017 11 4
00030710 2/20/2018 -0.023 . 2018 2 5
00030710 2/21/2018 0.031 1 2018 2 5
00030710 2/22/2018 0.072 . 2018 2 5
00077R10 9/22/1998 0.014 . 1998 9 1
00077R10 9/23/1998 0.108 1 1998 9 1
00077R10 9/24/1998 -0.308 . 1998 9 1
00077R10 12/2/1998 -0.019 . 1998 12 2
00077R10 12/3/1998 -0.125 1 1998 12 2
00077R10 12/4/1998 0.035 . 1998 12 2
00077R10 3/3/1999 0.010 . 1999 3 3
00077R10 3/4/1999 0.023 1 1999 3 3
00077R10 3/5/1999 0.088 . 1999 3 3
00077R10 4/29/1999 0.074 . 1999 4 3
00077R10 4/30/1999 0.018 1 1999 4 3
00077R10 5/3/1999 0.429 . 1999 5 3
;

data temp;
  set have;
  n+1;
run;

data index;
  set temp;
  if t=1;
  keep cusip n;
run;

data index2;
  set index;
  want+1;
  do n=n-1 to n+1;
    output;
  end;
run;

data want;
  if _n_=1 then do;
    if 0 then set index2;
    declare hash h(dataset:'index2', hashexp:20);
    h.definekey('cusip','n');
    h.definedata('want');
    h.definedone();
  end;
  set temp;
  h.find();
run;