KS99
Obsidian | Level 7

Hi, greetings, 

First of all, I sincerely thank all of you for saving my life many times (I wish I could pay you for your help). 

 

The current difficulty I am going through is the following: 

 

I am dealing with a large dataset of daily stock returns. 

When a company experiences an event on a certain day, I want to extract three stock returns: one each for that day, the previous day, and the next day. So, if we designate the event day as day zero, the event window can be written as (-1, +1). I then calculate the mean of the three returns. 
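For instance, for the first triplet in the output below (returns -0.021, 0.001, and 0.208 on days -1, 0, and +1), the window mean would be (-0.021 + 0.001 + 0.208) / 3 ≈ 0.063. 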

 

Based on the large daily dataset, I identified such event days for numerous companies by creating a dummy t (t=1 on event days, missing otherwise). Then I used the following code, 

 

data window;
  set daily_stock;
  by cusip;
  if t > . then do p=max(_n_-1, 1, p+1) to min(_n_+1, nobs);
    set daily_stock point=p nobs=nobs;
    output;
  end;
run;

 

I obtained the results below. 

 

CUSIP     DATE        RET      t  Year  Month  newvar
00030710  2/24/2017   -0.021   .  2017   2     1
00030710  2/27/2017    0.001   1  2017   2     1
00030710  2/28/2017    0.208   .  2017   2     1
00030710  5/2/2017     0.097   .  2017   5     2
00030710  5/3/2017    -0.055   1  2017   5     2
00030710  5/4/2017     0.077   .  2017   5     2
00030710  8/1/2017     0.032   .  2017   8     3
00030710  8/2/2017    -0.063   1  2017   8     3
00030710  8/3/2017     0.203   .  2017   8     3
00030710  10/31/2017  -0.034   .  2017  10     4
00030710  11/1/2017   -0.047   1  2017  11     4
00030710  11/2/2017    0.327   .  2017  11     4
00030710  2/20/2018   -0.023   .  2018   2     5
00030710  2/21/2018    0.031   1  2018   2     5
00030710  2/22/2018    0.072   .  2018   2     5
00077R10  9/22/1998    0.014   .  1998   9     1
00077R10  9/23/1998    0.108   1  1998   9     1
00077R10  9/24/1998   -0.308   .  1998   9     1
00077R10  12/2/1998   -0.019   .  1998  12     2
00077R10  12/3/1998   -0.125   1  1998  12     2
00077R10  12/4/1998    0.035   .  1998  12     2
00077R10  3/3/1999     0.010   .  1999   3     3
00077R10  3/4/1999     0.023   1  1999   3     3
00077R10  3/5/1999     0.088   .  1999   3     3
00077R10  4/29/1999    0.074   .  1999   4     3
00077R10  4/30/1999    0.018   1  1999   4     3
00077R10  5/3/1999     0.429   .  1999   5     3

 

From here, I need to create groups of three consecutive observations with "t" in the middle. In other words, I need to identify the "t" row, the row above it, and the row below it as one group. 

 

At first, this task looked very simple. I created year and month variables and thought that grouping by year-month would accomplish it. However, there are cases in which a three-day window spans more than one month. So I created "newvar", which increases by 1 only when the month jumps by 2 or more. But then again, there are three-day windows lying side by side in consecutive months (see CUSIP 00077R10 in 1999, spanning months 3, 4, and 5). 

 

So, please give me a hand in solving this. I want to be very rigorous and precise, since I have to submit the data and code together with my manuscript to a journal. 

 

Many thanks in advance!! 

 

Sincerely, 

KS -, 

 

 


13 REPLIES
Tom
Super User

What do you actually mean by a "three-day window"?

Do you just mean the three observations in a row, even if they are many days apart?

Do you mean three calendar dates? But most stock markets are closed on weekends and holidays. Do you know which dates to exclude?

 

If you just want to use the actual observations, try a look-back / look-ahead technique. Look-back is easy using the LAG() function. Look-ahead takes a little finagling. Here is a method using an extra SET statement.

data have ;
  /* list input reads only the first four fields on each line; the trailing
     Year/Month/newvar columns are ignored */
  input CUSIP $ DATE :mmddyy. RET t ;
  format date yymmdd10.;
cards;
00030710 2/24/2017 -0.021 . 2017 2 1
00030710 2/27/2017 0.001 1 2017 2 1
00030710 2/28/2017 0.208 . 2017 2 1
00030710 5/2/2017 0.097 . 2017 5 2
00030710 5/3/2017 -0.055 1 2017 5 2
00030710 5/4/2017 0.077 . 2017 5 2
00030710 8/1/2017 0.032 . 2017 8 3
00030710 8/2/2017 -0.063 1 2017 8 3
00030710 8/3/2017 0.203 . 2017 8 3
00030710 10/31/2017 -0.034 . 2017 10 4
00030710 11/1/2017 -0.047 1 2017 11 4
00030710 11/2/2017 0.327 . 2017 11 4
00030710 2/20/2018 -0.023 . 2018 2 5
00030710 2/21/2018 0.031 1 2018 2 5
00030710 2/22/2018 0.072 . 2018 2 5
00077R10 9/22/1998 0.014 . 1998 9 1
00077R10 9/23/1998 0.108 1 1998 9 1
00077R10 9/24/1998 -0.308 . 1998 9 1
00077R10 12/2/1998 -0.019 . 1998 12 2
00077R10 12/3/1998 -0.125 1 1998 12 2
00077R10 12/4/1998 0.035 . 1998 12 2
00077R10 3/3/1999 0.010 . 1999 3 3
00077R10 3/4/1999 0.023 1 1999 3 3
00077R10 3/5/1999 0.088 . 1999 3 3
00077R10 4/29/1999 0.074 . 1999 4 3
00077R10 4/30/1999 0.018 1 1999 4 3
00077R10 5/3/1999 0.429 . 1999 5 3
;

data want;
  set have ;
  by cusip date;
  /* look back: previous return via LAG(), invalidated at the start of each cusip */
  prev_ret=lag(ret);
  if first.cusip then prev_ret=.;
  /* look ahead: read HAVE again offset by one observation; the one-obs,
     zero-variable copy keeps the step from stopping one row early */
  set have(firstobs=2 keep=RET rename=(ret=next_ret)) have(obs=1 drop=_all_);
  if last.cusip then next_ret=.;
  n_ret=n(prev_ret,ret,next_ret);
  mean_ret = mean(prev_ret,ret,next_ret);
run;
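If you then want one row per event, a follow-on subset of the t=1 rows would do it (my own illustrative step; EVENT_ROWS is a made-up name):

data event_rows;
  set want;
  if t=1;   /* the event-day row now carries MEAN_RET for the (-1,+1) window */
run;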

If you did want to use actual dates, then you could expand your existing data to fill in the gaps in days by carrying the current values forward into the missing dates. Then apply the same look-ahead method to the resulting dataset.
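That expansion could look something like the sketch below (my own illustration, assuming the data are sorted by CUSIP and DATE; DAILY_FILLED, NXT_CUSIP, and NXT_DATE are made-up names):

data daily_filled (drop=nxt_cusip nxt_date);
  merge have
        have (firstobs=2 keep=cusip date rename=(cusip=nxt_cusip date=nxt_date));
  output;                             /* the actual trading-day row */
  if cusip=nxt_cusip then
    do date=date+1 to nxt_date-1;     /* fill the calendar gap up to the next row */
      output;                         /* RET and t repeat the current row's values;
                                         set them to missing here if preferred */
    end;
run;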

KS99
Obsidian | Level 7

I have already extracted the two days closest to the event day. The event day has t=1, and the previous and next days are the rows with missing t immediately above and below the t=1 row. So, t=1 is important. 

 

Tom
Super User

The code I posted calculates the mean(RET) of (up to) three observations.

It did nothing to the T variable, so you can use T for the next steps in your analysis.

For example, you might delete the observations where T is missing.

 

mkeintz
PROC Star (Accepted Solution)

Your code works but is likely slow, since it depends on direct access (instead of sequential access) to support the SET with POINT=.

 

Assuming:

  1. Your data are sorted by CUSIP/DATE
  2. None of the t=1 returns occurs in the very first or very last observation of a given CUSIP
  3. No two t=1 returns are separated by fewer than two observations

This means you will always have complete triplets available. The code below both filters the data down to the triplets and assigns NEWVAR, a sequential event ID that increases by 1 with each new triplet:

data want (drop=nxt_t);
  /* self-merge offset by one observation: NXT_T is the next row's T (a look-ahead) */
  merge have 
        have (firstobs=2 keep=t rename=(t=nxt_t));
  /* keep a row when T=1 on this row, the next row, or the previous row */
  if (nxt_t=1 or t=1 or lag(t)=1);
  /* the first row of each triplet (the one just before the T=1 row) starts a new event id */
  if nxt_t=1 then newvar+1;
run;
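From there, the (-1,+1) mean return per event is one PROC MEANS step away (a minimal sketch; EVENT_MEANS and MEAN_RET are illustrative names):

proc means data=want noprint;
  by newvar;   /* NEWVAR is nondecreasing, so no extra sort is needed */
  var ret;
  output out=event_means (drop=_type_ _freq_) mean=mean_ret n=n_ret;
run;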

 

KS99
Obsidian | Level 7

Thank you so much, mkeintz! 

Can you explain the line specified here? 

have (firstobs=2 keep=t rename=(t=nxt_t));

I'd like to understand and learn it. 

 

 

KS99
Obsidian | Level 7

Supposing I am going to do the same thing, this time for the window (-2, +2) surrounding t=1, should I simply replace the numbers like this? 

data want (drop=nxt_t);
  merge have 
        have (firstobs=4 keep=t rename=(t=nxt_t));
  if (nxt_t=1  or t=1 or lag2(t)=1 ) ;
  if nxt_t=1 then newvar+1;
run;

firstobs=4 instead of firstobs=2, and lag2(t)=1 instead of lag(t)=1, as I did above? 

Tom
Super User Tom
Super User

What is your goal?

Say you had 10 observations.  Here is how those observations are going to contribute.

MAIN  LAG2()  FIRSTOBS=4
  1     .         4
  2     .         5
  3     1         6
  4     2         7
  5     3         8
  6     4         9
  7     5        10
  8     6        10
  9     7        10
 10     8        10
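A quick way to reproduce this alignment (my own illustration, using a made-up NUMS dataset):

data nums;
  do main=1 to 10;
    output;
  end;
run;

data check;
  /* once the FIRSTOBS=4 copy is exhausted, FIRSTOBS4 retains its last value (10) */
  merge nums
        nums (firstobs=4 rename=(main=firstobs4));
  lag2_of_main=lag2(main);
run;

proc print data=check noobs;
run;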

If you actually want to compare 5 observations on "days" -2,-1,0,+1,+2 then you need something different.

Can you figure that out?

 

KS99
Obsidian | Level 7

Thank you, Tom, 

 

Based on mkeintz's code, I figured it out. 

 

data Temp2;
  merge Temp1
        Temp1 (firstobs=2 keep=t rename=(t=nxt_t));
run;

data Temp3 (drop=nxt_t nxt2_t);
  merge Temp2
        Temp2 (firstobs=2 keep=nxt_t rename=(nxt_t=nxt2_t));
  if (nxt2_t=1 or nxt_t=1 or t=1 or lag(t)=1 or lag2(t)=1);
  if nxt2_t=1 then newvar+1;
run;

 

Although it is a primitive extension of his code, it works well. 

Don't worry, Tom. 
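(Incidentally, the two DATA steps can be collapsed into one by merging a third copy of the data with FIRSTOBS=3, following the same pattern; a sketch, not part of the original posts:)

data Temp3 (drop=nxt_t nxt2_t);
  merge Temp1
        Temp1 (firstobs=2 keep=t rename=(t=nxt_t))    /* look ahead one row */
        Temp1 (firstobs=3 keep=t rename=(t=nxt2_t));  /* look ahead two rows */
  if (nxt2_t=1 or nxt_t=1 or t=1 or lag(t)=1 or lag2(t)=1);
  if nxt2_t=1 then newvar+1;
run;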

 

Tom
Super User

POINT=?   You must be thinking of some other code.

Note that many shops will have options set that will trigger an error if you try to use MERGE without a BY statement.

 

I am not sure how that finds the mean of anything.  I guess the idea is to then run the results through PROC MEANS or some other step to do the actual analysis?

mkeintz
PROC Star

@Tom 

POINT=?   You must be thinking of some other code.

I was thinking of the OP's code:

 

data window;
  set daily_stock;
  by cusip;
  if t > . then do p=max(_n_-1, 1, p+1) to min(_n_+1, nobs);
    set daily_stock point=p nobs=nobs;
    output;
  end;
run;

 

which created a data set that (I presume) the OP figured would need a second DATA step to generate event IDs. A good opportunity to speed up the original code by doing both tasks in a single step, without the POINT=.

 

Note that many shops will have options set that will trigger an error if you try to use MERGE without a BY statement.

I am aware that SAS can be told to WARN users of a MERGE without BY; I didn't know it could be upgraded to an ERROR. Regardless, those shops are taking away a robust technique for doing look-aheads.
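(For reference: this behavior is governed by the MERGENOBY= system option, which accepts NOWARN, WARN, or ERROR.)

options mergenoby=error;   /* treat any MERGE without a BY statement as an error */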

 

This is especially true in the presence of BY groups. If you actually have BY groups, then a MERGE statement accompanied by BY would fail to preserve the look-ahead enabled by FIRSTOBS=2: once the second BY group starts, the look-ahead disappears. So my technique for look-ahead in the presence of BY groups is to tie the BY statement to a SET and use MERGE without BY.

 

data want;
  set have (keep=id);   /* drives BY-group processing: supplies first.id and last.id */
  by id;
  merge have
        have (firstobs=2 keep=a b c rename=(a=nxta b=nxtb c=nxtc));
  /* other code here that makes use of nxta, nxtb, nxtc, first.id and last.id */
run;
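One extra guard line in that step (my addition, in the spirit of the FIRST./LAST. remark above) keeps the look-ahead from bleeding across BY groups:

if last.id then call missing(nxta, nxtb, nxtc);   /* the next row belongs to a different ID */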

 

 

 

I am not sure how that finds the mean of anything.  I guess the idea is to then run the results through PROC MEANS or some other step to do the actual analysis?


Yup.

Tom
Super User

So you are back to the method I posted. 

You can now use the BY statement to create the FIRST. and LAST. variables so you know when the LAG and "LEAD" variables are valid. 

KS99
Obsidian | Level 7

Thank you mkeintz! 

 

Based on your code, I figured out how to run five-day windows. 

 

data Temp2;
  merge Temp1
        Temp1 (firstobs=2 keep=t rename=(t=nxt_t));
run;

data Temp3 (drop=nxt_t nxt2_t);
  merge Temp2
        Temp2 (firstobs=2 keep=nxt_t rename=(nxt_t=nxt2_t));
  if (nxt2_t=1 or nxt_t=1 or t=1 or lag(t)=1 or lag2(t)=1);
  if nxt2_t=1 then newvar+1;
run;

 

Have a nice evening! 

KS -, 

 

Ksharp
Super User
data have ;
  input CUSIP $ DATE :mmddyy. RET t ;
  format date yymmdd10.;
cards;
00030710 2/24/2017 -0.021 . 2017 2 1
00030710 2/27/2017 0.001 1 2017 2 1
00030710 2/28/2017 0.208 . 2017 2 1
00030710 5/2/2017 0.097 . 2017 5 2
00030710 5/3/2017 -0.055 1 2017 5 2
00030710 5/4/2017 0.077 . 2017 5 2
00030710 8/1/2017 0.032 . 2017 8 3
00030710 8/2/2017 -0.063 1 2017 8 3
00030710 8/3/2017 0.203 . 2017 8 3
00030710 10/31/2017 -0.034 . 2017 10 4
00030710 11/1/2017 -0.047 1 2017 11 4
00030710 11/2/2017 0.327 . 2017 11 4
00030710 2/20/2018 -0.023 . 2018 2 5
00030710 2/21/2018 0.031 1 2018 2 5
00030710 2/22/2018 0.072 . 2018 2 5
00077R10 9/22/1998 0.014 . 1998 9 1
00077R10 9/23/1998 0.108 1 1998 9 1
00077R10 9/24/1998 -0.308 . 1998 9 1
00077R10 12/2/1998 -0.019 . 1998 12 2
00077R10 12/3/1998 -0.125 1 1998 12 2
00077R10 12/4/1998 0.035 . 1998 12 2
00077R10 3/3/1999 0.010 . 1999 3 3
00077R10 3/4/1999 0.023 1 1999 3 3
00077R10 3/5/1999 0.088 . 1999 3 3
00077R10 4/29/1999 0.074 . 1999 4 3
00077R10 4/30/1999 0.018 1 1999 4 3
00077R10 5/3/1999 0.429 . 1999 5 3
;

data temp;
 /* add a global observation number N */
 set have;
 n+1;
run;

data index;
 /* one row per event: its CUSIP and position N */
 set temp;
 if t=1;
 keep cusip n;
run;

data index2;
 /* expand each event to positions N-1, N, N+1, tagged with event id WANT */
 set index;
 want+1;
 do n=n-1 to n+1;
  output;
 end;
run;

data want;
 if _n_=1 then do;
  if 0 then set index2;   /* define WANT (and N) in the PDV without reading data */
  declare hash h(dataset:'index2',hashexp:20);
  h.definekey('cusip','n');
  h.definedata('want');
  h.definedone();
 end;
set temp;
/* keep only observations inside an event window; capturing the return code
   avoids log errors and stale WANT values on non-matching rows */
rc=h.find();
if rc=0;
drop rc;
run;
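As with the MERGE approach, WANT is a sequential event ID here, so a PROC MEANS step BY WANT (like the sketch earlier in the thread) would produce the three-day mean return per event.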
