KS99
Obsidian | Level 7

Hi, greetings, 

First of all, I sincerely thank all of you for saving my life many times (I wish I could pay you for your help). 

 

The current difficulty I am going through is the following: 

 

I am dealing with a large dataset of daily stock returns. 

When a company experiences an event on a certain day, I want to extract three stock returns: one each for that day, the previous day, and the next day. So, if we designate the event day as day zero, the event window can be written as (-1, +1). I then calculate the mean of the three returns. 
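For instance, for the first triplet in the output below (returns -0.021, 0.001, and 0.208 on days -1, 0, and +1), the window mean would be (-0.021 + 0.001 + 0.208) / 3 ≈ 0.063. 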

 

Based on the large daily dataset, I identified such event days for numerous companies by creating a dummy t (t=1 on event days, missing otherwise). Then I used the following code, 

 

data window;
  set daily_stock;
  by cusip;
  if t > . then do p=max(_n_-1, 1, p+1) to min(_n_+1, nobs);
    set daily_stock point=p nobs=nobs;
    output;
  end;
run;

 

I obtained the results below. 

 

CUSIP     DATE        RET      t  Year  Month  newvar
00030710  2/24/2017   -0.021   .  2017   2     1
00030710  2/27/2017    0.001   1  2017   2     1
00030710  2/28/2017    0.208   .  2017   2     1
00030710  5/2/2017     0.097   .  2017   5     2
00030710  5/3/2017    -0.055   1  2017   5     2
00030710  5/4/2017     0.077   .  2017   5     2
00030710  8/1/2017     0.032   .  2017   8     3
00030710  8/2/2017    -0.063   1  2017   8     3
00030710  8/3/2017     0.203   .  2017   8     3
00030710  10/31/2017  -0.034   .  2017  10     4
00030710  11/1/2017   -0.047   1  2017  11     4
00030710  11/2/2017    0.327   .  2017  11     4
00030710  2/20/2018   -0.023   .  2018   2     5
00030710  2/21/2018    0.031   1  2018   2     5
00030710  2/22/2018    0.072   .  2018   2     5
00077R10  9/22/1998    0.014   .  1998   9     1
00077R10  9/23/1998    0.108   1  1998   9     1
00077R10  9/24/1998   -0.308   .  1998   9     1
00077R10  12/2/1998   -0.019   .  1998  12     2
00077R10  12/3/1998   -0.125   1  1998  12     2
00077R10  12/4/1998    0.035   .  1998  12     2
00077R10  3/3/1999     0.010   .  1999   3     3
00077R10  3/4/1999     0.023   1  1999   3     3
00077R10  3/5/1999     0.088   .  1999   3     3
00077R10  4/29/1999    0.074   .  1999   4     3
00077R10  4/30/1999    0.018   1  1999   4     3
00077R10  5/3/1999     0.429   .  1999   5     3

 

From here, I need to create groups of three consecutive observations with "t" in the middle. In other words, I need to identify the "t" row, the row above it, and the row below it as one group. 

 

At first, this task looked very simple. I created year and month variables and thought that grouping by year-month would accomplish it. However, there are cases in which a three-day window spans more than one month. So I created "newvar", which increases by 1 only when the month jumps by 2 or more. But then again, there are three-day windows lying side by side in consecutive months (see CUSIP 00077R10 in 1999, spanning months 3, 4, and 5). 

 

So, please give me a hand in solving this. I want to be very rigorous and precise, since I have to submit the data and code together with my manuscript to a journal. 

 

Many thanks in advance!! 

 

Sincerely, 

KS -, 

 

 


13 REPLIES
Tom
Super User

What do you actually mean by a "three-day window"?

Do you just mean the three observations in a row, even if they are many days apart?

Do you mean three calendar dates? But most stock markets are closed on weekends and holidays. Do you know which dates to exclude?

 

If you just want to use the actual observations, try a look-back / look-ahead technique. Look-back is easy using the LAG() function. Look-ahead takes a little finagling. Here is a method using an extra SET statement.

data have ;
  /* list input reads only the first four fields on each line; the trailing
     Year/Month/newvar columns are ignored */
  input CUSIP $ DATE :mmddyy. RET t ;
  format date yymmdd10.;
cards;
00030710 2/24/2017 -0.021 . 2017 2 1
00030710 2/27/2017 0.001 1 2017 2 1
00030710 2/28/2017 0.208 . 2017 2 1
00030710 5/2/2017 0.097 . 2017 5 2
00030710 5/3/2017 -0.055 1 2017 5 2
00030710 5/4/2017 0.077 . 2017 5 2
00030710 8/1/2017 0.032 . 2017 8 3
00030710 8/2/2017 -0.063 1 2017 8 3
00030710 8/3/2017 0.203 . 2017 8 3
00030710 10/31/2017 -0.034 . 2017 10 4
00030710 11/1/2017 -0.047 1 2017 11 4
00030710 11/2/2017 0.327 . 2017 11 4
00030710 2/20/2018 -0.023 . 2018 2 5
00030710 2/21/2018 0.031 1 2018 2 5
00030710 2/22/2018 0.072 . 2018 2 5
00077R10 9/22/1998 0.014 . 1998 9 1
00077R10 9/23/1998 0.108 1 1998 9 1
00077R10 9/24/1998 -0.308 . 1998 9 1
00077R10 12/2/1998 -0.019 . 1998 12 2
00077R10 12/3/1998 -0.125 1 1998 12 2
00077R10 12/4/1998 0.035 . 1998 12 2
00077R10 3/3/1999 0.010 . 1999 3 3
00077R10 3/4/1999 0.023 1 1999 3 3
00077R10 3/5/1999 0.088 . 1999 3 3
00077R10 4/29/1999 0.074 . 1999 4 3
00077R10 4/30/1999 0.018 1 1999 4 3
00077R10 5/3/1999 0.429 . 1999 5 3
;

data want;
  set have ;
  by cusip date;
  /* look back: previous return via LAG(), invalidated at the start of each cusip */
  prev_ret=lag(ret);
  if first.cusip then prev_ret=.;
  /* look ahead: read HAVE again offset by one observation; the one-obs,
     zero-variable copy keeps the step from stopping one row early */
  set have(firstobs=2 keep=RET rename=(ret=next_ret)) have(obs=1 drop=_all_);
  if last.cusip then next_ret=.;
  n_ret=n(prev_ret,ret,next_ret);
  mean_ret = mean(prev_ret,ret,next_ret);
run;
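If you then want one row per event, a follow-on subset of the t=1 rows would do it (my own illustrative step; EVENT_ROWS is a made-up name):

data event_rows;
  set want;
  if t=1;   /* the event-day row now carries MEAN_RET for the (-1,+1) window */
run;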

If you did want to use actual dates, then you could expand your existing data to fill in the gaps in days by carrying the current values forward into the missing dates. Then apply the same look-ahead method to the resulting dataset.
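That expansion could look something like the sketch below (my own illustration, assuming the data are sorted by CUSIP and DATE; DAILY_FILLED, NXT_CUSIP, and NXT_DATE are made-up names):

data daily_filled (drop=nxt_cusip nxt_date);
  merge have
        have (firstobs=2 keep=cusip date rename=(cusip=nxt_cusip date=nxt_date));
  output;                             /* the actual trading-day row */
  if cusip=nxt_cusip then
    do date=date+1 to nxt_date-1;     /* fill the calendar gap up to the next row */
      output;                         /* RET and t repeat the current row's values;
                                         set them to missing here if preferred */
    end;
run;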

KS99
Obsidian | Level 7

I have already extracted the two days closest to the event day. The event day has t=1, and the previous and next days are the rows with missing t immediately above and below the t=1 row. So, t=1 is important. 

 

Tom
Super User

The code I posted calculates the mean(RET) of (up to) three observations.

It did nothing to the T variable, so you can use T for the next steps in your analysis.

For example, you might delete the observations where T is missing.

 

mkeintz
PROC Star (Accepted Solution)

Your code works but is likely slow, since it depends on direct access (instead of sequential access) to support the SET with POINT=.

 

Assuming:

  1. Your data are sorted by CUSIP/DATE
  2. None of the t=1 returns occurs in the very first or very last observation of a given CUSIP
  3. No two t=1 returns are separated by fewer than two observations

This means you will always have complete triplets available. The code below both filters the data down to the triplets and assigns NEWVAR, a sequential event ID that increases by 1 with each new triplet:

data want (drop=nxt_t);
  /* self-merge offset by one observation: NXT_T is the next row's T (a look-ahead) */
  merge have 
        have (firstobs=2 keep=t rename=(t=nxt_t));
  /* keep a row when T=1 on this row, the next row, or the previous row */
  if (nxt_t=1 or t=1 or lag(t)=1);
  /* the first row of each triplet (the one just before the T=1 row) starts a new event id */
  if nxt_t=1 then newvar+1;
run;
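From there, the (-1,+1) mean return per event is one PROC MEANS step away (a minimal sketch; EVENT_MEANS and MEAN_RET are illustrative names):

proc means data=want noprint;
  by newvar;   /* NEWVAR is nondecreasing, so no extra sort is needed */
  var ret;
  output out=event_means (drop=_type_ _freq_) mean=mean_ret n=n_ret;
run;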

 

KS99
Obsidian | Level 7

Thank you so much, mkeintz! 

Can you explain the line specified here? 

have (firstobs=2 keep=t rename=(t=nxt_t));

I'd like to understand and learn it. 

 

 

KS99
Obsidian | Level 7

Supposing I am going to do the same thing, this time for the window (-2, +2) surrounding t=1, should I simply replace the numbers like this? 

data want (drop=nxt_t);
  merge have 
        have (firstobs=4 keep=t rename=(t=nxt_t));
  if (nxt_t=1  or t=1 or lag2(t)=1 ) ;
  if nxt_t=1 then newvar+1;
run;

firstobs=4 instead of firstobs=2, and lag2(t)=1 instead of lag(t)=1, as I did above? 

Tom
Super User Tom
Super User

What is your goal?

Say you had 10 observations.  Here is how those observations are going to contribute.

MAIN  LAG2()  FIRSTOBS=4
  1     .         4
  2     .         5
  3     1         6
  4     2         7
  5     3         8
  6     4         9
  7     5        10
  8     6        10
  9     7        10
 10     8        10
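A quick way to reproduce this alignment (my own illustration, using a made-up NUMS dataset):

data nums;
  do main=1 to 10;
    output;
  end;
run;

data check;
  /* once the FIRSTOBS=4 copy is exhausted, FIRSTOBS4 retains its last value (10) */
  merge nums
        nums (firstobs=4 rename=(main=firstobs4));
  lag2_of_main=lag2(main);
run;

proc print data=check noobs;
run;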

If you actually want to compare 5 observations on "days" -2,-1,0,+1,+2 then you need something different.

Can you figure that out?

 

KS99
Obsidian | Level 7

Thank you, Tom, 

 

Based on mkeintz's code, I figured it out. 

 

data Temp2;
  merge Temp1
        Temp1 (firstobs=2 keep=t rename=(t=nxt_t));
run;

data Temp3 (drop=nxt_t nxt2_t);
  merge Temp2
        Temp2 (firstobs=2 keep=nxt_t rename=(nxt_t=nxt2_t));
  if (nxt2_t=1 or nxt_t=1 or t=1 or lag(t)=1 or lag2(t)=1);
  if nxt2_t=1 then newvar+1;
run;

 

Although it is a primitive extension of his code, it works well. 

Don't worry, Tom. 
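(Incidentally, the two DATA steps can be collapsed into one by merging a third copy of the data with FIRSTOBS=3, following the same pattern; a sketch, not part of the original posts:)

data Temp3 (drop=nxt_t nxt2_t);
  merge Temp1
        Temp1 (firstobs=2 keep=t rename=(t=nxt_t))    /* look ahead one row */
        Temp1 (firstobs=3 keep=t rename=(t=nxt2_t));  /* look ahead two rows */
  if (nxt2_t=1 or nxt_t=1 or t=1 or lag(t)=1 or lag2(t)=1);
  if nxt2_t=1 then newvar+1;
run;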

 

Tom
Super User

POINT=?   You must be thinking of some other code.

Note that many shops will have options set that will trigger an error if you try to use MERGE without a BY statement.

 

I am not sure how that finds the mean of anything.  I guess the idea is to then run the results through PROC MEANS or some other step to do the actual analysis?

mkeintz
PROC Star

@Tom 

POINT=?   You must be thinking of some other code.

I was thinking of the OP's code:

 

data window;
  set daily_stock;
  by cusip;
  if t > . then do p=max(_n_-1, 1, p+1) to min(_n_+1, nobs);
    set daily_stock point=p nobs=nobs;
    output;
  end;
run;

 

which created a data set that (I presume) the OP figured would need a second DATA step to generate event IDs. A good opportunity to speed up the original code by doing both tasks in a single step, without the POINT=.

 

Note that many shops will have options set that will trigger an error if you try to use MERGE without a BY statement.

I am aware that SAS can be told to WARN users of a MERGE without BY; I didn't know it could be upgraded to an ERROR. Regardless, those shops are taking away a robust technique for doing look-aheads.
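(For reference: this behavior is governed by the MERGENOBY= system option, which accepts NOWARN, WARN, or ERROR.)

options mergenoby=error;   /* treat any MERGE without a BY statement as an error */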

 

This is especially true in the presence of BY groups. If you actually have BY groups, then a MERGE statement accompanied by BY would fail to preserve the look-ahead enabled by FIRSTOBS=2: once the second BY group starts, the look-ahead disappears. So my technique for look-ahead in the presence of BY groups is to tie the BY statement to a SET and use MERGE without BY.

 

data want;
  set have (keep=id);   /* drives BY-group processing: supplies first.id and last.id */
  by id;
  merge have
        have (firstobs=2 keep=a b c rename=(a=nxta b=nxtb c=nxtc));
  /* other code here that makes use of nxta, nxtb, nxtc, first.id and last.id */
run;
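One extra guard line in that step (my addition, in the spirit of the FIRST./LAST. remark above) keeps the look-ahead from bleeding across BY groups:

if last.id then call missing(nxta, nxtb, nxtc);   /* the next row belongs to a different ID */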

 

 

 

I am not sure how that finds the mean of anything.  I guess the idea is to then run the results through PROC MEANS or some other step to do the actual analysis?


Yup.

Tom
Super User

So you are back to the method I posted. 

You can now use the BY statement to create the FIRST. and LAST. variables so you know when the LAG and "LEAD" variables are valid. 

KS99
Obsidian | Level 7

Thank you mkeintz! 

 

Based on your code, I figured out how to run five-day windows. 

 

data Temp2;
  merge Temp1
        Temp1 (firstobs=2 keep=t rename=(t=nxt_t));
run;

data Temp3 (drop=nxt_t nxt2_t);
  merge Temp2
        Temp2 (firstobs=2 keep=nxt_t rename=(nxt_t=nxt2_t));
  if (nxt2_t=1 or nxt_t=1 or t=1 or lag(t)=1 or lag2(t)=1);
  if nxt2_t=1 then newvar+1;
run;

 

Have a nice evening! 

KS -, 

 

Ksharp
Super User
data have ;
  input CUSIP $ DATE :mmddyy. RET t ;
  format date yymmdd10.;
cards;
00030710 2/24/2017 -0.021 . 2017 2 1
00030710 2/27/2017 0.001 1 2017 2 1
00030710 2/28/2017 0.208 . 2017 2 1
00030710 5/2/2017 0.097 . 2017 5 2
00030710 5/3/2017 -0.055 1 2017 5 2
00030710 5/4/2017 0.077 . 2017 5 2
00030710 8/1/2017 0.032 . 2017 8 3
00030710 8/2/2017 -0.063 1 2017 8 3
00030710 8/3/2017 0.203 . 2017 8 3
00030710 10/31/2017 -0.034 . 2017 10 4
00030710 11/1/2017 -0.047 1 2017 11 4
00030710 11/2/2017 0.327 . 2017 11 4
00030710 2/20/2018 -0.023 . 2018 2 5
00030710 2/21/2018 0.031 1 2018 2 5
00030710 2/22/2018 0.072 . 2018 2 5
00077R10 9/22/1998 0.014 . 1998 9 1
00077R10 9/23/1998 0.108 1 1998 9 1
00077R10 9/24/1998 -0.308 . 1998 9 1
00077R10 12/2/1998 -0.019 . 1998 12 2
00077R10 12/3/1998 -0.125 1 1998 12 2
00077R10 12/4/1998 0.035 . 1998 12 2
00077R10 3/3/1999 0.010 . 1999 3 3
00077R10 3/4/1999 0.023 1 1999 3 3
00077R10 3/5/1999 0.088 . 1999 3 3
00077R10 4/29/1999 0.074 . 1999 4 3
00077R10 4/30/1999 0.018 1 1999 4 3
00077R10 5/3/1999 0.429 . 1999 5 3
;

data temp;
 /* add a global observation number N */
 set have;
 n+1;
run;

data index;
 /* one row per event: its CUSIP and position N */
 set temp;
 if t=1;
 keep cusip n;
run;

data index2;
 /* expand each event to positions N-1, N, N+1, tagged with event id WANT */
 set index;
 want+1;
 do n=n-1 to n+1;
  output;
 end;
run;

data want;
 if _n_=1 then do;
  if 0 then set index2;   /* define WANT (and N) in the PDV without reading data */
  declare hash h(dataset:'index2',hashexp:20);
  h.definekey('cusip','n');
  h.definedata('want');
  h.definedone();
 end;
set temp;
/* keep only observations inside an event window; capturing the return code
   avoids log errors and stale WANT values on non-matching rows */
rc=h.find();
if rc=0;
drop rc;
run;
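As with the MERGE approach, WANT is a sequential event ID here, so a PROC MEANS step BY WANT (like the sketch earlier in the thread) would produce the three-day mean return per event.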
