BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
rkdubey84
Obsidian | Level 7

I have intraday data for a given stock spanning over 3 months sorted by date and time.

 

My Data Looks Like:  

 

DTM1                               Tradeprice
03JUN2013:09:15:00       164.6
03JUN2013:09:15:00       167
03JUN2013:09:15:00       166
03JUN2013:09:15:01       164.15
03JUN2013:09:15:01       176.1
03JUN2013:09:15:01       169.3
03JUN2013:09:15:01       165

................................................
03JUN2013:15:29:00       180.7
03JUN2013:15:29:00       170

................................................
03JUN2013:15:29:59       180.3
04JUN2013:09:15:00       176
04JUN2013:09:15:00       175.5

................................................
04JUN2013:09:27:00       165.1

...............................................
04JUN2013:09:28:00       164.3

..............................................
05JUN2013:09:29:00       170

...............................................

 

Please Note: 

1. For every Second there are multiple Tradeprices

2. Data Moves Second by Second to next minute

3. Data Changes next Date as reprented by Variable DTM1

 

I want to Compute 1. Price Volatility (Standard Deviation) for Every Minute

                              2. Price Volatility for Every 5 Minutes

                              3. Return [(Last Price - First Price)/First Price *100 ]Volatility for Every Minute

                              2. Return Volatility for Every 5 Minutes

                              3. HI (Max Price / Return) & Low (Min Price / Return) for Every Minute and 5 minute  and Day interval.

 

 Attaching the Dataset for your reference.

 

Please Help. 

Thanks in Advance

Ritesh

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

For the 5-minute return volatility you want the std of 300 1-second returns, where you define 1-second return as

       [(last price within the second) - (first price within the second)] / (first price within the second).

 

Two points:

  1. Please realize that compounding those 300 1-second returns would NOT reproduce the 5 minute return.  And that's not because of computational precision issues.  It's because your definition would have neglected price movement between the last trade of second T-1 to the first trade of second T.   Maybe you should consider return for time T as
            [(closing price of second T)-(closing price of second T-1)]/(closing price of second T-1)
    After all that's how CRSP (center for research in stock prices) defines daily returns - it compares closing prices on successive days.
  2. That brings us back to how to treat holes in your data.  Do you really want to treat intervals in which there was no trading as missing, or should you treat them as intervals in which there was no price change - i.e. as intervals in which return is zero?   Now for highly liquid stocks (GOOG, MSFT, etc.) it doesn't matter because there will be very few holes.

    But for illiquid stocks, treating those seconds as having zero returns (i.e. duplicate prices) will usually lower std of price for stocks that have mixes of positive and negative returns (i.e. most stocks on most days), while treating them as missing will increase std.  Maybe you should exclude stocks with too many holes.   Or maybe, for those stocks, you should be using 5-second returns.  How did the research article you wish to replicate treat stocks with many intervals containing no trades?

    Of course if you measure returns from closing price to closing price, then when there is a hole, you also have a decision to make about calculating returns.  Let's say you have price for 13:01:01 following by a price for 13:01:04, a 3-second interval between closing prices.  Then you could either
    1. take the cube root of the 3-second return
             average_per_second_return  =    {1+ [close_price(13:01:04)-close_price(13:01:01)]/close_price(13:01:01)}** (1/3) - 1
      and assign it to each of the missing seconds,  or
    2. just assign the complete return to the 13:01:04 time stamp.

I would take option 2, which is equivalent to dropping all 1-second intervals in which there is no trading, as per the program below.

 

OK, back to the operational problem.

 

Price volatility for 1-minute intervals (up to 60 prices) and 5-minute intervals (up to 300 prices).  If it's a liquid stock, using average price per second, vs closing price, is just not going to have much impact on your price volatilty measure.  Unless you intend to start studying high frequency trading as it exists today, which makes whole-second time stamps archaic (trades and quotes are now time-stamped to the nanosecond).  So let's simplify your operational needs and use closing price for price volatility measures, which is also consistent with any likely definition of returns. 

 

Return volatility.  I'm going to use the closing-price to closing-price measure for return.

 

data need / view=need;
  set have end=end_of_have1;
  by stock dtm1;
  date=datepart(dtm1);  format date yymmddn8.;
  time=timepart(dtm1);  format time time8.;

  if last.dtm1;

  single_minute= time-mod(time,60); 
  five_minute=   time-mod(time,300);
  format single_minute five_minute time8. ;

  holding_period=dif(dtm1);   /* holding period in secs, almost always=1*/
  return=  dif(price)/lag(price);

  if dif(date)^=0 then do;
    holding_period=.;
    return=.;
  end;
run;

proc summary data=need;
  by stock date five_minute;
  class single_minute;
  output out=want   std(price)=price_vol   std(return)=return_vol;
run;

 Notes:

  1. I assume your data is sorted by stock dtm1.
  2. The data need /view=need creates identifiers for 1-minute intervals and 5-minute intervals.
  3. The "if last.dtm1" is a subsetting if which means only records with are the last for a given time stamp pass to the rest of the data step - i.e. we only keep closing prices.
  4. The "dif(x)" function is equivalent to "x-lag(x)".
  5. The "if dif(date) .." do group tests for a change in date.  If there is a change then this is the first record for a new data, and its return and holding_period should be set to missing.
  6. The PROC SUMMARY will make a data set with these variables.
    1. stock date and five_minute      ... the BY vars
    2. single_minute                          ... the class var
      1. Sometime the class var single_minute will be missing, which means that observation represents all the one-minute intervals for a given five-minute interval - i.e. summary data for five_minute
    3. price_vol and return_vol         ... std of price and return for combinations of the by vars and class var
    4. _TYPE_                       classification type.
      1. _TYPE_=0 means the class var (single_minute) is ignored, and that record is for the five-minute grouping.   single_minute will be set to missing for this record.
      2. _TYPE_=1 means the class var is applied, and the3 record provide statistics where unit of aggregation is one minute.
    5. _FREQ_              Number of observations used for the combination. 

You can ask proc summary for other stats as well.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

11 REPLIES 11
mkeintz
PROC Star

"Price volatility for every minute" has a clear meaning, i.e. the std dev of all prices within the minute (or within 5 minutes).

 

But what do you mean by "return volatility for every minute"?.  According to your definition, you calculate only one return per minute (based on last price - first price).  There is no return volatility.   Or do you mean you want

  1. The daily volatility of 1-minute (and 5-minute) returns, or
  2. Within minute (and within 5 minute) volatility of trade-by-trade returns

please clarify

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
rkdubey84
Obsidian | Level 7

HI @mkeintz, Good Afternoon!!

 

The authors of the research article that I am following have used the following equation for computing return volatility:

RVOL5min = SQRT[Sigma i = 1 to 300 (Ret i - Avg(Ret))^2/N-1]       {300 denotes Seconds for 5 minutes}

 

But I believe what you suggested sounds more apt to me i.e.; 1. Within minute (and within 5 minute) volatility of trade-by-trade returns.

 

Thanks

Ritesh

 

Reeza
Super User

How do you want to handle the multiple prices per second. 

rkdubey84
Obsidian | Level 7

HI @Reeza,

Either the first Price of that second or the average price of that second can be taken.

 

Thanks

Ritesh

mkeintz
PROC Star

For the 5-minute return volatility you want the std of 300 1-second returns, where you define 1-second return as

       [(last price within the second) - (first price within the second)] / (first price within the second).

 

Two points:

  1. Please realize that compounding those 300 1-second returns would NOT reproduce the 5 minute return.  And that's not because of computational precision issues.  It's because your definition would have neglected price movement between the last trade of second T-1 to the first trade of second T.   Maybe you should consider return for time T as
            [(closing price of second T)-(closing price of second T-1)]/(closing price of second T-1)
    After all that's how CRSP (center for research in stock prices) defines daily returns - it compares closing prices on successive days.
  2. That brings us back to how to treat holes in your data.  Do you really want to treat intervals in which there was no trading as missing, or should you treat them as intervals in which there was no price change - i.e. as intervals in which return is zero?   Now for highly liquid stocks (GOOG, MSFT, etc.) it doesn't matter because there will be very few holes.

    But for illiquid stocks, treating those seconds as having zero returns (i.e. duplicate prices) will usually lower std of price for stocks that have mixes of positive and negative returns (i.e. most stocks on most days), while treating them as missing will increase std.  Maybe you should exclude stocks with too many holes.   Or maybe, for those stocks, you should be using 5-second returns.  How did the research article you wish to replicate treat stocks with many intervals containing no trades?

    Of course if you measure returns from closing price to closing price, then when there is a hole, you also have a decision to make about calculating returns.  Let's say you have price for 13:01:01 following by a price for 13:01:04, a 3-second interval between closing prices.  Then you could either
    1. take the cube root of the 3-second return
             average_per_second_return  =    {1+ [close_price(13:01:04)-close_price(13:01:01)]/close_price(13:01:01)}** (1/3) - 1
      and assign it to each of the missing seconds,  or
    2. just assign the complete return to the 13:01:04 time stamp.

I would take option 2, which is equivalent to dropping all 1-second intervals in which there is no trading, as per the program below.

 

OK, back to the operational problem.

 

Price volatility for 1-minute intervals (up to 60 prices) and 5-minute intervals (up to 300 prices).  If it's a liquid stock, using average price per second, vs closing price, is just not going to have much impact on your price volatilty measure.  Unless you intend to start studying high frequency trading as it exists today, which makes whole-second time stamps archaic (trades and quotes are now time-stamped to the nanosecond).  So let's simplify your operational needs and use closing price for price volatility measures, which is also consistent with any likely definition of returns. 

 

Return volatility.  I'm going to use the closing-price to closing-price measure for return.

 

data need / view=need;
  set have end=end_of_have1;
  by stock dtm1;
  date=datepart(dtm1);  format date yymmddn8.;
  time=timepart(dtm1);  format time time8.;

  if last.dtm1;

  single_minute= time-mod(time,60); 
  five_minute=   time-mod(time,300);
  format single_minute five_minute time8. ;

  holding_period=dif(dtm1);   /* holding period in secs, almost always=1*/
  return=  dif(price)/lag(price);

  if dif(date)^=0 then do;
    holding_period=.;
    return=.;
  end;
run;

proc summary data=need;
  by stock date five_minute;
  class single_minute;
  output out=want   std(price)=price_vol   std(return)=return_vol;
run;

 Notes:

  1. I assume your data is sorted by stock dtm1.
  2. The data need /view=need creates identifiers for 1-minute intervals and 5-minute intervals.
  3. The "if last.dtm1" is a subsetting if which means only records with are the last for a given time stamp pass to the rest of the data step - i.e. we only keep closing prices.
  4. The "dif(x)" function is equivalent to "x-lag(x)".
  5. The "if dif(date) .." do group tests for a change in date.  If there is a change then this is the first record for a new data, and its return and holding_period should be set to missing.
  6. The PROC SUMMARY will make a data set with these variables.
    1. stock date and five_minute      ... the BY vars
    2. single_minute                          ... the class var
      1. Sometime the class var single_minute will be missing, which means that observation represents all the one-minute intervals for a given five-minute interval - i.e. summary data for five_minute
    3. price_vol and return_vol         ... std of price and return for combinations of the by vars and class var
    4. _TYPE_                       classification type.
      1. _TYPE_=0 means the class var (single_minute) is ignored, and that record is for the five-minute grouping.   single_minute will be set to missing for this record.
      2. _TYPE_=1 means the class var is applied, and the3 record provide statistics where unit of aggregation is one minute.
    5. _FREQ_              Number of observations used for the combination. 

You can ask proc summary for other stats as well.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Reeza
Super User

Did the solutions from your previous questions not work? I think we've seen several variations of the same question here, hopefully you've learned the methods and can expand on them. 

rkdubey84
Obsidian | Level 7
HI @Reeza,

Yes I am using my last solutions as well for other cases.

In the last solution, we had a single observation for each time periods. This time we have multiple ones. And also we need to transform the price variables into return variables.

Here I am stuck also because I am novice at SQL though, I can edit the basic codes as per requirement.

For Example: There is another measure of Volatility Parkinson's 1980 (Hi-Lo) Measure. And I thought of using the same if this doesn't work out. I have expanded the earlier code as below:

proc sql;
create table ACC as
select *,
(select Max(TradePrice) from ACC_ID where dtm1=a.dtm1+'00:01:00't ) as HI1,
(select Max(TradePrice) from ACC_ID where dtm1=a.dtm1+'00:05:00't ) as HI5,
(select Min(TradePrice) from ACC_ID where dtm1=a.dtm1+'00:01:00't ) as LO1,
(select Min(TradePrice) from ACC_ID where dtm1=a.dtm1+'00:05:00't ) as LO5,
(select Count(TradePrice)from ACC_ID where dtm1=a.dtm1+'00:01:00't ) as Count1,
(select Count(TradePrice)from ACC_ID where dtm1=a.dtm1+'00:05:00't ) as Count5
from ACC_ID as a;
quit;

Data ACC; Set ACC;
PVol_1min = SQRT(1/_Freq_*(log(Hi1/Lo1)**2));
PVol_5min = SQRT(1/_Freq_*(log(Hi5/Lo5)**2));
run;
rkdubey84
Obsidian | Level 7
I am running the above code for a data-set containing 708045 Observations for 2 variables. And Its Taking approx 41 minutes to run. Is it reasonable? Am I doing something wrong?
Reeza
Super User

No and no. Proc expand is more efficient. Wasn't one of the solutions previously using PROC Expand?

rkdubey84
Obsidian | Level 7
HI Reeza,


Proc Expand didn't work for me, I do have ETS Installed. It would be great if you could help / guide me with Proc Expand Code. Really appreciate your help.


Thanks

Ritesh
Ksharp
Super User

If you have a big table try Hash Table, not SQL .

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 4868 views
  • 2 likes
  • 4 in conversation