Solved: Re: how to count time intervals with different group

LZHAO006 · Posted 12-21-2018 03:12 PM

Hi I have a stock quote data set with exchange information. I want to know how many minutes was a given exchange holding the quote for a given stock. Here's part of my raw data ( the column name is DATE Sym_root Time_M and EX)

Here's my current code

proc sort data= rawdata;
       by sym_root descending time_m;
run;

data test;
	set rawdata;
	by sym_root descending time_m;
		inforce = abs(dif(time_m));
			if first.sym_root
			then inforce = max(("16:00:00.000000000"t-time_m),0);
		run;

data test;
	set test;
		if EX = 'A' then A = inforce/60; else A =0;
		if EX = 'B' then B = inforce/60; else B =0;
		if EX = 'C' then C = inforce/60; else C =0;
		if EX = 'J' then J = inforce/60; else J =0;
		if EX = 'K' then K = inforce/60; else K =0;
		if EX = 'M' then M = inforce/60; else M =0;
		if EX = 'N' then N = inforce/60; else N =0;
		if EX = 'P' then P = inforce/60; else P =0;
		if EX = 'Q' then Q = inforce/60; else Q =0;
		if EX = 'T' then T = inforce/60; else T =0;
		if EX = 'V' then V = inforce/60; else V =0;
		if EX = 'X' then X = inforce/60; else X =0;
		if EX = 'Y' then Y = inforce/60; else Y =0;
		if EX = 'Z' then Z = inforce/60; else Z =0;
run;

proc report data=test nowd out= timeshare (drop=_BREAK_);
	column date sym_root A B C J K M N P Q T V X Y Z;
	define date / group;
	define sym_root / group;
	define A/sum 'A';
	define B/sum 'B';
	define C/sum 'C';
	define J/sum 'J';
	define K/sum 'K';
	define M/sum 'M';
	define N/sum 'N';
	define P/sum 'P';
	define Q/sum 'Q';
	define T/sum 'T';
	define V/sum 'V';
	define X/sum 'X';
	define Y/sum 'Y';
	define Z/sum 'Z';
run;

I use 'Inforce' as variable name to calculate the time difference. because the normal stock time is from 9:30 to 16:30. so as you can see in my code, I use

inforce = max(("16:00:00.000000000"t-time_m),0)

to get the last inforce value.

However, one of my code problem is sometimes more than 1 exchange will hold the quote. I tried use lag(inforce) and code ' if inforce=0 then inforce=lag_inforce'. but it doesn't work when there are more than two exchange hold quote at the same time.

I really appreciate if someone can help me solve this issue. please see the attachment of part of rawdata.

mkeintz · Posted 12-22-2018 12:20 PM

If at all possible don't sort large data sets. And while each daily quote file is a moderately size data set, typically with over 6,000 stocks each a new best-bid or best-offer every second, you probably want to look at a range of dates.

So here's how to (1) use the fact that the data is already sorted by sym_root/sym_suffix/time_m, (2) look ahead one record to see what the next time will be, thereby calculating the inforce time (3) use the data step to do the daily summary.

data summaries (keep=sym_root sym_suffix date sumtime_: ntime_:)  ;
    
  set have (keep=sym_root sym_suffix date time_m where=('09:30:00't <= time_m <= '16:00:00't));
  by date sym_root sym_suffix ex notsorted;

  merge have 
        have (firstobs=2 keep=time_m rename=(time_m=_nxt_time_m));
  where '09:30:00't <= time_m <= '16:00:00't ;

  retain _exch_list 'ABCJKMNPQTVXYZ';
  array _sumtim {*} SUMTIME_A SUMTIME_B SUMTIME_C SUMTIME_J SUMTIME_K
                    SUMTIME_M SUMTIME_N SUMTIME_P SUMTIME_Q SUMTIME_T
                    SUMTIME_V SUMTIME_X SUMTIME_Y SUMTIME_Z ;
  array _ntim {*}   NTIME_A NTIME_B NTIME_C NTIME_J NTIME_K
                    NTIME_M NTIME_N NTIME_P NTIME_Q NTIME_T
                    NTIME_V NTIME_X NTIME_Y NTIME_Z ;
  retain sumtime_: ntime_: ;

  if first.sym_suffix then do _i=1 to dim(_sumtim); 
    _sumtim{_i}=0;  _ntim{_i}=0;
  end;
  if last.sym_suffix then _nxt_time_m='16:00:00't;

  _i=index(_exch_list,ex);
  _sumtim{_i}+ (_nxt_time_m-time_m);
  if last.ex then _ntim{_i}+1;
  if last.sym_suffix then output summaries;
run;

The SET statement only keeps the BY variables (I've added DATE since you might want to read multiple dates at one time.
1. I added SYM_SUFFIX since some companies issues multiple stock classes.
2. I've added TIME_M to read only records during market hours, which you can see used in the "WHERE=" parameter.
3. The data source you are is WRDS, and each daily data set is sorted by DATE (a constant), SYM_ROOT, SYM_SUFFIX, TIME_M, (and a quote sequence number whose varname I've forgotten).
4. Keeping the BY variables allows a simple way to initialize values for sum of time-spans and count of time-spans at the start of each stock, and to assign a _NXT_TIME_M='16:00:00't at the last market hour quote.
The MERGE statement merges each record of HAVE with the next record. But the next record keeps only the TIME_M value, renamed to _NXT_TIME_M to avoid overwriting TIME_M from the current record. This is the way to avoid descending sorts.
1. But note the MERGE statement must be accompanied by a where statement that exactly replicate the where filter applied to the set statement.
The index function finds the position of your exchange id in a sequence of exchange characters. So exchange A has index 1, exchange J has index 4, etc. This provides a direct way to add time values to the corresponding variables in the _SUMTIME and _NTIM variable arrays.
1. Also note the NTIME_ variables don't count quote records. Instead they count the number of "inforce" time spans. I.e. while a sequence of 3 EX=A records adds all 3 inforce times to SUMTIME_A, it only increments NTIME_A by 1. But if this is not what you want, the program can easily be modified.
2. This is why you see EX as the last by-variable. It provides a way to know when a given exchange starts (and ends) a series of quotes. It's also why you see NOTSORTED, since exchanges may have clusters of consecutive records, but they won't be in sorted order.

Finally, this data step generates totals, so you don't need the proc report to re-read all the data just to summarize. But if you actually want to keep all the INFORCE time spans, you could use the DATA step to simultaneously generate the summary data set as well as a data set keeping each quote, with the additional inforce time variable. Let me know if this is wanted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

ballardw · Posted 12-21-2018 03:30 PM

1) define " was a given exchange holding the quote for a given stock"

Since your data does not have an actual quote value "holding" seems problematic.

Something along these lines:

proc summary data=rawdata nway;
   class sym_root ex date;
   var time_m;
   output out=work.rawsummary (drop=_type_ _freq_) range=   ;
run;

will get the range of your time_m variable (latest - first) for each date, symbol an ex. The value would be seconds though the code above will inherit the time format from time_m.

If you want something where "holding" refers to the same value of a quote you will need to include that quote and add that to the proc summary code.

You N exchange gets a 0 because there is only one record in the data. So there is no actual "duration" to calculate.

LZHAO006 · Posted 12-21-2018 07:09 PM

Hi ballardw, thank you very much for help!

I see your code. sorry for the unclear explanation, this is what I'm looking for.... Let's say just for one stock A, people bid A with different price quote , and at 9:30:0000, someone in exchange N bid the highest quote.

Sym_root Time Exchange

A 9:30:0000 N

and this best bid quote keep in 3 minute until other exchange Q provide a higher bid quote. and then five minute later, another exchange Z provide a new highest bid quote...

Sym_root Time Exchange

A 9:30:0000 N

A 9:33:0000 Q

A 9:38:0000 Z

what I want to do is I want calculate how many minutes did each exchange hold the best bid quote for a given stock. ( in my example, exchange N hold 3 min and exchange Q hold 5 min)

Because my data is huge and I have more than 3000 stocks in my database, therefore I would like arrange the column as followings. and I also want to convert time from second to minute.

PGStats · Posted 12-21-2018 05:12 PM

How about

proc sql;
select
    sym_root,
    date,
    ex,
    range(time_m) as holdingDuration format=time20.9
from sasforum.rawdata
group by sym_root, date, ex;
quit;

                                     Exchange
                                     that
                Security             issued
                symbol      Date of  the
                root          quote  quote          holdingDuration
                ---------------------------------------------------
                AKS       01OCT2018  B            0:00:01.005179000
                AKS       01OCT2018  J            0:00:00.041457000
                AKS       01OCT2018  K            0:00:01.006823000
                AKS       01OCT2018  N            0:00:00.000000000
                AKS       01OCT2018  P            0:00:01.006679000
                AKS       01OCT2018  T            0:00:01.005284000
                AKS       01OCT2018  X            0:00:00.039256000
                AKS       01OCT2018  Y            0:00:00.039171000
                AKS       01OCT2018  Z            0:00:01.001942000

Seems odd to me that all records about a given stock at some exchange on a given day would refer to the same quote...

PG

LZHAO006 · Posted 12-21-2018 07:12 PM

Hi PG, thank you very much for help me,

Sorry for the unclear explanation, this is what I'm looking for....

Let's say just for one stock A, people bid A with different price quote , and at 9:30:0000, someone in exchange N bid the highest quote.

Sym_root Time Exchange

A 9:30:0000 N

and this best bid quote keep in 3 minute until other exchange Q provide a higher bid quote. and then five minute later, another exchange Z provide a new highest bid quote...

Sym_root Time Exchange

A 9:30:0000 N

A 9:33:0000 Q

A 9:38:0000 Z

what I want to do is I want calculate how many minutes did each exchange hold the best bid quote for a given stock. ( in my example, exchange N hold 3 min and exchange Q hold 5 min)

Because my data is huge and I have more than 3000 stocks in my database, therefore I would like arrange the column as followings. and I also want to convert time from second to minute.

PGStats · Posted 12-22-2018 12:40 AM

Not sure if you want max or total, and what to do when the exchange repeats, but here is a start

proc sort data=sasforum.rawdata out=rawData; by sym_root date time_m; run;

data temp;
set rawdata(keep=time_m rename=time_m=nextTime);
if _n_ > 1 then do;
    set rawdata; by sym_root date;
    if not last.date then holdTime = nextTime - time_m;
    output;
    end;
run;

proc sql;
create table wantList as
select
    sym_root,
    date,
    ex,
    max(holdTime) as longestHoldingDuration format=mmss.,
    sum(holdTime) as totalHoldingDuration format=mmss.
from temp
group by sym_root, date, ex;
quit;

proc transpose data=wantList out=want(drop=_name_);
by sym_root date;
var totalHoldingDuration;
id ex;
run;

PG

LZHAO006 · Posted 12-22-2018 11:34 AM

Hi PG, thanks again! I use your code in my dataset, however, I found that the 'holdTime' will be 0 if there are more than one exchange have the highest bid quote at the same time. please see the following picture where I highlight

( at time 9:30:00:004505000, exchange T,X,J have the same highest quote price, and these three exchange hold the best quote for 0.000021 second until exchange T, X,B provide the higher same bid quote price at 9:30:00.004526000 )

I see the code calculate the hold time (0.000021) for exchange J at the time that I highlighted , could you tell me how can I also let SAS calculate the hold time (0.000021 same as exchange J) for exchange T and X? Thank you very much!

mkeintz · Posted 12-22-2018 12:20 PM

If at all possible don't sort large data sets. And while each daily quote file is a moderately size data set, typically with over 6,000 stocks each a new best-bid or best-offer every second, you probably want to look at a range of dates.

So here's how to (1) use the fact that the data is already sorted by sym_root/sym_suffix/time_m, (2) look ahead one record to see what the next time will be, thereby calculating the inforce time (3) use the data step to do the daily summary.

data summaries (keep=sym_root sym_suffix date sumtime_: ntime_:)  ;
    
  set have (keep=sym_root sym_suffix date time_m where=('09:30:00't <= time_m <= '16:00:00't));
  by date sym_root sym_suffix ex notsorted;

  merge have 
        have (firstobs=2 keep=time_m rename=(time_m=_nxt_time_m));
  where '09:30:00't <= time_m <= '16:00:00't ;

  retain _exch_list 'ABCJKMNPQTVXYZ';
  array _sumtim {*} SUMTIME_A SUMTIME_B SUMTIME_C SUMTIME_J SUMTIME_K
                    SUMTIME_M SUMTIME_N SUMTIME_P SUMTIME_Q SUMTIME_T
                    SUMTIME_V SUMTIME_X SUMTIME_Y SUMTIME_Z ;
  array _ntim {*}   NTIME_A NTIME_B NTIME_C NTIME_J NTIME_K
                    NTIME_M NTIME_N NTIME_P NTIME_Q NTIME_T
                    NTIME_V NTIME_X NTIME_Y NTIME_Z ;
  retain sumtime_: ntime_: ;

  if first.sym_suffix then do _i=1 to dim(_sumtim); 
    _sumtim{_i}=0;  _ntim{_i}=0;
  end;
  if last.sym_suffix then _nxt_time_m='16:00:00't;

  _i=index(_exch_list,ex);
  _sumtim{_i}+ (_nxt_time_m-time_m);
  if last.ex then _ntim{_i}+1;
  if last.sym_suffix then output summaries;
run;

The SET statement only keeps the BY variables (I've added DATE since you might want to read multiple dates at one time.
1. I added SYM_SUFFIX since some companies issues multiple stock classes.
2. I've added TIME_M to read only records during market hours, which you can see used in the "WHERE=" parameter.
3. The data source you are is WRDS, and each daily data set is sorted by DATE (a constant), SYM_ROOT, SYM_SUFFIX, TIME_M, (and a quote sequence number whose varname I've forgotten).
4. Keeping the BY variables allows a simple way to initialize values for sum of time-spans and count of time-spans at the start of each stock, and to assign a _NXT_TIME_M='16:00:00't at the last market hour quote.
The MERGE statement merges each record of HAVE with the next record. But the next record keeps only the TIME_M value, renamed to _NXT_TIME_M to avoid overwriting TIME_M from the current record. This is the way to avoid descending sorts.
1. But note the MERGE statement must be accompanied by a where statement that exactly replicate the where filter applied to the set statement.
The index function finds the position of your exchange id in a sequence of exchange characters. So exchange A has index 1, exchange J has index 4, etc. This provides a direct way to add time values to the corresponding variables in the _SUMTIME and _NTIM variable arrays.
1. Also note the NTIME_ variables don't count quote records. Instead they count the number of "inforce" time spans. I.e. while a sequence of 3 EX=A records adds all 3 inforce times to SUMTIME_A, it only increments NTIME_A by 1. But if this is not what you want, the program can easily be modified.
2. This is why you see EX as the last by-variable. It provides a way to know when a given exchange starts (and ends) a series of quotes. It's also why you see NOTSORTED, since exchanges may have clusters of consecutive records, but they won't be in sorted order.

Finally, this data step generates totals, so you don't need the proc report to re-read all the data just to summarize. But if you actually want to keep all the INFORCE time spans, you could use the DATA step to simultaneously generate the summary data set as well as a data set keeping each quote, with the additional inforce time variable. Let me know if this is wanted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 12-22-2018 06:58 PM

How do you know that two exchanges with the same time stamps means they are both at national best bid (NBB) or best offer (NBO)? Is that how the data set was constructed? I ask because that is not how the TAQ (trade-and-quote) data is distributed by the NYSE.

Also sometimes a tie occurs, not when two exchanges simultaneously issue a new best bid or offer, but rather when one exchange follows (after a time lag) another at NBB or NBO. So the other side of my question is, are you ignoring ties when one exchange lags behind another?

Now you may have considered these issues already and perhaps your strategy is to assign the INFORCE time only to the first exchange at BB or BO, and you don't see how to identify the "first" for (apparent) simultaneity. But, even though the time stamps are identical, the actual data arrive at the consolidated re-publisher in a series (in fact they have different quote identifiers representing their order). Given that much trading is based on high-frequency programmed trading, it's quite likely that actual trading is in response to the first quote in a pair of identical timestamps. I.e. you might want to treat the situation of equal timestamps for the same NBB or NBO, as you would for unequal timestamps at the same NBB or NBO.

. And a programmed

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

LZHAO006 · Posted 12-23-2018 11:34 AM

Hi Mkeintz,

I didn't consider the quote sequence number when I program my code. With your explanation, now I have more understanding towards the TAQ data. Thank you very much for cleaning up my confusion! you are a life saver 🙂

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!