BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LZHAO006
Fluorite | Level 6

Hi I have a stock quote data set with exchange information. I want to know how many minutes was a given exchange holding the quote for a given stock. Here's part of my raw data ( the column name is DATE Sym_root Time_M and EX) 

raw data.JPG

  • Here's my current code
proc sort data= rawdata;
by sym_root descending time_m;
run;

data test; set rawdata; by sym_root descending time_m; inforce = abs(dif(time_m)); if first.sym_root then inforce = max(("16:00:00.000000000"t-time_m),0); run; data test; set test; if EX = 'A' then A = inforce/60; else A =0; if EX = 'B' then B = inforce/60; else B =0; if EX = 'C' then C = inforce/60; else C =0; if EX = 'J' then J = inforce/60; else J =0; if EX = 'K' then K = inforce/60; else K =0; if EX = 'M' then M = inforce/60; else M =0; if EX = 'N' then N = inforce/60; else N =0; if EX = 'P' then P = inforce/60; else P =0; if EX = 'Q' then Q = inforce/60; else Q =0; if EX = 'T' then T = inforce/60; else T =0; if EX = 'V' then V = inforce/60; else V =0; if EX = 'X' then X = inforce/60; else X =0; if EX = 'Y' then Y = inforce/60; else Y =0; if EX = 'Z' then Z = inforce/60; else Z =0; run; proc report data=test nowd out= timeshare (drop=_BREAK_); column date sym_root A B C J K M N P Q T V X Y Z; define date / group; define sym_root / group; define A/sum 'A'; define B/sum 'B'; define C/sum 'C'; define J/sum 'J'; define K/sum 'K'; define M/sum 'M'; define N/sum 'N'; define P/sum 'P'; define Q/sum 'Q'; define T/sum 'T'; define V/sum 'V'; define X/sum 'X'; define Y/sum 'Y'; define Z/sum 'Z'; run;
  • I use 'Inforce' as variable name to calculate the time difference. because the normal stock time is from 9:30 to 16:30. so as you can see in my code, I use
inforce = max(("16:00:00.000000000"t-time_m),0)

to get the last inforce value. 

 

However, one of my code problem is sometimes more than 1 exchange will hold the quote. I tried use lag(inforce) and code ' if inforce=0 then inforce=lag_inforce'. but it doesn't work when there are more than two exchange hold quote at the same time. 

 

problem.JPG

 

I really appreciate if someone can help me solve this issue. please see the attachment of part of rawdata.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

If at all possible don't sort large data sets.  And while each daily quote file is a moderately size data set, typically with over 6,000 stocks each a new best-bid or best-offer every second, you probably want to look at a range of dates.

 

So here's how to (1) use the fact that the data is already sorted by sym_root/sym_suffix/time_m, (2) look ahead one record to see what the next time will be, thereby calculating the inforce time (3) use the data step to do the daily summary.

 

data summaries (keep=sym_root sym_suffix date sumtime_: ntime_:)  ;
    
  set have (keep=sym_root sym_suffix date time_m where=('09:30:00't <= time_m <= '16:00:00't));
  by date sym_root sym_suffix ex notsorted;

  merge have 
        have (firstobs=2 keep=time_m rename=(time_m=_nxt_time_m));
  where '09:30:00't <= time_m <= '16:00:00't ;

  retain _exch_list 'ABCJKMNPQTVXYZ';
  array _sumtim {*} SUMTIME_A SUMTIME_B SUMTIME_C SUMTIME_J SUMTIME_K
                    SUMTIME_M SUMTIME_N SUMTIME_P SUMTIME_Q SUMTIME_T
                    SUMTIME_V SUMTIME_X SUMTIME_Y SUMTIME_Z ;
  array _ntim {*}   NTIME_A NTIME_B NTIME_C NTIME_J NTIME_K
                    NTIME_M NTIME_N NTIME_P NTIME_Q NTIME_T
                    NTIME_V NTIME_X NTIME_Y NTIME_Z ;
  retain sumtime_: ntime_: ;

  if first.sym_suffix then do _i=1 to dim(_sumtim); 
    _sumtim{_i}=0;  _ntim{_i}=0;
  end;
  if last.sym_suffix then _nxt_time_m='16:00:00't;

  _i=index(_exch_list,ex);
  _sumtim{_i}+ (_nxt_time_m-time_m);
  if last.ex then _ntim{_i}+1;
  if last.sym_suffix then output summaries;
run;

 

  1. The SET statement only keeps the BY variables (I've added DATE since you might want to read multiple dates at one time.
    1. I added SYM_SUFFIX since some companies issues multiple stock classes.
    2. I've added TIME_M to read only records during market hours, which you can see used in the "WHERE=" parameter.
    3. The data source you are is WRDS, and each daily data set is sorted by DATE (a constant), SYM_ROOT, SYM_SUFFIX, TIME_M, (and a quote sequence number whose varname I've forgotten).
    4. Keeping the BY variables allows a simple way to initialize values for sum of time-spans and count of time-spans at the start of each stock, and to assign a _NXT_TIME_M='16:00:00't at the last market hour quote.
  2. The MERGE statement merges each record of HAVE with the next record.  But the next record keeps only the TIME_M value, renamed to _NXT_TIME_M to avoid overwriting TIME_M from the current record.  This is the way to avoid descending sorts.
    1. But note the MERGE statement must be accompanied by a where statement that exactly replicate the where filter applied to the set statement.
  3. The index function finds the position of your exchange id in a sequence of exchange characters.  So exchange A has index 1, exchange J has index 4, etc.  This provides a direct way to add time values to the corresponding variables in the _SUMTIME and _NTIM variable arrays.
    1. Also note the NTIME_ variables don't count quote records.  Instead they count the number of "inforce" time spans.  I.e. while a sequence of 3 EX=A records adds all 3 inforce times to SUMTIME_A, it only increments NTIME_A by 1.  But if this is not what you want, the program can easily be modified.
    2. This is why you see EX as the last by-variable.  It provides a way to know when a given exchange starts (and ends) a series of quotes.  It's also why you see NOTSORTED, since exchanges may have clusters of consecutive records, but they won't be in sorted order.

Finally, this data step generates totals, so you don't need the proc report to re-read all the data just to summarize.  But if you actually want to keep all the INFORCE time spans, you could use the DATA step to simultaneously generate the summary data set as well as a data set keeping each quote, with the additional inforce time variable.  Let me know if this is wanted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

9 REPLIES 9
ballardw
Super User

1) define " was a given exchange holding the quote for a given stock"

Since your data does not have an actual quote value "holding" seems problematic.

 

Something along these lines:

proc summary data=rawdata nway;
   class sym_root ex date;
   var time_m;
   output out=work.rawsummary (drop=_type_ _freq_) range=   ;
run;

will get the range of your time_m variable  (latest - first) for each date, symbol an ex. The value would be seconds though the code above will inherit the time format from time_m.

 

 

If you want something where "holding" refers to the same value of a quote you will need to include that quote and add that to the proc summary code.

 

You N exchange gets a 0 because there is only one record in the data. So there is no actual "duration" to calculate.

LZHAO006
Fluorite | Level 6

Hi ballardw, t

 

 

 

 

PGStats
Opal | Level 21

How about

 

proc sql;
select
    sym_root,
    date,
    ex,
    range(time_m) as holdingDuration format=time20.9
from sasforum.rawdata
group by sym_root, date, ex;
quit;
                                     Exchange
                                     that
                Security             issued
                symbol      Date of  the
                root          quote  quote          holdingDuration
                ---------------------------------------------------
                AKS       01OCT2018  B            0:00:01.005179000
                AKS       01OCT2018  J            0:00:00.041457000
                AKS       01OCT2018  K            0:00:01.006823000
                AKS       01OCT2018  N            0:00:00.000000000
                AKS       01OCT2018  P            0:00:01.006679000
                AKS       01OCT2018  T            0:00:01.005284000
                AKS       01OCT2018  X            0:00:00.039256000
                AKS       01OCT2018  Y            0:00:00.039171000
                AKS       01OCT2018  Z            0:00:01.001942000

Seems odd to me that all records about a given stock at some exchange on a given day would refer to the same quote...

PG
LZHAO006
Fluorite | Level 6

Hi PG, thank you very much for help me, 

 

 

 

 

PGStats
Opal | Level 21

Not sure if you want max or total, and what to do when the exchange repeats, but here is a start

 

proc sort data=sasforum.rawdata out=rawData; by sym_root date time_m; run;

data temp;
set rawdata(keep=time_m rename=time_m=nextTime);
if _n_ > 1 then do;
    set rawdata; by sym_root date;
    if not last.date then holdTime = nextTime - time_m;
    output;
    end;
run;

proc sql;
create table wantList as
select
    sym_root,
    date,
    ex,
    max(holdTime) as longestHoldingDuration format=mmss.,
    sum(holdTime) as totalHoldingDuration format=mmss.
from temp
group by sym_root, date, ex;
quit;

proc transpose data=wantList out=want(drop=_name_);
by sym_root date;
var totalHoldingDuration;
id ex;
run;
PG
LZHAO006
Fluorite | Level 6

Hi PG, thanks again! I use your code in my dataset, however, I found that the 'holdTime' will be 0 if there are more than one exchange have the highest bid quote at the same time. please see the following picture where I highlight

( at time 9:30:00:004505000, exchange T,X,J have the same highest quote price, and these three exchange hold the best quote for 0.000021 second until exchange T, X,B provide the higher same bid quote price at 9:30:00.004526000 )

 ex probelm.JPG

 

I see the code calculate the hold time (0.000021) for exchange J at the time that I highlighted , could you tell me how can I also let SAS calculate the hold time (0.000021 same as exchange J) for exchange T and X?  Thank you very much! 

mkeintz
PROC Star

If at all possible don't sort large data sets.  And while each daily quote file is a moderately size data set, typically with over 6,000 stocks each a new best-bid or best-offer every second, you probably want to look at a range of dates.

 

So here's how to (1) use the fact that the data is already sorted by sym_root/sym_suffix/time_m, (2) look ahead one record to see what the next time will be, thereby calculating the inforce time (3) use the data step to do the daily summary.

 

data summaries (keep=sym_root sym_suffix date sumtime_: ntime_:)  ;
    
  set have (keep=sym_root sym_suffix date time_m where=('09:30:00't <= time_m <= '16:00:00't));
  by date sym_root sym_suffix ex notsorted;

  merge have 
        have (firstobs=2 keep=time_m rename=(time_m=_nxt_time_m));
  where '09:30:00't <= time_m <= '16:00:00't ;

  retain _exch_list 'ABCJKMNPQTVXYZ';
  array _sumtim {*} SUMTIME_A SUMTIME_B SUMTIME_C SUMTIME_J SUMTIME_K
                    SUMTIME_M SUMTIME_N SUMTIME_P SUMTIME_Q SUMTIME_T
                    SUMTIME_V SUMTIME_X SUMTIME_Y SUMTIME_Z ;
  array _ntim {*}   NTIME_A NTIME_B NTIME_C NTIME_J NTIME_K
                    NTIME_M NTIME_N NTIME_P NTIME_Q NTIME_T
                    NTIME_V NTIME_X NTIME_Y NTIME_Z ;
  retain sumtime_: ntime_: ;

  if first.sym_suffix then do _i=1 to dim(_sumtim); 
    _sumtim{_i}=0;  _ntim{_i}=0;
  end;
  if last.sym_suffix then _nxt_time_m='16:00:00't;

  _i=index(_exch_list,ex);
  _sumtim{_i}+ (_nxt_time_m-time_m);
  if last.ex then _ntim{_i}+1;
  if last.sym_suffix then output summaries;
run;

 

  1. The SET statement only keeps the BY variables (I've added DATE since you might want to read multiple dates at one time.
    1. I added SYM_SUFFIX since some companies issues multiple stock classes.
    2. I've added TIME_M to read only records during market hours, which you can see used in the "WHERE=" parameter.
    3. The data source you are is WRDS, and each daily data set is sorted by DATE (a constant), SYM_ROOT, SYM_SUFFIX, TIME_M, (and a quote sequence number whose varname I've forgotten).
    4. Keeping the BY variables allows a simple way to initialize values for sum of time-spans and count of time-spans at the start of each stock, and to assign a _NXT_TIME_M='16:00:00't at the last market hour quote.
  2. The MERGE statement merges each record of HAVE with the next record.  But the next record keeps only the TIME_M value, renamed to _NXT_TIME_M to avoid overwriting TIME_M from the current record.  This is the way to avoid descending sorts.
    1. But note the MERGE statement must be accompanied by a where statement that exactly replicate the where filter applied to the set statement.
  3. The index function finds the position of your exchange id in a sequence of exchange characters.  So exchange A has index 1, exchange J has index 4, etc.  This provides a direct way to add time values to the corresponding variables in the _SUMTIME and _NTIM variable arrays.
    1. Also note the NTIME_ variables don't count quote records.  Instead they count the number of "inforce" time spans.  I.e. while a sequence of 3 EX=A records adds all 3 inforce times to SUMTIME_A, it only increments NTIME_A by 1.  But if this is not what you want, the program can easily be modified.
    2. This is why you see EX as the last by-variable.  It provides a way to know when a given exchange starts (and ends) a series of quotes.  It's also why you see NOTSORTED, since exchanges may have clusters of consecutive records, but they won't be in sorted order.

Finally, this data step generates totals, so you don't need the proc report to re-read all the data just to summarize.  But if you actually want to keep all the INFORCE time spans, you could use the DATA step to simultaneously generate the summary data set as well as a data set keeping each quote, with the additional inforce time variable.  Let me know if this is wanted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

How do you know that two exchanges with the same time stamps means they are both at national best bid (NBB) or best offer (NBO)?  Is that how the data set was constructed?  I ask because that is not how the TAQ (trade-and-quote) data is distributed by the NYSE.

 

Also sometimes a tie occurs, not when two exchanges simultaneously issue a new best bid or offer, but rather when one exchange follows (after a time lag) another at NBB or NBO.  So the other side of my question is, are you ignoring ties when one exchange lags behind another?

 

Now you may have considered these issues already and perhaps your strategy is to assign the INFORCE time only to the first exchange at BB or BO, and you don't see how to identify the "first" for (apparent) simultaneity.  But, even though the time stamps are identical, the actual data arrive at the consolidated re-publisher in a series (in fact they have different quote identifiers representing their order).  Given that much trading is based on high-frequency programmed trading, it's quite likely that actual trading is in response to the first quote in a pair of identical timestamps.  I.e. you might want to treat the situation of equal timestamps for the same NBB or NBO, as you would for unequal timestamps at the same NBB or NBO.

 

.  And a programmed

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
LZHAO006
Fluorite | Level 6

Hi Mkeintz, 

I didn't consider the quote sequence number when I program my code. With your explanation, now I have more understanding towards the TAQ data. Thank you very much for cleaning up my confusion! you are a life saver 🙂

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 2136 views
  • 3 likes
  • 4 in conversation