Converting irregular time-series observations to periodical observatio...

SK12376574 · Posted 11-15-2017 06:40 PM

Hi,

I'm writing a thesis using tick-by-tick stock exchange data with "time irregular" trade price observations.

I want to "fix" the observation to periodical observations relative to the trade day open time (denoted as OpenFlag).

I used Proc SQL to do this conversion (see attached below / attached file), but it is very inefficient (my dataset is big with ~500,000 observations each day * 3,000 days * 30 variables).

Can this be done more efficiently using Proc Expand or Proc Timeseries? I already burned one hard disk trying ....

I appriciate any help I can get.

Thanks,

Saggi

Here is the code I use:

/* Create example dataset */
Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;

/*Create Fix 2 seconds time interval table */
Data FT;
set Trades;
if OpenFlag = 1;
rename TimeStamp=OpenTime;
run;
Data FT;
set FT;
Retain OpenTime FTime;
Do i= -60 to 60 by 2;
FTime=OpenTime+i;
output;
end;
Format FTime time9.;
Keep OpenTime FTime;
run;

/* Merge and create a fix time trades dataset */
PROC SQL;
CREATE TABLE WORK.FT_2Sec_Trades AS
SELECT t1.FTime,
t2.TimeStamp,
t2.OpenFlag,
t2.X
FROM FT t1
LEFT JOIN Trades t2 ON (t1.FTime >= t2.TimeStamp)
Group by t1.FTime having t2.TimeStamp = max(t2.TimeStamp)
ORDER BY t1.FTime,
t2.TimeStamp;
QUIT;

mkeintz · Posted 11-17-2017 07:57 AM

For dealing with sorted large datasets, and producing datasets compare the sort key in one record to neighboring records, I have never seen SQL come anywhere close to a well-formed data step:

Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;

data want;
  set trades (where=(openflag=1))
      trades (where=(openflag=0));
  /*by ticker;*/

  if openflag=1 then opentime=timestamp;
  retain opentime;
  format opentime time8.0;

  if abs(timestamp-opentime)<=60;
  fixed_time= round(1+timestamp-opentime,2);
  format fixed_time time8.0;
run;

This program assumes the data are sorted by timestamp. If the data are actually sorted by ticker/time, put in a "by ticker;" statement.

Edits:

If your data is sorted by ticker/timestamp, then use the "by ticker;" statement.

Also this program assumes 1 openflag=1 record per ticker.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SK12376574 · Posted 11-19-2017 12:03 PM

Thanks for the prompt reply.

Since I need the code to extrapolate the last known value, I changed the code as following.

As you can see there are two "missing" observations 09:00:07 and 09:00:09 both of them should get the last known value of x=83.

I will appreciate any help in filling out the missing observation in an efficient way (I used proc expand but it is sort of a black box for me).

Thanks,

SK

Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
data want;
set trades (where=(openflag=1))
trades (where=(openflag=0));
if openflag=1 then opentime=timestamp;
retain opentime;
format opentime time8.0;
fixed_time=opentime+ceil((timestamp-opentime)/2)*2;
format fixed_time time8.0;
run;
Proc Sort data=want;
by fixed_time timestamp;
run;
Data Want;
set want;
by fixed_time timestamp;
if last.fixed_time;
run;

mkeintz · Posted 11-19-2017 02:18 PM

x

Why are you doing a PROC SORT, by fixed_time/timestamp?
Fixed_time is a monotonic step function of timestamp. So if the incoming data is sorted by timestamp, the result is already sorted by fixed_time/timestamp. It would be a waste of disk input/output activity and time.
If all you want is the last valid value at a given fixed_time, you can do that selection in the first data step. You don't really need an extra DATA step - again a waste of disk activity.

But if you prefer that technique, you can eliminate the disk activity by making the first data set a dataset view, and then read it to make the second data set as a dataset file, as in:
     data need / view=need;
       set trades .....;
       .....
     run;
     data want;
       set need;
       by fixed_time;
       if last.fixed_time;
     run;

A data set view doesn't write data to disk, it processes the data only when referenced in a later step. Then instead of writing to disk, it streams the data to the calling step.

As to holes in your timestamps, you can change the DATA WANT step to

data want (drop=nxt_:);
merge need

need (firstobs=2 keep=fixed_time rename=(fixed_time=nxt_fixed_time)) end=end_of_merge;

if end_of_merge=0 then do fixed_time=fixed_time to nxt_fixed_time-2 by 2;

output;

end;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SK12376574 · Posted 11-20-2017 11:09 AM

Thanks again for the help.

As for your remarks:

I'm using the proc sort because the first data step used to merge the open time places the opentime observation as the first row whereas according to its timestamp (09:00:11) it should be in the middle.
In your reply, you mentioned "you can do that selection in the first data step" - how?
I didn't succeed doing that (I can do it using proc sql).
The code for "filling the holes" got different results than I expected, I would greatly appreciate it if you can elaborate.

Here are the results I'm looking for (the red rows are the "holes" I filled manually):

TimeStamp	X	OpenFlag	opentime	fixed_time
09:00:01	98	0	09:00:11	09:00:01
09:00:03	93	0	09:00:11	09:00:03
09:00:04	83	0	09:00:11	09:00:05
09:00:04	83	0	09:00:11	09:00:07
09:00:04	83	0	09:00:11	09:00:09
09:00:11	62	1	09:00:11	09:00:11
09:00:13	60	0	09:00:11	09:00:13
09:00:15	54	0	09:00:11	09:00:15
09:00:16	29	0	09:00:11	09:00:17

Thanks again,

SK

mkeintz · Posted 11-20-2017 11:55 PM

@SK12376574 wrote:

Thanks again for the help.

As for your remarks:

I'm using the proc sort because the first data step used to merge the open time places the opentime observation as the first row whereas according to its timestamp (09:00:11) it should be in the middle.

In your reply, you mentioned "you can do that selection in the first data step" - how?
I didn't succeed doing that (I can do it using proc sql).

The code for "filling the holes" got different results than I expected, I would greatly appreciate it if you can elaborate.

Here are the results I'm looking for (the red rows are the "holes" I filled manually):

TimeStamp X OpenFlag opentime fixed_time

09:00:01 98 0 09:00:11 09:00:01

09:00:03 93 0 09:00:11 09:00:03

09:00:04 83 0 09:00:11 09:00:05

09:00:04 83 0 09:00:11 09:00:07

09:00:04 83 0 09:00:11 09:00:09

09:00:11 62 1 09:00:11 09:00:11

09:00:13 60 0 09:00:11 09:00:13

09:00:15 54 0 09:00:11 09:00:15

09:00:16 29 0 09:00:11 09:00:17

Thanks again,

SK

As to your first comment: if you want a copy of the openflag record in it's proper order, then drop the second "where=" parameter. I.e. change

data want;
set trades (where=(openflag=1))
trades (where=(openflag=0));
/*by ticker;*/

to

data want;
set trades (where=(openflag=1))
trades ;
/*by ticker;*/

This allows the openflag=1 record to be read twice: once at the beginning (to extablish opentime) and a second time in chronological order, to provide other data for that trade.

More generally, this is a one-step program:

data Trades;
  input TimeStamp time9. X OpenFlag;
  Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;

data want (drop=nxt_:  _:);
  /* Read the openflag trade, followed by all trades (include 2nd read of openflag trade*/
  set trades (where=(openflag=1) in=optrade)
      trades (in=alltrades)  
      end=end_of_tr;
  /*by ticker; */

  if optrade then do;  /* if optrade set opentime and delete this record ... it will be read again*/
    opentime=timestamp;
    fixed_opentime=round(1+opentime,2);
    _open_minus_60=fixed_opentime-60;    /*Minimum acceptable fixed_time*/
    _open_plus_60=fixed_opentime+60;     /*Maximum acceptable fixed_time*/
    delete;
  end;
  retain opentime fixed_opentime _open_minus_60 _open_plus_60;
  format opentime fixed_opentime _open_minus_60 _open_plus_60 time8.0;

  /* Read ahead one record to get next fixed_time*/
  if end_of_tr=0 then set trades (firstobs=2 keep=timestamp rename=(timestamp=nxt_timestamp));
  else nxt_timestamp=_open_plus_60+2;

  fixed_time=round(1+timestamp,2);
  nxt_fixed_time=round(1+nxt_timestamp,2);
  format fixed_time nxt_fixed_time time8.0;

  if nxt_fixed_time=fixed_time then delete;     /*keep only final rec for each fixed_time*/
  if nxt_fixed_time<_open_minus_60 then delete; /*drop records that are too early*/

  if fixed_time>_open_plus_60 then stop; /*replace "stop" by "delete" if sorted by ticker/timestamp */

  do fixed_time=max(fixed_time,_open_minus_60) to min(nxt_fixed_time-2,_open_plus_60) by 2;
    output;
  end;
run;

If the data set has multiple tickers, (i.e. sorted by ticker/timestamp instead of just timestamp), then you should do 2 things:

Uncomment the /*by ticker;*/ statement
Change "stop" to "delete" in the "if fixed_time>_open_plus_60 then stop;" statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SK12376574 · Posted 11-28-2017 04:55 PM

I apologize for the delayed response - I fell ill.

Thanks for the complete solution, I highly appreciate it.

SK

SK12376574 · Posted 11-28-2017 05:36 PM

Sorry for the hassle, but I just notice the solution gives the wrong fixed_time (every 2 seconds relative to 09:12 instead of 09:11).

I changed the code a bit by subtracting 1 from the fixed_time and fixed_opentime variables (I haven't yet fully understand how the code works), so it seems to give the required result:

Thanks again,

SK

data want (drop=nxt_: _: fixed_opentime);
/* Read the openflag trade, followed by all trades (include 2nd read of openflag trade*/
set trades (where=(openflag=1) in=optrade)
trades (in=alltrades)
end=end_of_tr;
/*by ticker; */

if optrade then do; /* if optrade set opentime and delete this record ... it will be read again*/
opentime=timestamp;
fixed_opentime=round(1+opentime,2)-1;
_open_minus_60=fixed_opentime-60; /*Minimum acceptable fixed_time*/
_open_plus_60=fixed_opentime+60; /*Maximum acceptable fixed_time*/
delete;
end;
retain opentime fixed_opentime _open_minus_60 _open_plus_60;
format opentime fixed_opentime _open_minus_60 _open_plus_60 time8.0;

/* Read ahead one record to get next fixed_time*/
if end_of_tr=0 then set trades (firstobs=2 keep=timestamp rename=(timestamp=nxt_timestamp));
else nxt_timestamp=_open_plus_60+2;

fixed_time=round(1+timestamp,2)-1;
nxt_fixed_time=round(1+nxt_timestamp,2);
format fixed_time nxt_fixed_time time8.0;

if nxt_fixed_time=fixed_time then delete; /*keep only final rec for each fixed_time*/
if nxt_fixed_time<_open_minus_60 then delete; /*drop records that are too early*/

if fixed_time>_open_plus_60 then stop; /*replace "stop" by "delete" if sorted by ticker/timestamp */

do fixed_time=max(fixed_time,_open_minus_60) to min(nxt_fixed_time-2,_open_plus_60) by 2;
output;
end;
run;

Converting irregular time-series observations to periodical observations?

Re: Converting irregular time-series observations to periodical observations?

Re: Converting irregular time-series observations to periodical observations?

Re: Converting irregular time-series observations to periodical observations?

Re: Converting irregular time-series observations to periodical observations?

Re: Converting irregular time-series observations to periodical observations?

Re: Converting irregular time-series observations to periodical observations?

Re: Converting irregular time-series observations to periodical observations?

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away