Hi,
I'm writing a thesis using tick-by-tick stock exchange data with "time irregular" trade price observations.
I want to "fix" the observation to periodical observations relative to the trade day open time (denoted as OpenFlag).
I used Proc SQL to do this conversion (see attached below / attached file), but it is very inefficient (my dataset is big with ~500,000 observations each day * 3,000 days * 30 variables).
Can this be done more efficiently using Proc Expand or Proc Timeseries? I already burned one hard disk trying ....
I appriciate any help I can get.
Thanks,
Saggi
Here is the code I use:
/* Create example dataset */
Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
/*Create Fix 2 seconds time interval table */
Data FT;
set Trades;
if OpenFlag = 1;
rename TimeStamp=OpenTime;
run;
Data FT;
set FT;
Retain OpenTime FTime;
Do i= -60 to 60 by 2;
FTime=OpenTime+i;
output;
end;
Format FTime time9.;
Keep OpenTime FTime;
run;
/* Merge and create a fix time trades dataset */
PROC SQL;
CREATE TABLE WORK.FT_2Sec_Trades AS
SELECT t1.FTime,
t2.TimeStamp,
t2.OpenFlag,
t2.X
FROM FT t1
LEFT JOIN Trades t2 ON (t1.FTime >= t2.TimeStamp)
Group by t1.FTime having t2.TimeStamp = max(t2.TimeStamp)
ORDER BY t1.FTime,
t2.TimeStamp;
QUIT;
For dealing with sorted large datasets, and producing datasets compare the sort key in one record to neighboring records, I have never seen SQL come anywhere close to a well-formed data step:
Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
data want;
set trades (where=(openflag=1))
trades (where=(openflag=0));
/*by ticker;*/
if openflag=1 then opentime=timestamp;
retain opentime;
format opentime time8.0;
if abs(timestamp-opentime)<=60;
fixed_time= round(1+timestamp-opentime,2);
format fixed_time time8.0;
run;
This program assumes the data are sorted by timestamp. If the data are actually sorted by ticker/time, put in a "by ticker;" statement.
Edits:
If your data is sorted by ticker/timestamp, then use the "by ticker;" statement.
Also this program assumes 1 openflag=1 record per ticker.
Thanks for the prompt reply.
Since I need the code to extrapolate the last known value, I changed the code as following.
As you can see there are two "missing" observations 09:00:07 and 09:00:09 both of them should get the last known value of x=83.
I will appreciate any help in filling out the missing observation in an efficient way (I used proc expand but it is sort of a black box for me).
Thanks,
SK
Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
data want;
set trades (where=(openflag=1))
trades (where=(openflag=0));
if openflag=1 then opentime=timestamp;
retain opentime;
format opentime time8.0;
fixed_time=opentime+ceil((timestamp-opentime)/2)*2;
format fixed_time time8.0;
run;
Proc Sort data=want;
by fixed_time timestamp;
run;
Data Want;
set want;
by fixed_time timestamp;
if last.fixed_time;
run;
x
As to holes in your timestamps, you can change the DATA WANT step to
data want (drop=nxt_:);
merge need
need (firstobs=2 keep=fixed_time rename=(fixed_time=nxt_fixed_time)) end=end_of_merge;
if end_of_merge=0 then do fixed_time=fixed_time to nxt_fixed_time-2 by 2;
output;
end;
run;
run;
Thanks again for the help.
As for your remarks:
Here are the results I'm looking for (the red rows are the "holes" I filled manually):
TimeStamp | X | OpenFlag | opentime | fixed_time |
09:00:01 | 98 | 0 | 09:00:11 | 09:00:01 |
09:00:03 | 93 | 0 | 09:00:11 | 09:00:03 |
09:00:04 | 83 | 0 | 09:00:11 | 09:00:05 |
09:00:04 | 83 | 0 | 09:00:11 | 09:00:07 |
09:00:04 | 83 | 0 | 09:00:11 | 09:00:09 |
09:00:11 | 62 | 1 | 09:00:11 | 09:00:11 |
09:00:13 | 60 | 0 | 09:00:11 | 09:00:13 |
09:00:15 | 54 | 0 | 09:00:11 | 09:00:15 |
09:00:16 | 29 | 0 | 09:00:11 | 09:00:17 |
Thanks again,
SK
@SK12376574 wrote:
Thanks again for the help.
As for your remarks:
- I'm using the proc sort because the first data step used to merge the open time places the opentime observation as the first row whereas according to its timestamp (09:00:11) it should be in the middle.
- In your reply, you mentioned "you can do that selection in the first data step" - how?
I didn't succeed doing that (I can do it using proc sql).- The code for "filling the holes" got different results than I expected, I would greatly appreciate it if you can elaborate.
Here are the results I'm looking for (the red rows are the "holes" I filled manually):
TimeStamp X OpenFlag opentime fixed_time 09:00:01 98 0 09:00:11 09:00:01 09:00:03 93 0 09:00:11 09:00:03 09:00:04 83 0 09:00:11 09:00:05 09:00:04 83 0 09:00:11 09:00:07 09:00:04 83 0 09:00:11 09:00:09 09:00:11 62 1 09:00:11 09:00:11 09:00:13 60 0 09:00:11 09:00:13 09:00:15 54 0 09:00:11 09:00:15 09:00:16 29 0 09:00:11 09:00:17
Thanks again,
SK
As to your first comment: if you want a copy of the openflag record in it's proper order, then drop the second "where=" parameter. I.e. change
data want;
set trades (where=(openflag=1))
trades (where=(openflag=0));
/*by ticker;*/
to
data want;
set trades (where=(openflag=1))
trades ;
/*by ticker;*/
This allows the openflag=1 record to be read twice: once at the beginning (to extablish opentime) and a second time in chronological order, to provide other data for that trade.
More generally, this is a one-step program:
data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
data want (drop=nxt_: _:);
/* Read the openflag trade, followed by all trades (include 2nd read of openflag trade*/
set trades (where=(openflag=1) in=optrade)
trades (in=alltrades)
end=end_of_tr;
/*by ticker; */
if optrade then do; /* if optrade set opentime and delete this record ... it will be read again*/
opentime=timestamp;
fixed_opentime=round(1+opentime,2);
_open_minus_60=fixed_opentime-60; /*Minimum acceptable fixed_time*/
_open_plus_60=fixed_opentime+60; /*Maximum acceptable fixed_time*/
delete;
end;
retain opentime fixed_opentime _open_minus_60 _open_plus_60;
format opentime fixed_opentime _open_minus_60 _open_plus_60 time8.0;
/* Read ahead one record to get next fixed_time*/
if end_of_tr=0 then set trades (firstobs=2 keep=timestamp rename=(timestamp=nxt_timestamp));
else nxt_timestamp=_open_plus_60+2;
fixed_time=round(1+timestamp,2);
nxt_fixed_time=round(1+nxt_timestamp,2);
format fixed_time nxt_fixed_time time8.0;
if nxt_fixed_time=fixed_time then delete; /*keep only final rec for each fixed_time*/
if nxt_fixed_time<_open_minus_60 then delete; /*drop records that are too early*/
if fixed_time>_open_plus_60 then stop; /*replace "stop" by "delete" if sorted by ticker/timestamp */
do fixed_time=max(fixed_time,_open_minus_60) to min(nxt_fixed_time-2,_open_plus_60) by 2;
output;
end;
run;
If the data set has multiple tickers, (i.e. sorted by ticker/timestamp instead of just timestamp), then you should do 2 things:
I apologize for the delayed response - I fell ill.
Thanks for the complete solution, I highly appreciate it.
SK
Sorry for the hassle, but I just notice the solution gives the wrong fixed_time (every 2 seconds relative to 09:12 instead of 09:11).
I changed the code a bit by subtracting 1 from the fixed_time and fixed_opentime variables (I haven't yet fully understand how the code works), so it seems to give the required result:
Thanks again,
SK
data want (drop=nxt_: _: fixed_opentime);
/* Read the openflag trade, followed by all trades (include 2nd read of openflag trade*/
set trades (where=(openflag=1) in=optrade)
trades (in=alltrades)
end=end_of_tr;
/*by ticker; */
if optrade then do; /* if optrade set opentime and delete this record ... it will be read again*/
opentime=timestamp;
fixed_opentime=round(1+opentime,2)-1;
_open_minus_60=fixed_opentime-60; /*Minimum acceptable fixed_time*/
_open_plus_60=fixed_opentime+60; /*Maximum acceptable fixed_time*/
delete;
end;
retain opentime fixed_opentime _open_minus_60 _open_plus_60;
format opentime fixed_opentime _open_minus_60 _open_plus_60 time8.0;
/* Read ahead one record to get next fixed_time*/
if end_of_tr=0 then set trades (firstobs=2 keep=timestamp rename=(timestamp=nxt_timestamp));
else nxt_timestamp=_open_plus_60+2;
fixed_time=round(1+timestamp,2)-1;
nxt_fixed_time=round(1+nxt_timestamp,2);
format fixed_time nxt_fixed_time time8.0;
if nxt_fixed_time=fixed_time then delete; /*keep only final rec for each fixed_time*/
if nxt_fixed_time<_open_minus_60 then delete; /*drop records that are too early*/
if fixed_time>_open_plus_60 then stop; /*replace "stop" by "delete" if sorted by ticker/timestamp */
do fixed_time=max(fixed_time,_open_minus_60) to min(nxt_fixed_time-2,_open_plus_60) by 2;
output;
end;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.