Hi,
I'm writing a thesis using tick-by-tick stock exchange data with "time irregular" trade price observations.
I want to "fix" the observation to periodical observations relative to the trade day open time (denoted as OpenFlag).
I used Proc SQL to do this conversion (see attached below / attached file), but it is very inefficient (my dataset is big with ~500,000 observations each day * 3,000 days * 30 variables).
Can this be done more efficiently using Proc Expand or Proc Timeseries? I already burned one hard disk trying ....
I appriciate any help I can get.
Thanks,
Saggi
Here is the code I use:
/* Create example dataset */
Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
/*Create Fix 2 seconds time interval table */
Data FT;
set Trades;
if OpenFlag = 1;
rename TimeStamp=OpenTime;
run;
Data FT;
set FT;
Retain OpenTime FTime;
Do i= -60 to 60 by 2;
FTime=OpenTime+i;
output;
end;
Format FTime time9.;
Keep OpenTime FTime;
run;
/* Merge and create a fix time trades dataset */
PROC SQL;
CREATE TABLE WORK.FT_2Sec_Trades AS
SELECT t1.FTime,
t2.TimeStamp,
t2.OpenFlag,
t2.X
FROM FT t1
LEFT JOIN Trades t2 ON (t1.FTime >= t2.TimeStamp)
Group by t1.FTime having t2.TimeStamp = max(t2.TimeStamp)
ORDER BY t1.FTime,
t2.TimeStamp;
QUIT;
For dealing with sorted large datasets, and producing datasets compare the sort key in one record to neighboring records, I have never seen SQL come anywhere close to a well-formed data step:
Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
data want;
set trades (where=(openflag=1))
trades (where=(openflag=0));
/*by ticker;*/
if openflag=1 then opentime=timestamp;
retain opentime;
format opentime time8.0;
if abs(timestamp-opentime)<=60;
fixed_time= round(1+timestamp-opentime,2);
format fixed_time time8.0;
run;
This program assumes the data are sorted by timestamp. If the data are actually sorted by ticker/time, put in a "by ticker;" statement.
Edits:
If your data is sorted by ticker/timestamp, then use the "by ticker;" statement.
Also this program assumes 1 openflag=1 record per ticker.
Thanks for the prompt reply.
Since I need the code to extrapolate the last known value, I changed the code as following.
As you can see there are two "missing" observations 09:00:07 and 09:00:09 both of them should get the last known value of x=83.
I will appreciate any help in filling out the missing observation in an efficient way (I used proc expand but it is sort of a black box for me).
Thanks,
SK
Data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
data want;
set trades (where=(openflag=1))
trades (where=(openflag=0));
if openflag=1 then opentime=timestamp;
retain opentime;
format opentime time8.0;
fixed_time=opentime+ceil((timestamp-opentime)/2)*2;
format fixed_time time8.0;
run;
Proc Sort data=want;
by fixed_time timestamp;
run;
Data Want;
set want;
by fixed_time timestamp;
if last.fixed_time;
run;
x
As to holes in your timestamps, you can change the DATA WANT step to
data want (drop=nxt_:);
merge need
need (firstobs=2 keep=fixed_time rename=(fixed_time=nxt_fixed_time)) end=end_of_merge;
if end_of_merge=0 then do fixed_time=fixed_time to nxt_fixed_time-2 by 2;
output;
end;
run;
run;
Thanks again for the help.
As for your remarks:
Here are the results I'm looking for (the red rows are the "holes" I filled manually):
TimeStamp | X | OpenFlag | opentime | fixed_time |
09:00:01 | 98 | 0 | 09:00:11 | 09:00:01 |
09:00:03 | 93 | 0 | 09:00:11 | 09:00:03 |
09:00:04 | 83 | 0 | 09:00:11 | 09:00:05 |
09:00:04 | 83 | 0 | 09:00:11 | 09:00:07 |
09:00:04 | 83 | 0 | 09:00:11 | 09:00:09 |
09:00:11 | 62 | 1 | 09:00:11 | 09:00:11 |
09:00:13 | 60 | 0 | 09:00:11 | 09:00:13 |
09:00:15 | 54 | 0 | 09:00:11 | 09:00:15 |
09:00:16 | 29 | 0 | 09:00:11 | 09:00:17 |
Thanks again,
SK
@SK12376574 wrote:
Thanks again for the help.
As for your remarks:
- I'm using the proc sort because the first data step used to merge the open time places the opentime observation as the first row whereas according to its timestamp (09:00:11) it should be in the middle.
- In your reply, you mentioned "you can do that selection in the first data step" - how?
I didn't succeed doing that (I can do it using proc sql).- The code for "filling the holes" got different results than I expected, I would greatly appreciate it if you can elaborate.
Here are the results I'm looking for (the red rows are the "holes" I filled manually):
TimeStamp X OpenFlag opentime fixed_time 09:00:01 98 0 09:00:11 09:00:01 09:00:03 93 0 09:00:11 09:00:03 09:00:04 83 0 09:00:11 09:00:05 09:00:04 83 0 09:00:11 09:00:07 09:00:04 83 0 09:00:11 09:00:09 09:00:11 62 1 09:00:11 09:00:11 09:00:13 60 0 09:00:11 09:00:13 09:00:15 54 0 09:00:11 09:00:15 09:00:16 29 0 09:00:11 09:00:17
Thanks again,
SK
As to your first comment: if you want a copy of the openflag record in it's proper order, then drop the second "where=" parameter. I.e. change
data want;
set trades (where=(openflag=1))
trades (where=(openflag=0));
/*by ticker;*/
to
data want;
set trades (where=(openflag=1))
trades ;
/*by ticker;*/
This allows the openflag=1 record to be read twice: once at the beginning (to extablish opentime) and a second time in chronological order, to provide other data for that trade.
More generally, this is a one-step program:
data Trades;
input TimeStamp time9. X OpenFlag;
Format TimeStamp time9.;
datalines;
09:00:00 61 0
09:00:01 98 0
09:00:02 54 0
09:00:03 93 0
09:00:04 83 0
09:00:10 42 0
09:00:11 62 1
09:00:12 46 0
09:00:13 60 0
09:00:14 92 0
09:00:15 54 0
09:00:16 29 0
;
run;
data want (drop=nxt_: _:);
/* Read the openflag trade, followed by all trades (include 2nd read of openflag trade*/
set trades (where=(openflag=1) in=optrade)
trades (in=alltrades)
end=end_of_tr;
/*by ticker; */
if optrade then do; /* if optrade set opentime and delete this record ... it will be read again*/
opentime=timestamp;
fixed_opentime=round(1+opentime,2);
_open_minus_60=fixed_opentime-60; /*Minimum acceptable fixed_time*/
_open_plus_60=fixed_opentime+60; /*Maximum acceptable fixed_time*/
delete;
end;
retain opentime fixed_opentime _open_minus_60 _open_plus_60;
format opentime fixed_opentime _open_minus_60 _open_plus_60 time8.0;
/* Read ahead one record to get next fixed_time*/
if end_of_tr=0 then set trades (firstobs=2 keep=timestamp rename=(timestamp=nxt_timestamp));
else nxt_timestamp=_open_plus_60+2;
fixed_time=round(1+timestamp,2);
nxt_fixed_time=round(1+nxt_timestamp,2);
format fixed_time nxt_fixed_time time8.0;
if nxt_fixed_time=fixed_time then delete; /*keep only final rec for each fixed_time*/
if nxt_fixed_time<_open_minus_60 then delete; /*drop records that are too early*/
if fixed_time>_open_plus_60 then stop; /*replace "stop" by "delete" if sorted by ticker/timestamp */
do fixed_time=max(fixed_time,_open_minus_60) to min(nxt_fixed_time-2,_open_plus_60) by 2;
output;
end;
run;
If the data set has multiple tickers, (i.e. sorted by ticker/timestamp instead of just timestamp), then you should do 2 things:
I apologize for the delayed response - I fell ill.
Thanks for the complete solution, I highly appreciate it.
SK
Sorry for the hassle, but I just notice the solution gives the wrong fixed_time (every 2 seconds relative to 09:12 instead of 09:11).
I changed the code a bit by subtracting 1 from the fixed_time and fixed_opentime variables (I haven't yet fully understand how the code works), so it seems to give the required result:
Thanks again,
SK
data want (drop=nxt_: _: fixed_opentime);
/* Read the openflag trade, followed by all trades (include 2nd read of openflag trade*/
set trades (where=(openflag=1) in=optrade)
trades (in=alltrades)
end=end_of_tr;
/*by ticker; */
if optrade then do; /* if optrade set opentime and delete this record ... it will be read again*/
opentime=timestamp;
fixed_opentime=round(1+opentime,2)-1;
_open_minus_60=fixed_opentime-60; /*Minimum acceptable fixed_time*/
_open_plus_60=fixed_opentime+60; /*Maximum acceptable fixed_time*/
delete;
end;
retain opentime fixed_opentime _open_minus_60 _open_plus_60;
format opentime fixed_opentime _open_minus_60 _open_plus_60 time8.0;
/* Read ahead one record to get next fixed_time*/
if end_of_tr=0 then set trades (firstobs=2 keep=timestamp rename=(timestamp=nxt_timestamp));
else nxt_timestamp=_open_plus_60+2;
fixed_time=round(1+timestamp,2)-1;
nxt_fixed_time=round(1+nxt_timestamp,2);
format fixed_time nxt_fixed_time time8.0;
if nxt_fixed_time=fixed_time then delete; /*keep only final rec for each fixed_time*/
if nxt_fixed_time<_open_minus_60 then delete; /*drop records that are too early*/
if fixed_time>_open_plus_60 then stop; /*replace "stop" by "delete" if sorted by ticker/timestamp */
do fixed_time=max(fixed_time,_open_minus_60) to min(nxt_fixed_time-2,_open_plus_60) by 2;
output;
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.