07-24-2014 11:44 AM
I have a financial time series dataset with a frequency of seconds.
To perform statistical analysis I need to search and locate for every record (row), the record which is 5minutes ahead of time.
Then i would like to retain from the record which is 5minutes ahead two observations, the columns bidprc and offerprc and add then next to the original record.
For example if we take from my attached data:
time currency bidprc offerprc dt bidprc5min offerpc5mins
08:07:09 dem/chf 0.86760 0.86770 09JAN97:08:07:09 to be located to be located
I would like to locate above the record which is in at 08:12:09 or if no such record exists the one which is closest to that time.
Is it possible to perform the above search without creating two datasets and e.g using proc sql which is efficient and quick?
Any ideas on how to program the above are more than welcome
07-24-2014 11:59 AM
Does the data ALWAYS have a record at exactly +5 minutes (300 seconds)?
If so something like this may work
create table newdata as
select a.*, b.bidprc as bidprc5min, b.offerprc as offerprc5min
from have as a left join have as b
on a.time = (b.time -300);
07-24-2014 12:29 PM
I will check if the above works. However there isnt a record always at 300 seconds and I would like to make sure that there is always a record in the new tables, so it would be best if the code you proposed can locate the next closest record e.g in 301 seconds or more....
Is it possible to do this?
07-25-2014 08:16 AM
The code you provided generates additional records. The sample has 50000 but the newdata has 68350...any ideas why is that?
also if i run the code on my whole dataset (>1 million records) i get many empty ones, where values have not been located. is it possible to change the code as below so that i can get results for all my records?
on a.time >= (b.time - 300)
07-25-2014 09:24 AM
if you want the next closed record ,why not fetch the next obs ? if your table has already been sorted by DT .
proc import datafile='c:\temp\fixing sample.xls' out=have dbms=xls replace;run; data want; merge have have(firstobs=2 keep=bidprc offerprc rename=(bidprc=bidprc5min offerprc=offerprc5min)); label bidprc5min=' ' offerprc5min=' '; run;
07-28-2014 04:56 AM
the task is to find the record which is 5min ahead in time or if there is no such record then the one which is closest to 5mins....
The code that Ballardw provided was simply an inner join which works only if the data have ALWAYS a matchning record....which is NOT the case so I need some kind of an sql search algorithm...
Any ideas will help
07-28-2014 05:21 AM
Can you assign each row to a grouping variable, for instance actual time vs planned timepoint.
ACTUAL_TIME BASE_TIME GROUP_ASSIGNED GROUPN
12:34 12:34 BASELINE 1
12:36 12:34 BASELINE+5 2
12:37 12:34 BASELINE+5 2
12:43 12:34 BASELINE+10 3
12:47 12:34 BASELINE+15 4
You can then group data into groups using the new variable irrespective of actual time, i.e;
from HAVE FIRST
left join HAVE NEXT
08-01-2014 03:50 AM
well I am nost sure how this will work as the orginal dataset will have 20million rows...re arranging the whole dataset into groups will be difficult...a search algorithm will work easier however I dont know how to program it....
08-01-2014 04:12 AM
Hi Neo, I realy suggest you to try the above mentioned approach. It runs very fast: 0.04 sec on the attached data. It needs only one pass through the data. 20mill rows will be not a problem.
With some more programming it can be extended to support different lookahead and lookback times. Or to support different currencies or more trading days.
08-01-2014 10:10 AM
by above you mean the one you proposed?
I will run it and let you know, however by looking at the code I am almost sure that some records will be empty/unmatched since the records are not at regular intervals, it means there might be no record exactly at 5min ahead but 5minutes and 5 seconds and the code will need to check is there is no record to pick the next closest,in this case at 5mins and 5 seconds....I will let you know as soon as I check yours
08-02-2014 08:37 AM
Yes Neo, I mean my post. It is not a simple merge or a simple interleaving. Look at it carefully. If there is no matching record exactly at 5 minutes, it will use the nearest before 5 minutes. Exactly as you described.
I suggest to look at it for everyone! The beauty of this technique is that it is the most natural approach. You would do the same using pen and paper.
The same approach can be used to handle a lot of problems where you have time stamped data (transactions, dimension tables, SCD2, etc.). Calculating moving averages, moving correlations, "merging" dimension tables and/or fact tables, look-back and look-ahead and more.
It works for regularly and irregularly timestamped data too. It works it there are more transactions on the same timestamp - one of them is chosen (the last one physically).
Is is extremely fast (basically one pass through the data - what could be faster?). There is only one variant of it, which might be slightly faster, but it is a bit more complicated code.
07-28-2014 05:31 AM
Can you change your logic?
It is a bit of strange thinking for future events that has happened already. It could be more easier thinking on calculations that have been happened in the paste.
That is more like stream processing. In that case the whole issue could be easier technical to solve (lag/retain) in am more generic way.
Validations on predictions (modeling) is a possible reason for your question.
08-01-2014 03:51 AM
i am doing both looking in the past but also tryng to 'validate' trading strategies...this is why I want to 'fetch' the records which are ahead (leading) in time fox x minutes
07-28-2014 07:38 AM
Dataset should be sorted by time.
data have_minus5 / view= have_minus5;
time=time-'00:05:00't;/*or use intnx()*/
rename bidprc=bidprc5min offerprc=offerpc5mins;
set have_minus5(in=inH5 keep=time) have(in=inH keep=time);
if inH then do;
if inH5 then do;
Sorry, this is not tested. Tested. And yes, this is like stream processing. ~ one pass through the data.
Be careful when copying apostrophes from this page, they are converted to something else.
I have changed this set statement in the code: set have(in=inH keep=time) have_minus5(in=inH5 keep=time);
It matters which dataset comes first. By changing dataset order in this set statement, you have a very fine control, over handling equal timestamps.
Message was edited by: Gergely Bathó
07-28-2014 10:18 AM
Here is . But it cost me two minutes to run it. Let me know if it could worked.
proc import datafile='c:\temp\fixing sample.xls' out=have dbms=xls replace;run; proc sql; create table want as select a.*,(select bidprc from have where time ge a.time+300 having time eq min(time) ) as bidprc5min , (select offerprc from have where time ge a.time+300 having time eq min(time) ) as offerprc5min from have as a ; quit;