Combine datasets based on inexact time

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 14
Accepted Solution

Combine datasets based on inexact time

Dear all,

I have two data sets containing stock trades, theoretically they are about exactly the same trades occurred at the same time (but contain different information about these trades). I wanted to combine the two sets to get the information from both sets.

However, there could exist a small time discrepancy of 1millisecond or more in the two sets. e.g.: (I have attached a small part)

Set A

trade_time

10:10:00.402

10:10:00.404

10:10:00.404

10:10:00.409

Whereas in Set B

Time

10:10:00.402

10:10:00.404

10:10:00.405

10:10:00.409

Set A and Set B are about the exact four trades only sometimes with 1 millisecond (or more) discrepancy.

I attempted to combine the two sets with a code such as:

proc sql;

create table combine as

select *

from A left join

       B

on   B.Time - dhms(0,0,0,0.001) <= A.trade_time <= B.Time + dhms(0,0,0,0.001) and

      A.Code = B.Code and            /* this is just additional constraint needed not too relevant to the question*/

      A.trade_price = B.Price;         /* this is just additional constraint needed not too relevant to the question*/

quit;

But this will create additional trades (rows) and the end result would be like:

Combine

trade_time               Time

10:10:00.402            10:10:00.402

10:10:00.404            10:10:00.404

10:10:00.404            10:10:00.405      

10:10:00.404            10:10:00.404

10:10:00.404            10:10:00.405

10:10:00.409            10:10:00.409

I know that it is logically unlikely to have a perfect one-to-one combine, but can I fix the number of rows in Set A and combine Set B into Set A?

This means even if one row in set A is matched with multiple rows in set B, I would like SAS not duplicate the row from set A but only combine one of the matched row combinations.

Regards,

Hao

Attachment
Attachment

Accepted Solutions
Solution
‎04-27-2013 04:01 AM
Super Contributor
Posts: 644

Re: Combine datasets based on inexact time

@Hao

I've now had a chance to look at your data and may be able to help a little more.

Your dataset A has a sequence number, which in combination with the timestamp may become a key to matching

I note that for the stock and the date/time chosen the average number of complete trades is 0.138 per second, ie the average gap between trades is about 7.3 seconds; however the minimum gap is only 22 milliseconds.

On doing a manual match for the data I note that the difference in timestamps is at most 2 milliseconds, although this sample might be too small to call that definitive.

Here is what I suggest you do:  map the sequence number into table B using a fuzzy match on the timestamps (which is what you were asking for anyway, but I think I now see where it can work for the majority of trades).

My suggestion is that you round the timestamps to the nearest 10 milliseconds and look for a match: if that fails, add 5 milliseconds to each before rounding to catch the odd case where .004 rounds down while .005 rounds up.

The code would look something like this:

Proc SQL ;

     create table match as

          select  distinct

                     tbA.sequence

               ,     tbB.timestamp

          from

               (select distinct

                         sequence

                    ,     timestamp

                    ,     round (tmestamp, .01) as fuzzy1

                    ,     round (tmestamp + .005, .01) as fuzzy2

               from     A

               )     tbA

          ,     (select distinct

                          timestamp

                    ,     round (tmestamp, .01) as fuzzy1

                    ,     round (tmestamp + .005, .01) as fuzzy2

               from     B

               )     tbB

          where      tbA.fuzzy1     =     tbB.fuzzy1

               OR     tbA.fuzzy2     =     tbB.fuzzy2

          Order

               by     sequence

          ;

You need to test that the resulting table contains all the sequence numbers in table A (if not, some timestamps will be more than 5 milliseconds apart: you can either exclude these from your analysis, or interpolate).  Also check whether there are duplicates, more than one timestamp from B matching a sequence number; in which case you might try reducing the rounding to the nearest .008 and adding .004 for fuzzy2.

The match table can now act as a bridge between table A and B.  However you are only half way there because the following join

          create table joined as

               select     A.*

                    ,        B*

               From     A      A

                    ,      B      B

                    ,     match M

               Where A.sequence = M.sequence

                    and B.timestamp = M.timestamp

               ;

will give you a cartesian product for each sequence number:  if there are 5 records in A matching 4 records in B you will get 20 records in your result.

I suspect you will need to do the join in a datastep where you analyse all the records per sequence number before setting the records form B for analysis and comparison.

Hope all this helps

Richard

.

View solution in original post


All Replies
Super User
Posts: 5,435

Re: Combine datasets based on inexact time

The reason for duplicates is the nature of SQL, it's columns oriented. You seem to have a more row-oriented problem, so I think it's hard to make this work using SQL alone.

The whole problem is difficult as well, you don't have any other key than the timestamp?

About the 10:10:00.404 rows in table a, how will you know which of the rows in table b match which row in a?

Can you rely on the initial sort order of the tables? Is there a strict 1-1 relationship between the tables, or is there rows missing in a or b for some cases?

Data never sleeps
Occasional Contributor
Posts: 14

Re: Combine datasets based on inexact time

Thank you for your help, to answer your questions:

you don't have any other key than the timestamp?

I have stock code, time stamp, price, volume

how will you know which of the rows in table b match which row in a?

Price and volume, but volume is not a exact measure because in Set A volume could be 1700 whereas in Set B volume could be two rows of 1000 and 700 (if I combine the 1000 and 700, I might lose information from either row).

Can you rely on the initial sort order of the tables? Is there a strict 1-1 relationship between the tables, or is there rows missing in a or b for some cases?

I do not think so, but I do sort them by Code>Timestamp>price>volume. It is not strict 1-1 due to the issue with volume and possibly something else i don't know.

I would like to add, my research design allow small mismatch in the two sets. I really just want to fix the length (hence the aggregate volume of transaction) of Set A and just combine as much from Set B as possible.

Super Contributor
Posts: 644

Re: Combine datasets based on inexact time

I'm guessing you have buyer info in one file and seller info in the other, but what is missing is the connection between them, ideally the buyer and seller code on the buyer file and the seller code on the seller file.  More than one seller might be needed to fulfil a buy order, but presumably a single sell would not fulfil multiple buyers?  If so you have a many to one situation.

In the absence of this missing information sequence data is all you have.  I suggest you re-examine your original data and determine whether it is strictly sequential, in which case sorting it is a mistake.

Also examine whether the timestamp on a sell can ever be less than the timestamp on a buy, which would increase your problems.

I assume stock code and price should match while timestamp and volume are problematical.

I wonder if you could use proc compare at least to identify and extract the one to one matches, using code price and volume but without changing the sequence.  These could be set aside while you work on the remainder.

A datastep merge with notsorted byvariables would be ideal but I don't think SAS supports it, so I'm thinking that a datastep that sets the buy record, then sets one or more sell records with matching codes and timestamps within n milliseconds until the buy amount is reached might work.

However i suspect there will still be some sales where the seller timestamp is more than a few milliseconds different, if the sell volume on that stock is los.


Hope these musing help.


Richard


Occasional Contributor
Posts: 14

Re: Combine datasets based on inexact time

Posted in reply to RichardinOz

Thanks for the helpful comments, I will check if it is strictly sequential.

The data comes from two different sources, set A contains Bid and Ask sides whereas set B contains only the initiator of a transaction.

However, I am not sure how to code your following comments:

so I'm thinking that a datastep that sets the buy record, then sets one or more sell records with matching codes and timestamps within n milliseconds until the buy amount is reached might work.


Nevertheless, my core problem is the rather bizarre (to me as a beginner) characteristics of proc sql and merge function..

I think my goal can be achieved using other software by conditionally looping Set B Time on top of Set A Time, and combine when the condition met. This way the rows in Set A will not be duplicated...



Super Contributor
Posts: 644

Re: Combine datasets based on inexact time

I see I over simplified by ignoring a transaction could result from an offer to sell as well as an offer to buy. 

I'm not sure what you mean by "the rows in Set A will not be duplicated" because a single bid or ask could result in multiple transactions

Here are some further musings

  1. If, as I hope, your original data is in sequence you should preserve the sequence by adding a sequence number to each table.  You can then sort by stock, volume, timestamp and sequence.
  2. Removing all exact matches (where a bid or ask was satisfied in a single transaction, with matching timestamps), might make the rest of the jigsaw easier to solve.  I would suggest a data step merge because it does not create a cartesian product, but with a test to eliminate many to one matches if one table has more than the other for a given combination.  The problem is that conceivably multiple bidders or askers might be looking for transactions of popular stocks with the same volume at the same time - you could investigate how big a problem that was, and whether it will be a problem if 'mismatches' for multiple simultaneous identical transactions do occur.  The sequence numbers will identify which records to flag or move out of the original tables
  3. I think there is a partial solution for the remainder using set statements to do the looping you suggest but I worry that there might be overlapping Bid/Ask records for the same stock at the same time each involving multiple transactions.  Consider two bids for the same stock at the same time each for 1000, fulfilled by transactions for 500, 400, 600 and 500.  the first and second transactions are too few, while the first 3 are too many to match the first bid.  I can envisage a solution involving a macro for each bid/ask record where matching transactions are removed or flagged in table B before the next table A record is processed.  Not a trivial task



Richard


Occasional Contributor
Posts: 14

Re: Combine datasets based on inexact time

Posted in reply to RichardinOz

Dear Richard,

I have tried your first suggestion, it was unfortunately not the case.

The second suggestion is very practical, I am implementing as we speak.

However, the mismatch in timestamp is very pronounced hence I have to consider your third suggestion despite the difficulty as you have just pointed out. I will give myself this weekend to learn macro........

'Consider two bids for the same stock at the same time each for 1000, fulfilled by transactions for 500, 400, 600 and 500.'

This problem will not be maiming my result since I only need the information of the initiator of each trade. At millisecond resolution, there will be only one initiator most of the time (My manual check confirms 100% in a sample of approx 50 observations).

So, at a single millisecond, the result usually will be: 1. 1000 Ask coming in to take a 400bid and a 600bid.

                                                                             2. 1000 Bid coming in to take a 300Ask , a 200Ask and a 500Ask

Now the trick is, the initiator identifier is in Set B whereas the Bid Ask file is in set A, therefore, I have to loop A against B to bring in the initiator identifier from B to A. e.g.

(Pseudo Code)

Do A.Time;

  Do B.Time;

     If  A.code = B.code and A.price = B.price and B.Time - 0.005s <= A.Time <=B.Time + 0.005s then A.intiator = B.initiator;

  end;

end;

I am aware that A.initiator will inevitably be overlapped by multiple B entries, but I am prepared to accept such mismatch

Hao

Super Contributor
Posts: 644

Re: Combine datasets based on inexact time

I'd like to be more help to you but at the moment I do not have access to a functioning SAS setup.  I should do, from Tuesday, si if you are able to post a few thousand records from each table, covering the same period (preferably as a text file so i can peek at the data this weekend) I might be able to be more help.

Richard

Occasional Contributor
Posts: 14

Re: Combine datasets based on inexact time

Posted in reply to RichardinOz

I have uploaded SAS files in my original post, but I do not think I can add attachments in a reply.

But I am extremely grateful for your offer, have a nice weekend.

Hao

Solution
‎04-27-2013 04:01 AM
Super Contributor
Posts: 644

Re: Combine datasets based on inexact time

@Hao

I've now had a chance to look at your data and may be able to help a little more.

Your dataset A has a sequence number, which in combination with the timestamp may become a key to matching

I note that for the stock and the date/time chosen the average number of complete trades is 0.138 per second, ie the average gap between trades is about 7.3 seconds; however the minimum gap is only 22 milliseconds.

On doing a manual match for the data I note that the difference in timestamps is at most 2 milliseconds, although this sample might be too small to call that definitive.

Here is what I suggest you do:  map the sequence number into table B using a fuzzy match on the timestamps (which is what you were asking for anyway, but I think I now see where it can work for the majority of trades).

My suggestion is that you round the timestamps to the nearest 10 milliseconds and look for a match: if that fails, add 5 milliseconds to each before rounding to catch the odd case where .004 rounds down while .005 rounds up.

The code would look something like this:

Proc SQL ;

     create table match as

          select  distinct

                     tbA.sequence

               ,     tbB.timestamp

          from

               (select distinct

                         sequence

                    ,     timestamp

                    ,     round (tmestamp, .01) as fuzzy1

                    ,     round (tmestamp + .005, .01) as fuzzy2

               from     A

               )     tbA

          ,     (select distinct

                          timestamp

                    ,     round (tmestamp, .01) as fuzzy1

                    ,     round (tmestamp + .005, .01) as fuzzy2

               from     B

               )     tbB

          where      tbA.fuzzy1     =     tbB.fuzzy1

               OR     tbA.fuzzy2     =     tbB.fuzzy2

          Order

               by     sequence

          ;

You need to test that the resulting table contains all the sequence numbers in table A (if not, some timestamps will be more than 5 milliseconds apart: you can either exclude these from your analysis, or interpolate).  Also check whether there are duplicates, more than one timestamp from B matching a sequence number; in which case you might try reducing the rounding to the nearest .008 and adding .004 for fuzzy2.

The match table can now act as a bridge between table A and B.  However you are only half way there because the following join

          create table joined as

               select     A.*

                    ,        B*

               From     A      A

                    ,      B      B

                    ,     match M

               Where A.sequence = M.sequence

                    and B.timestamp = M.timestamp

               ;

will give you a cartesian product for each sequence number:  if there are 5 records in A matching 4 records in B you will get 20 records in your result.

I suspect you will need to do the join in a datastep where you analyse all the records per sequence number before setting the records form B for analysis and comparison.

Hope all this helps

Richard

.

Occasional Contributor
Posts: 14

Re: Combine datasets based on inexact time

Posted in reply to RichardinOz

@Richard

I love your idea about defining a unique ID for each trade, this really allowed me to do what I wanted.

I used Proc SQL much similar to what you have suggested and then got rid of the overlapping entries via the unique trade identifier you have suggested.

The only difference is that instead of using fuzzy times, I record time difference between A.time and B.time and choose the smallest difference whenever there is a overlap.

I have allowed for +- 10ms time difference and got a 99.6% match. I also double checked summary statistics for volume etc pre/post match. They are all matched perfectly.

This has been a blast.

Thank you Richard.

Hao

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 914 views
  • 3 likes
  • 3 in conversation