Solved: How to perform a complex merge/matching procedure from two datasets

machete · Posted 12-09-2012 01:01 PM

Any support on this intellectual challenge is very much appreciated

The project:

My project deals with high frequency (tick-by-tick) financial data

The task:

I would like to merge and match two datasets. The first dataset has 2mln observations (financial transactions) and the second has 64mln observations. From the larger dataset I would like to match based on two specific variables (date and time). This means that at the end I will only add 2mln out of 64mln observations as a new variable to the first dataset.

The problem:

The problem I am facing is related to the fact that if no exact match between the “time” variable exists then I would like to take the next closest observation.

In more detail what Iwould like to do:

a- If variable A (currency code) in dataset A equals variable B (currency code) in dataset B, then check if date in dataset A equals date in dataset B, if correct, then check time in dataset A, if time in dataset A equals time in dataset B then (to avoid that there are variables which are not exactly matched because time is measured to the millisecond, if time in dataset A does not equal time in dataset B then take from dataset B the variable which is next closest in time of dataset A, that is t+1) if in dataset A variable C (flow which takes values of 1 or 0) equals 1 then add in dataset A variable D from dataset B, else; add in Dataset A, variable E from Daraset B.

b- Same process as above but this time instead of exact matching based on time, I would like to (match) take the variable that is +30 seconds further in time

Hope you will be able to provide some feedback, I know this not an easy task, I have tried to tackle this issue with colleagues for a couple of weeks now.

Thanks

Neo

Ksharp · Posted 12-17-2012 12:21 AM

I think it is easy for Hash Table.

libname x v9 'd:\software';
data  c2;
 set x.Chfdaily2;
 datetime=input(catx(':',value_date,newtime),anydtdtm20.);
format datetime datetime.;
run;
data c3;
 set x.Chfreuters3;
 datetime=dhms(datepart(Date_G_),0,0,int(Time_G_));
 format datetime datetime.;
run;
data want;
 if _n_ eq 1 then do;
 if 0 then set c3;
  declare hash ha(hashexp:16,dataset:'c3');
   ha.definekey('_ric2','datetime');
   ha.definedata(all:'Y');
   ha.definedone();
 end;
call missing(of _all_);
set c2;
_ric2 =cur1cur2;
datepart=datepart(datetime);
do while(ha.find() ne 0);
 datetime+1;
 if datepart ne datepart(datetime) then leave;
end;
drop datepart;
run;

Ksharp

View solution in original post

PGStats · Posted 12-09-2012 05:11 PM

A small sample of those datasets would help.

PG

art297 · Posted 12-09-2012 06:01 PM

And, in addition to samples of those datasets, it would help to know what kind of system you are on, which version of SAS you are running, and how big each of the two files really are (both in terms of numbers of records and filesize).

TomKari · Posted 12-10-2012 02:15 PM

Hi, and

Isn't this identical to the problem that you, Pierre, solved so brilliantly in https://communities.sas.com/message/146311#146311 (What would be hash equivalent for this sql join?)

Tom

art297 · Posted 12-10-2012 02:26 PM

: It sure sounds like it might be, but we won't know until the OP provides some more information. I posted my response to the OP's question because, in the case of the question I had asked in the thread you mentioned, the actual solution that worked incorporated Pierre's brilliant approach within a somewhat complex design that used a combination of arrays and pointers.

machete · Posted 12-10-2012 03:49 PM

Hi to everybody,

Thanks for your interest to my request

I will try to respond to all of the above questions:

1. I am using SAS 9.3

2. The dataset A has 2mln observations and a size of 23GB and dataset B 64mln observations and size 8GB (they are probably so large because I have no idea how to optimize them, i had dataset A running under another software and it was much smaller)

3. A bit more information on the problem. Dataset A contains 2 mln transactions of a brokerage company on the foreign exchange market during 90 days on an intraday frequency. That is, the transactions of Brokerage XYZ with different customers for 90 days for a specified number of currencies. Dataset B contains 64mln transactions of fx transactions occured on the whole market between all conterparties during those same 90 days. What I would like to do is at a first step to identify for all broker trades(lets say the company i am studying), what is the difference between the price the brokerage transacted with a customer and the price which was prevailing on the market at that point of time. To make things clear assume the brokerage sold EUR/USD at a rate of 1.3150 on the 10.12.2012 at 09:30:45. I would like to look up the EUR/USD price in dataset B on the 10.12.2012 at 09:30:45. If there is no price available at 09:30:45 then the closest available which might be lets say 09:30:50. I would like then to merge the two sets as next to the market price in dataset B there are some more useful variables that I would like to add to the first dataset.

To perform the above I need to match the trades from dataset A to the trades from dataset B at exact times and if no exact match is possible the closest one. At a second stage I would like to match at a specific point in time, e.g time at dataset A+1min, then A+30mins and so forth.(the idea here is to study how the difference between the two prices (market and brokerage price) changes over time)

4. I will need to check whether I can upload any sample of the data as I have signed a confidentiality agreement related to my phd research and the data provider I know it is difficult to fully understand the problem without seeing the data so i will check to see if I can show the basic structure/content here.

5. Somebod mentioned the execution time of a proposed command. I am not looking for an efficient solution so time here will not be an issue, I simply need to solve this programming challenge. Since I am new with SAS and have no advanced skills, working with the mentioned dataset takes me around 15-20mins per command to be executed so I am used to it

6.One more thing. Here is the date structure from the two sets, any idea how to eqalize the date formats?

Dataset A - date variable(dd/mm/yyyy): “05/01/2012” and time variable (hh:mm:ss): “08:45:24”

Dataset B - date variable: “21MAR12”, time variable: “08:45:24”

Thnx

Neo

LinusH · Posted 12-10-2012 05:00 PM

If your tables both are Base SAS, and your data and time columns are numeric - don't concern about formats. They are just for display, and different formats doesn't affect join criteria etc.

About the +1 and +30 logic. Just rerun the original query, but with difference that you add 60 (it's number of seconds) to a.time, and then 1800 (no seconds for 30 minutes).

I guess that an hash table will solve this, but I wouldn't recommend that technique for a not so experienced SAS programmer.

Data never sleeps

Astounding · Posted 12-10-2012 05:14 PM

If you prefer a non-hash solution, a related question would be this. Is the 64M data set sorted? If so, in what order?

art297 · Posted 12-10-2012 05:34 PM

When you say 64 min are you saying 64 million? SAS will run fastest if the two files only contain the minimal amount of information necessary to accomplish the required task.

Given the background details you have provided the real data isn't necessary. For people here to help, just create two sample datasets, with about 2 records in the one and about 20 records in the other (with both only containing the minimum number of variables needed to produce the desired result), as well as the file you want resulting from the program you are asking for help in creating.

Were the date and time fields input as text or as SAS dates and SAS times?

From your description, I think you will want/need a solution much like the one we talked about in the thread that Tom mentioned.

machete · Posted 12-13-2012 05:14 PM

I am not sure I get the difference between a hush and a non hush solution as well as to what is a base SAS table
The large dataset has 64million observations. They are sorted based on currency pair. That is I have all the records of the AUD/USD for 90 days, then follows CHF/USD for 90 days and so forth….
I uploaded a sample of the brokerage dataset (in our discussion dataset A) containing only the trades of 20^th of April and only concerning one currency pair, EUR/CHF. If you have any questions they are welcome. Basic var description which are included: deal id, date, time, spot rate, amount/volume, flow (buy or sell). I have dropped at least 10 variables which are not important for the question we are currently discussing.
I can upload a similar sample with date and currency data from the “market” dataset B. However I have a problem filtering the data because I have the following date format:

'20APR12 00:00:00' and I need first to create a date variable containing the first 7 characters. I tried with the “substr” function but it delivers a number as a result and not “20APR12”. If you have any suggestions here I would be grateful, then I will upload a sample of this dataset.

5. All date and time formats I had in csv or MS access format before uploading to SAS. I believe these were text formats

6. Tomorrow also to follow the sample result sheet that A.Tabachneck mentioned

art297 · Posted 12-13-2012 06:31 PM

It is difficult to recommend a solution without seeing the other file and knowing more about the file you provided.

As for the data being a SAS table, your example and explanation indicate that both files are now SAS files.

Hash (not hush) is a method that may or may not be applicable. At this point, I can't say, but doubt it given the number of records that you have.

In the file you've already provided, all of the dates and times were imported as 255 character variables. My guess is that the ultimate recommendation will be to combine the two fields as a SAS datetime field. That way, you can apply a format that will display them as desired, but still allow a program to access them as one field.

It will definitely help to understand more about the example file you already provided, as well as see an example of the other file. Don't worry about the ultimate format for now.

LinusH · Posted 12-10-2012 07:04 AM

We need some more rules...

Is it so that b.time always should be same or later (not before) a.time?

Do you have a performance constraint (that, should this job be executed multiple times, and within some kind of time frame)?

This example code will probably have long execution time, be sure to have the appropriate indexes defined in both tables.

select a.*, (select time

from b

where a.currCd = b.currCd and

a.date = b.date and

a.time <= b.time

having b,time = max(b.time)

)

from a

;

Data never sleeps

Ksharp · Posted 12-13-2012 09:08 PM

The best way to elaborate your question is to post a smaill sample data(as simple as it could be) and the result you want to see and at the same time explain the logic you have .

Ksharp

machete · Posted 12-16-2012 05:47 PM

Hello

I will try to answer to the questions from above

I uploaded also a sample of the second dataset as well as an excel sheet showing an “example”output of how I would like the data to look after the merge has been implemented.

Below a description of the datasets

DATASET A : Contains our brokers' trades with customers

Deal_Id – a unique deal identifier, i.e the number of each trade

Cur1Cur2 – the currency pair

Amount1Cur1 – the volume of each trade

Spot_Rate – the price/exchange rate at which the deal is executed

Value_Date – the date when the deal is executed

Cpty – the customer with which our broker trades

Flow – it takes values of 1 and -1 if the broker is buying or selling respectively

Newtime – the time of day when the trade was executed in hh:mm:ss structure

RELEVANT DATA for our problem are only value date,cur1cur2, and newtime

DATASET B: Contains all the trades executed on the market – That is it provides information on the prices executed on the whole market, i.e prices offered by other brokers

_RIC – the currency pair

Date_G_ - the date of the trade

Time_G_ - the time of day of the trade

GMT_Offset - not relevant

Type - not relevant

Ex_Cntrb_ID - who is trading (here I extracted prices by only one broker to minimize the size of the dataset)

LOC – not relevant

Price – not relevant

Volume – not relevant

Market_VWAP – not relevant

Bid_Price Price offered for the trade (if customer wants to buy)

Bid_Size – not relevant

Ask_Price Price asked for the trade (if customer wants to sell)

Ask_Size – not relevant

Qualifiers – not relevant

Bid_Yld – not relevant

Ask_Yld – not relevant

Ric_2 – not relevant

_ric2 – the currency pair in comparable format to DATASET A

Midquote – the difference of the bid with the ask price divided by 2

DATA RELEVANT: for our problem relevant are _ric2 has to be the same as cur1cur2, then date_G_ has to be the same as value date and then newtime has to be the same as time_g_. then as a result of add to each the corresponding information of dataset b next to dataset a.In case newtime is not the same as time_g_ please match with the next closest observation of time_g_.

I think this should make things a bit more clear of what i am trying to do in this exercise at a first stage.

THE FINAL OUTPUT EXAMPLE (attached excel sheet): Have in mind that this example shows only the case where no exact match by time will be possible and the next closest value will be taken.

Thanks and a fruitful week to everybody

Cheers

Neo

Ksharp · Posted 12-17-2012 12:21 AM

I think it is easy for Hash Table.

libname x v9 'd:\software';
data  c2;
 set x.Chfdaily2;
 datetime=input(catx(':',value_date,newtime),anydtdtm20.);
format datetime datetime.;
run;
data c3;
 set x.Chfreuters3;
 datetime=dhms(datepart(Date_G_),0,0,int(Time_G_));
 format datetime datetime.;
run;
data want;
 if _n_ eq 1 then do;
 if 0 then set c3;
  declare hash ha(hashexp:16,dataset:'c3');
   ha.definekey('_ric2','datetime');
   ha.definedata(all:'Y');
   ha.definedone();
 end;
call missing(of _all_);
set c2;
_ric2 =cur1cur2;
datepart=datepart(datetime);
do while(ha.find() ne 0);
 datetime+1;
 if datepart ne datepart(datetime) then leave;
end;
drop datepart;
run;

Ksharp

How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away