Re: How to perform a complex merge/matching procedure from two dataset...

machete · Posted 02-12-2013 12:39 PM

Dear all,

I am posting two new questions which are based on an earlier inquiry which was partly answered regarding a complex merging procedure:

https://communities.sas.com/thread/39361?start=0&tstart=0

The two new questions are based on the provided solution by Ksharp:

1. The basic solution shown also below matches the same currencies based on a specific date and time interval. If there exists no exact match by time, the next closest time is selected to do the matching (thus time+1 as in code). What I would like is to update this code to perform the same procedure as before instead but now instead of having time A = time B (or time B+1) to have time A = time B+60 (or time B+60 + 1).

2. The solution is provided under hash format and when executing one currency for one day it works well. However my dataset has multiple currencies and 68 days of observations (in total 64 million of observations in the first dataset and 2mln in the second dataset) which makes the execution impossible by my pc. Do you have any other ideas on how efficiently to execute the hash code?

Any suggestions are very much welcome.

Many thanks for your time in advance

Best regards

Neo

Below the hash solution

libname'd:\phd thesis\sas files';

data neo.c2;

set neo.Chfdaily2;

datetime=input(catx(':',value_date,newtime),anydtdtm20.);

format datetime datetime.;

run;

data neo.c3;

set neo.Chfr;

datetime=dhms(datepart(Date_G_),0,0,int(Time_G_));

format datetime datetime.;

run;

data neo.want;

if _n_ eq 1 then do;

if 0 then set c3;

declare hash ha(hashexp:16,dataset:'c3');

ha.definekey('_ric2','datetime');

ha.definedata(all:'Y');

ha.definedone();

end;

call missing(of _all_);

set c2;

_ric2 =cur1cur2;

datepart=datepart(datetime);

do while(ha.find() ne 0);

datetime+1;

if datepart ne datepart(datetime) then leave;

end;

drop datepart LOC Price Volume;

run;

art297 · Posted 02-12-2013 05:38 PM

It will help others to try to answer if you can express what you want to accomplish a bit differently. I.e., without referring to KSharp's code, or references like 60+1, what are you really trying to accomplish?

machete · Posted 02-14-2013 02:40 AM

Hi,

So we deal with the following economic problem:

- we have two datasets. The first one contains the transactions of a broker with customers. the second dataset contains concluded transactions on the market (where only brokers have access but not customers).

- what i try to do is to compare the difference in the price the customer has paid to trade with the broker and the price which at that time was prevailing at the market.

Is it more clear now?

That is why we need to match from the two datasets the currencies if they are the same (since both datasets have transactions of several currencies), then look at the dates and match the same dates and as last step match the transaction times - they should also be the same. Now comes the tricky part:

1 Because the broker after transacting with the customer might not trade exactly the same time with other brokers but after 1 second, the first step of the solution as you see in the hash solution checks if time A does not euqal time B, it picks time B+1. This is the first part of the problem

2 Now imagine I want to directly match the transaction a broker executed with a customer with the price which was on the market (dataset B) 60 seconds after the customer transaction. However as before, there might exist no price in exactly +60 seconds, so the code must account for that and searchh for 60seconds + X, that is take exactly the nearest observation.

No2 of the problem as described above does not work in the hash solution

For No1 and potentially no2 the hash solution is not efficient as a method to execute a large dataset

Let me know if you need any additional info and look forward to your reply. Attached are also excel sheets with sample data

Thanks

Neo

art297 · Posted 02-14-2013 07:17 PM

No, not clear enough for me. If dataset A has the values that one paid at a particular point in time, and dataset B contains the prices at various points in time, I would think that for each record in dataset A you would want the record from dataset B that has a time that occured closest to but before the time reflected in the particular dataset A record.

machete · Posted 02-15-2013 03:41 PM

Let me me explain further:

Imagine there exist two parallel markets, market A (dataset A) where customers are allowed to trade with brokers(e.g banks) and market B (Dataset B) where only brokers are allowed to trade between themselves.

Now lets say we would like to test whether some customers are better informed than their broker. We check how much they paid when they transacted with the broker and then we check the price the broker paid when trading with other brokers, exactly the same time or later during the day depending on the investment horizon. A numeric example.

Customer A (is informed)

Buys from our broker (dataset A) an asset XYZ at $5 at 10.00am on 07.02.2013

Our broker sells to other brokers (dataset B) the same asset as $6 at 10.01 on 07.02.2013

Our broker also sells to other brokers (dataset B) the same asset at $7 at 11.30 on the 07.02.2013

What is the price impact of the customer (simply put the delta between the first trade with the broker and any of the other trades between the brokers?)

From the above we can analyze first of all whether the customer is more informed about future price changes than his broker(he knows the direction of price change, e.g buys before the price goes high) and of course up to what time horizon he has information. Is it intraday for a couple of minutes? hours? or it is days?

For this reason I need to program the matching code at different time horizons. This variable (price impact) will be my dependent variable

I hope this makes things clearer?

Neo

art297 · Posted 02-15-2013 05:22 PM

Not for me! You don't say what you think will constitute the "other trades" and you seem to be talking about specific brokers but, at the same time, generalizing to the market place. I would think, again, that you want the most recent broker price for the asset before the datasetA record, as well as all of those within a certain period after the trade.

If that is what you are trying to do, then you have to let us know and what that after period should be.

machete · Posted 02-16-2013 07:09 AM

Actually I would like to take the broker price executed at the same time as the record in dataset A.if not possible then use the broker price in time+1(the next closest). The other broker trades will have to be after the trade record in dataset A in specific time intervals. for example executed time in dataset A + 30 seconds or 60 or 90 etc. I would imagine the solution code as a standard solution then me being able to modify the periods of analysis(30 or 60 or 90 secods). this solution should deliver values in specific points in time.

Let me restate the problem in a similar way

.

1. I would like to match based on a trade for example EUR/USD (asset) from dataset A executed at a specific point in time (date and time), with a trade of EUR/USD (same asset as in dataset A) which is part of dataset B and is executed at the same point of time (date and time) with the trade in dataset A. If such trade to be matched at exactly the same point of time does not exist, to simply search and then match the closest trade available (e.g time +1). This is what the hash solution actually did.

That is, is there a more efficient code to perform the above?

2. The second exercise is to do exactly the same process as above, however this time I would like to match the trade in dataset B which is X seconds after the trade in dataset A. that means that we will match trade A at a specified time and date (lets say 10:00 16/02/2013 ) with a trade in dataset B which is the same date, but time+60 seconds (lets say 10:01 16/02/2013). if there does not exist a perfect match at time+60seconds to take the closest one, e.g time+60seconds+1(e.g 10:01:01 16/02/2013)

In the files attached there is also available a sheet which shows how the final matching should look like

Let me know how I can support further in explaining this exercise

Thanks

Neo

art297 · Posted 02-16-2013 09:44 AM

Neo,

Your 'want' file doesn't appear to match your description. From your description, I would have expected to find at least one record in your want file for every record that exists in dataset A. However, the first one there IIRC, was record number six. Why?

machete · Posted 02-16-2013 12:52 PM

Hi,

The want file is structured as follows: up to variable datetime (on the horizontal axis) these are dataset B variables and then follow dataset A variables. Datasets A and B follow the notation of our discussion from above

As you correctly observed the correctly matched observations start from observation no 6. The first 5 observations which are not matched I believe this is the case because the example files of dataset B which I provided was too short, thus did not had enough observations (i wanted to upload only a small part of the whole dataset). However the complete dataset B will not face such problems. For illustration, dataset A has 2 millions observations for the same time period which dataset B has 64 million observations.

art297 · Posted 02-16-2013 01:29 PM

My question probably wasn't sufficiently clear. What is the criterion for a record not matching? I ask because I would suggest code that gets all of the relevant data in one step. Within a minute? An hour? And, as for multiple matches, does that change the criterion or do you want all matches within a particular time period?

machete · Posted 02-16-2013 03:39 PM

Hi Art,

records need to fullfill the following criteria to be ,matched :

1. same currency(asset)

2. same date

3. same time

in general records sometimes do not match because time is measured up to the second (this is too detailed) and sometimes there exist no exact matches by the second in both datasets(this is why i would like the code to search for the next closest observation in terms of seconds.

in the example discussed above with the "want" file, the problem was that in dataset B there were no trades after 21:00 so for the trade executed in dataset A at 21:50 there was no matching trade in dataset B. this was because I added a short example (small sample) of the two datasets.

the above will net only one new variable (matched exactly by time or in the worst case matched by time+1)

the second matching exercise will search for a solution time in A = time in B+ 30 seconds, +60 seconds, 30min, 60mins, 720mins,24hrs (this should add 6 variables). I can though adjust the code from above and execute it 6 times to give each time one solution) . This solution will be more tricky so lets discuss in due time.

art297 · Posted 02-16-2013 04:05 PM

Neo,

You're getting closer but, like I said, I would suggest getting everything done in one run. Does time+1 equal time+1 second?

If so, then what I would suggest is outputting the info for all six matches (i.e., the average price for those records which match:

(1) time <=time+1second

(2) time+1 second <= time+30 seconds

(3) time + 30 seconds <= time+30 minutes

(4) time+30 minutes <= time+60 minutes

(5) time+60 minutes <= to,e+720 minutes

(6) time+720 minutes <= time+24 hours

Would that suffice?

machete · Posted 02-16-2013 04:52 PM

I refer to Time+1 as the closest observation next to time. That is, if time in dataset A does not equal time in B then +1 in my wording meant simply the next closest observation, this can be 1 second or 10 seconds or 2mins,dependng on dataset B.Some currencies (assets) are more liquid than others, that means the next closest observation might be after one second, but for currencies of small countries the next closest trade might be in 2 mins as these currencies are not trade so often.

the base for all the matches will be time in dataset A, I modified slightly what you wrote:

(1) time(in dataset A) <=time(in dataset B)+1second

however for the next observations things change:

(2) time <= time+30 seconds

(3) time <= time+1 minutes

(4) time <= time+30 minutes

(5) time <= time+60 minutes

(6) time <= time+24 hours

as you can see mathematically i have the price from dataset A and the development of the price in time based on data available in dataset B.

at the end i will get a row e.g

(dataset A) (dataset B)

trade executed price time price2 time2 price3 time3 ........... price6 time 6

1 1.35 22:36:00 1.37 22:36:00 1.38 22:36:30 1.36 22:36:00 (however this is after 24 hrs - the date has changed)

2 ............................

3 ..........................

.....

2mln ............................

I think it is more difficult to do all matches as once. This is shown in my example. Have a look at price 6/time 6. As this query is conditional:

a/ on the currency variable (that both currency codes are the same in dataset A and B)

b/ on the date

However if we search for a trade + 24hrs or in the above example +2 hours based on the trade which is executed at 22:36 then the dates will be not the same. I am not good in programming but I think this will be tricky to solve as the basic solution is conditional on that the dates are the same. if we do not condition on date in the beginning, the solution will match based on time and net wrong results because the dataset is a time series of 3 months (total of 68 working days) thus practically we will have an exact solution at 22:36 for 68 observations contained in dataset B.

Hope I did not confuse you with the above

art297 · Posted 02-16-2013 06:24 PM

Neo,

Since you have both date and time in one file, and a datetime field in the other, I would just combine the two in the first file to have a datetime field in both. That way, you only have to compare datetime differences and whether they cross dates would be irrelevant.

If I can find the time to offer some code, it will be to solve everything at once. How you use it, of course, would be your decision.

Your requirement for the initial "match" should change, I think, the rest of the ranges. I.e., if first match is the closest time, then I would think that you would want the ranges to appear something like:

(1) datetime(in dataset A) <= the closest datetime in dataset B)

new_base_datetime = the time found in this step

(2) new_base_datetime+1/100th second <= next datetime(s) in datasetB<=new_base_datetime+30 seconds

(3) new_base_datetime+30 seconds+1/100th second <= next datetime(s) in datasetB<=new_base_datetime+1 minute

(4) new_base_datetime+1 minute+1/100th second <= next datetime(s) in datasetB<=new_base_datetime+30 minutes

(5) new_base_datetime+30 minutes+1/100th second <= next datetime(s) in datasetB<=new_base_datetime+60 minutes

(6) new_base_datetime+60 minutes+1/100th second <= next datetime(s) in datasetB<=new_base_datetime+24 hours

Would that work or do I still not understand what you are trying to do?

machete · Posted 02-16-2013 07:00 PM

Hi Art,

What you mention rgd datetime makes sense then this should not be a problem.

The solution will be actually easier than what you propose. The base time is fixed. That means for all the matches To (starting time) is constant and it is the datetime of a transaction in dataset A. we then simply match this one with those in dataset B which are +30seconds, +1...+24hrs further in time. as you have written it above equations (1) to (6), the left handside should be always datetime (in dataset A)

in summary:

(1) datetime(in dataset A) <= the closest datetime in dataset B)

(2) datetime(in dataset A) <= database B_datetime+30 seconds

(3) datetime(in dataset A) <= database B_datetime+1 minute

(4) datetime(in dataset A) <= database B_datetime+30 minutes

(5) datetime(in dataset A) <= database B_datetime+60 minutes

(6) datetime(in dataset A) <= database B_datetime+24 hours

does it make sense?

all the above conditioned on currency in database A = currency in database B (this is a text variable, i am not sure if this makes a difference for the code)

i am not sure if we need this part you provided in all the equations:

<= next datetime(s) in datasetB ?

I have the feeling we are close to have the solution described in a logical order.

Thanks

Neo

How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Re: How to perform a complex merge/matching procedure from two datasets (Part B)

Registration is open

SAS Training: Just a Click Away