Any support on this intellectual challenge is very much appreciated
The project:
My project deals with high frequency (tick-by-tick) financial data
The task:
I would like to merge and match two datasets. The first dataset has 2mln observations (financial transactions) and the second has 64mln observations. From the larger dataset I would like to match based on two specific variables (date and time). This means that at the end I will only add 2mln out of 64mln observations as a new variable to the first dataset.
The problem:
The problem I am facing is related to the fact that if no exact match between the “time” variable exists then I would like to take the next closest observation.
In more detail what Iwould like to do:
a- If variable A (currency code) in dataset A equals variable B (currency code) in dataset B, then check if date in dataset A equals date in dataset B, if correct, then check time in dataset A, if time in dataset A equals time in dataset B then (to avoid that there are variables which are not exactly matched because time is measured to the millisecond, if time in dataset A does not equal time in dataset B then take from dataset B the variable which is next closest in time of dataset A, that is t+1) if in dataset A variable C (flow which takes values of 1 or 0) equals 1 then add in dataset A variable D from dataset B, else; add in Dataset A, variable E from Daraset B.
b- Same process as above but this time instead of exact matching based on time, I would like to (match) take the variable that is +30 seconds further in time
Hope you will be able to provide some feedback, I know this not an easy task, I have tried to tackle this issue with colleagues for a couple of weeks now.
Thanks
Neo
I think it is easy for Hash Table.
libname x v9 'd:\software'; data c2; set x.Chfdaily2; datetime=input(catx(':',value_date,newtime),anydtdtm20.); format datetime datetime.; run; data c3; set x.Chfreuters3; datetime=dhms(datepart(Date_G_),0,0,int(Time_G_)); format datetime datetime.; run; data want; if _n_ eq 1 then do; if 0 then set c3; declare hash ha(hashexp:16,dataset:'c3'); ha.definekey('_ric2','datetime'); ha.definedata(all:'Y'); ha.definedone(); end; call missing(of _all_); set c2; _ric2 =cur1cur2; datepart=datepart(datetime); do while(ha.find() ne 0); datetime+1; if datepart ne datepart(datetime) then leave; end; drop datepart; run;
Ksharp
A small sample of those datasets would help.
PG
And, in addition to samples of those datasets, it would help to know what kind of system you are on, which version of SAS you are running, and how big each of the two files really are (both in terms of numbers of records and filesize).
Isn't this identical to the problem that you, Pierre, solved so brilliantly in https://communities.sas.com/message/146311#146311 (What would be hash equivalent for this sql join?)
Tom
: It sure sounds like it might be, but we won't know until the OP provides some more information. I posted my response to the OP's question because, in the case of the question I had asked in the thread you mentioned, the actual solution that worked incorporated Pierre's brilliant approach within a somewhat complex design that used a combination of arrays and pointers.
Hi to everybody,
Thanks for your interest to my request
I will try to respond to all of the above questions:
1. I am using SAS 9.3
2. The dataset A has 2mln observations and a size of 23GB and dataset B 64mln observations and size 8GB (they are probably so large because I have no idea how to optimize them, i had dataset A running under another software and it was much smaller)
3. A bit more information on the problem. Dataset A contains 2 mln transactions of a brokerage company on the foreign exchange market during 90 days on an intraday frequency. That is, the transactions of Brokerage XYZ with different customers for 90 days for a specified number of currencies. Dataset B contains 64mln transactions of fx transactions occured on the whole market between all conterparties during those same 90 days. What I would like to do is at a first step to identify for all broker trades(lets say the company i am studying), what is the difference between the price the brokerage transacted with a customer and the price which was prevailing on the market at that point of time. To make things clear assume the brokerage sold EUR/USD at a rate of 1.3150 on the 10.12.2012 at 09:30:45. I would like to look up the EUR/USD price in dataset B on the 10.12.2012 at 09:30:45. If there is no price available at 09:30:45 then the closest available which might be lets say 09:30:50. I would like then to merge the two sets as next to the market price in dataset B there are some more useful variables that I would like to add to the first dataset.
To perform the above I need to match the trades from dataset A to the trades from dataset B at exact times and if no exact match is possible the closest one. At a second stage I would like to match at a specific point in time, e.g time at dataset A+1min, then A+30mins and so forth.(the idea here is to study how the difference between the two prices (market and brokerage price) changes over time)
4. I will need to check whether I can upload any sample of the data as I have signed a confidentiality agreement related to my phd research and the data provider I know it is difficult to fully understand the problem without seeing the data so i will check to see if I can show the basic structure/content here.
5. Somebod mentioned the execution time of a proposed command. I am not looking for an efficient solution so time here will not be an issue, I simply need to solve this programming challenge. Since I am new with SAS and have no advanced skills, working with the mentioned dataset takes me around 15-20mins per command to be executed so I am used to it
6.One more thing. Here is the date structure from the two sets, any idea how to eqalize the date formats?
Dataset A - date variable(dd/mm/yyyy): “05/01/2012” and time variable (hh:mm:ss): “08:45:24”
Dataset B - date variable: “21MAR12”, time variable: “08:45:24”
Thnx
Neo
If your tables both are Base SAS, and your data and time columns are numeric - don't concern about formats. They are just for display, and different formats doesn't affect join criteria etc.
About the +1 and +30 logic. Just rerun the original query, but with difference that you add 60 (it's number of seconds) to a.time, and then 1800 (no seconds for 30 minutes).
I guess that an hash table will solve this, but I wouldn't recommend that technique for a not so experienced SAS programmer.
If you prefer a non-hash solution, a related question would be this. Is the 64M data set sorted? If so, in what order?
When you say 64 min are you saying 64 million? SAS will run fastest if the two files only contain the minimal amount of information necessary to accomplish the required task.
Given the background details you have provided the real data isn't necessary. For people here to help, just create two sample datasets, with about 2 records in the one and about 20 records in the other (with both only containing the minimum number of variables needed to produce the desired result), as well as the file you want resulting from the program you are asking for help in creating.
Were the date and time fields input as text or as SAS dates and SAS times?
From your description, I think you will want/need a solution much like the one we talked about in the thread that Tom mentioned.
'20APR12 00:00:00' and I need first to create a date variable containing the first 7 characters. I tried with the “substr” function but it delivers a number as a result and not “20APR12”. If you have any suggestions here I would be grateful, then I will upload a sample of this dataset.
5. All date and time formats I had in csv or MS access format before uploading to SAS. I believe these were text formats
6. Tomorrow also to follow the sample result sheet that A.Tabachneck mentioned
It is difficult to recommend a solution without seeing the other file and knowing more about the file you provided.
As for the data being a SAS table, your example and explanation indicate that both files are now SAS files.
Hash (not hush) is a method that may or may not be applicable. At this point, I can't say, but doubt it given the number of records that you have.
In the file you've already provided, all of the dates and times were imported as 255 character variables. My guess is that the ultimate recommendation will be to combine the two fields as a SAS datetime field. That way, you can apply a format that will display them as desired, but still allow a program to access them as one field.
It will definitely help to understand more about the example file you already provided, as well as see an example of the other file. Don't worry about the ultimate format for now.
We need some more rules...
Is it so that b.time always should be same or later (not before) a.time?
Do you have a performance constraint (that, should this job be executed multiple times, and within some kind of time frame)?
This example code will probably have long execution time, be sure to have the appropriate indexes defined in both tables.
select a.*, (select time
from b
where a.currCd = b.currCd and
a.date = b.date and
a.time <= b.time
having b,time = max(b.time)
)
from a
;
The best way to elaborate your question is to post a smaill sample data(as simple as it could be) and the result you want to see and at the same time explain the logic you have .
Ksharp
Hello
I will try to answer to the questions from above
I uploaded also a sample of the second dataset as well as an excel sheet showing an “example”output of how I would like the data to look after the merge has been implemented.
Below a description of the datasets
DATASET A : Contains our brokers' trades with customers
Deal_Id – a unique deal identifier, i.e the number of each trade |
Cur1Cur2 – the currency pair |
Amount1Cur1 – the volume of each trade |
Spot_Rate – the price/exchange rate at which the deal is executed |
Value_Date – the date when the deal is executed |
Cpty – the customer with which our broker trades |
Flow – it takes values of 1 and -1 if the broker is buying or selling respectively |
Newtime – the time of day when the trade was executed in hh:mm:ss structure RELEVANT DATA for our problem are only value date,cur1cur2, and newtime DATASET B: Contains all the trades executed on the market – That is it provides information on the prices executed on the whole market, i.e prices offered by other brokers |
_RIC – the currency pair |
Date_G_ - the date of the trade |
Time_G_ - the time of day of the trade |
GMT_Offset - not relevant |
Type - not relevant |
Ex_Cntrb_ID - who is trading (here I extracted prices by only one broker to minimize the size of the dataset) |
LOC – not relevant |
Price – not relevant |
Volume – not relevant |
Market_VWAP – not relevant |
Bid_Price Price offered for the trade (if customer wants to buy) |
Bid_Size – not relevant |
Ask_Price Price asked for the trade (if customer wants to sell) |
Ask_Size – not relevant |
Qualifiers – not relevant |
Bid_Yld – not relevant |
Ask_Yld – not relevant |
Ric_2 – not relevant |
_ric2 – the currency pair in comparable format to DATASET A |
Midquote – the difference of the bid with the ask price divided by 2 |
DATA RELEVANT: for our problem relevant are _ric2 has to be the same as cur1cur2, then date_G_ has to be the same as value date and then newtime has to be the same as time_g_. then as a result of add to each the corresponding information of dataset b next to dataset a.In case newtime is not the same as time_g_ please match with the next closest observation of time_g_.
I think this should make things a bit more clear of what i am trying to do in this exercise at a first stage.
THE FINAL OUTPUT EXAMPLE (attached excel sheet): Have in mind that this example shows only the case where no exact match by time will be possible and the next closest value will be taken.
Thanks and a fruitful week to everybody
Cheers
Neo
I think it is easy for Hash Table.
libname x v9 'd:\software'; data c2; set x.Chfdaily2; datetime=input(catx(':',value_date,newtime),anydtdtm20.); format datetime datetime.; run; data c3; set x.Chfreuters3; datetime=dhms(datepart(Date_G_),0,0,int(Time_G_)); format datetime datetime.; run; data want; if _n_ eq 1 then do; if 0 then set c3; declare hash ha(hashexp:16,dataset:'c3'); ha.definekey('_ric2','datetime'); ha.definedata(all:'Y'); ha.definedone(); end; call missing(of _all_); set c2; _ric2 =cur1cur2; datepart=datepart(datetime); do while(ha.find() ne 0); datetime+1; if datepart ne datepart(datetime) then leave; end; drop datepart; run;
Ksharp
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.