Solved: Re: Merging two datasets with different but similar columns

crikriek · Posted 06-29-2014 03:49 AM

Hello everyone,

I'm currently writing my thesis and I'm facing a serious issue. It would be awesome if you could help me with it.

I want to merge two database. The first one (let's call it TRADES) concerns the trades concerning different stocks, this one goes like that:

id_trade	Datetime
320	1/02/2014 09:00
240	1/02/2014 09:03
356	3/02/2014 09:12
908	4/02/2014 09:15
1452	5/02/2014 09:23
6543	6/02/2014 09:45

The second (Let's call it SPREAD) database is like that

Datetime	Bid	Ask
1/02/2014 09:00	12	13
1/02/2014 09:02	14	15
3/02/2014 09:10	13	14
4/02/2014 09:11	13,5	14
5/02/2014 09:24	12,2	13
6/02/2014 09:35	11,3	14

What I would like is to merge these database in order to obtain the last BID and ASK values just before the trade. This would give this

id_trade	Datetime	Bid	Ask
320	1/02/2014 09:00	12	13
240	1/02/2014 09:03	14	15
356	3/02/2014 09:12	13,5	14
908	4/02/2014 09:15	13,5	14
1452	5/02/2014 09:23	13,5	14
6543	6/02/2014 09:45	11,3	14

I've been thinking a lot about this merge but I don't see how to do it. Could you please help me with it ?

Thanks a lot

Ksharp · Posted 06-29-2014 06:29 AM

If I guess right, You are coming from Euro ? and If your requirement is not too high,SQL can achieve that.

data x;
input id_trade     Datetime & anydtdtm.;
format      Datetime Datetime.;
cards;
320      1/02/2014 09:00
240      1/02/2014 09:03
356      3/02/2014 09:12
908      4/02/2014 09:15
1452      5/02/2014 09:23
6543      6/02/2014 09:45
;
run;
data y;
input Datetime & anydtdtm.     (Bid     Ask     ) (: commax.);
format      Datetime Datetime. Bid commax12.2 ;
cards;
1/02/2014 09:00      12     13
1/02/2014 09:02      14     15
3/02/2014 09:10      13     14
4/02/2014 09:11      13,5     14
5/02/2014 09:24      12,2     13
6/02/2014 09:35      11,3     14
;
run;
proc sql;
create table want as 
 select x.*,bid,ask
  from x,y
   where  x.Datetime-y.Datetime ge 0
    group by id_trade
     having min(x.Datetime-y.Datetime)=x.Datetime-y.Datetime ;
quit;

Xia Keshan

View solution in original post

BillJones · Posted 06-29-2014 06:05 AM

crikriek,

You need a unique key on both data sets to perform the merge. I don't know your data, but I imagine something like symbol id_trade should work. If you can get those variables on both files, then you can use code like the following:

proc sort data=trades nodupkey; by symbol id_trade; run; *check log to ensure there are no dups;

proc sort data=spread nodupkey; by symbol id_trade; run; *check log to ensure there are no dups;

data trade_info;

merge trades (in=a) spread (in=b);

by symobl id_trade;

if a;

run;

Code above performs a left outer join. IOW, it keeps all the records from data set a (trades) and attaches information from data set b if the keys match. If you want an inner join which only keeps those records that match then use "if a & b;" instead of "if a;"

-Bill

Ksharp · Posted 06-29-2014 06:29 AM

If I guess right, You are coming from Euro ? and If your requirement is not too high,SQL can achieve that.

data x;
input id_trade     Datetime & anydtdtm.;
format      Datetime Datetime.;
cards;
320      1/02/2014 09:00
240      1/02/2014 09:03
356      3/02/2014 09:12
908      4/02/2014 09:15
1452      5/02/2014 09:23
6543      6/02/2014 09:45
;
run;
data y;
input Datetime & anydtdtm.     (Bid     Ask     ) (: commax.);
format      Datetime Datetime. Bid commax12.2 ;
cards;
1/02/2014 09:00      12     13
1/02/2014 09:02      14     15
3/02/2014 09:10      13     14
4/02/2014 09:11      13,5     14
5/02/2014 09:24      12,2     13
6/02/2014 09:35      11,3     14
;
run;
proc sql;
create table want as 
 select x.*,bid,ask
  from x,y
   where  x.Datetime-y.Datetime ge 0
    group by id_trade
     having min(x.Datetime-y.Datetime)=x.Datetime-y.Datetime ;
quit;

Xia Keshan

crikriek · Posted 06-29-2014 06:37 AM

Thank you guys. I really appreciate your help.

I think the solution of Xia Keshan best suits my needs. I was actually thinking about smthg like that. (And yes I come from Euro )

Thanks

crikriek · Posted 06-30-2014 05:37 AM

Xia Keshan,

I'm having trouble with the implementation of the code. I'm very new in SAS programming and I surely do'nt get all the subtilities of the program. Here's what I typed. LSV in my library. Trades_students_2012 is the dataset with the trades information and Orders_students_2012 is the one with the orders information. Time_OB is the name of the datetime in the orders_students_2012 dataset.

proc sql;

create table want as

select lsv.trades_students_2012.*,bid, ask

from lsv.trades_students_2012,lsv.orders_students_2012

where lsv.trades_students_2012.Datetime-lsv.orders_students_2012.Time_OB ge 0

group by id_trades

having min(lsv.trades_students_2012.Datetime-lsv.Orders_students_2012.Time_OB)=lsv.trades_students_2012.Datetime-lsv.orders_students_2012.Datetime ;

quit;

I also have another problem since the column Datetime in the Trades dataset has been created by myself. I had the date and I had the time so I did this

proc sql;

alter table lsv.Trades_students_2012

add datetime date label='datetime';

Proc sql;

update lsv.Trades_students_2012

set datetime= dhms(date,hour(time),minute(time), second(time));

Is it right ? Could you please help me with all this ?

Ksharp · Posted 06-30-2014 08:22 AM

You don't post your real data. I just guess something . you can wrap all these SQL statement into one proc sql and use alias to make code succinct .

proc sql;

alter table lsv.Trades_students_2012

add datetime date label='datetime';

update lsv.Trades_students_2012

set datetime= dhms(date,hour(time),minute(time), second(time));

create table want as

select a.*,bid, ask

from lsv.trades_students_2012 as a ,lsv.orders_students_2012 as b

where a.Datetime-b.Time_OB ge 0

group by id_trades

having min(a.Datetime-b.Time_OB)=a.Datetime-b.Datetime ;

quit;

Xia Keshan

crikriek · Posted 07-01-2014 03:31 AM

Thanks a lot !

Nevetheless my two datasets are huge (3GB and 2GB) so the procedure takes too much time and space.

Is there any other way to do it ?

gergely_batho · Posted 07-01-2014 05:55 AM

Data step is fast if your dataset is sorted by datetime. You can merge/interleave the 2 datasets.

The problem is, that you cannot use a merge/by statement, because time stamps are not exactly equal.

You cannot use a set/by statement, because in that case when we read from one of the datasets, the information from the other dataset is lost.

The solution is:

data want2;

set y(in=inY keep=datetime) x(in=inX keep=datetime);

by datetime;

if inY then do;

set y;

end;

if inX then do;

set x;

output;

end;

run;

crikriek · Posted 07-01-2014 08:41 AM

Thank you Gergely.

I'm thinking about the SyncJoin algorithm. I read this is a faster way to operate faster the cartesian product. Unfortunately I don't know how to apply it in my case.

gergely_batho · Posted 07-01-2014 09:37 AM

I have looked at the paper describing SyncJoin. If you have a many-to-many relationship, my code won’t work. If it’s one-to-many (or many-to-one) I believe it is at least as fast as SyncJoin.

In the SAS log, you will see, that data step reads each data set 2 times. But in practice we make one-pass through the data. The data is still in the cache, when we re-read it. There is a version of this code that avoids this “apparent re-read”, but it’s complicated.

Have you tried it? You just need to change the names of the tables (x and y to lsv....).

I've used this technique in several projects on millions of rows.

Ksharp · Posted 07-01-2014 09:49 AM

the following could help you understand what Gergely's code was trying to achieve. Of course, your data must be sorted already. or try Hash Table.

data x;
input id_trade     Datetime & anydtdtm.;
format      Datetime Datetime.;
cards;
320      1/02/2014 09:00
240      1/02/2014 09:03
356      3/02/2014 09:12
908      4/02/2014 09:15
1452      5/02/2014 09:23
6543      6/02/2014 09:45
;
run;
data y;
input Datetime & anydtdtm.     (Bid     Ask     ) (: commax.);
format      Datetime Datetime. Bid commax12.2 ;
cards;
1/02/2014 09:00      12     13
1/02/2014 09:02      14     15
3/02/2014 09:10      13     14
4/02/2014 09:11      13,5     14
5/02/2014 09:24      12,2     13
6/02/2014 09:35      11,3     14
;
run;
data want(drop=bid ask);
 set y x(in=inx);
 by Datetime;
 retain _bid _ask;
 if not missing(bid) then _bid=bid;
 if not missing(ask) then _ask=ask;
if inx;
run;

Xia Keshan

crikriek · Posted 07-03-2014 05:23 AM

Thanks a lot guys it worked perfectly !!

Ksharp · Posted 07-03-2014 09:34 AM

Good to know . That is the reason why I love this place, you can still learn something new or good idea.

Registration is open

SAS Training: Just a Click Away