Solved: Re: Joining two datasets with the closest observation in terms of time...

Haoz · Posted 01-23-2014 02:08 AM

Hi,

I would like to merge two datasets based on their time stamp:

Dataset A

TimeA

10:10

10:13

10:35

12:23

Dataset B

TimeB

9:30

9:59

10:13

12:33

Now I would like to, for every entry in dataset A, find the nearest (earlier in term of time) entry in dataset B, the end result would look like:

Dataset merged

TimeA TimeB

10:10 9:59

10:13 10:13

10:35 10:13

12:23 10:13

The psedo code would look like:

proc sql;

create table merged as

select *

from A left Join B

on A.TimeA > B.TimeB

group by TimeA

having (TimeA - TimeB) = min(TimeA - TimeB);

quit;

However, the dataset is prohibitively large (100+GB) for using such method (by merging all the observations in B that matches TimeA>TimeB, the size will be monstrous) , is there a way where I can just read the nearest time stamp from dataset B and merge it into A ?

Many thanks.

Hao

LinusH · Posted 01-23-2014 04:24 AM

Since that B is huge, the SAS format concept is not applicable.

The time interval created is a pre calculation so that it will be directly clear on each if the A.time is a match or not, and that concept could be used in a SQL as well.

Is B large because of many rows, and/or long record?

Either way, I think doing som pre processing steps must be better than an join with open intervals.

From B, keep just the time column, using a data step with REATAIN, so that you can calculate the appropriate time interval which time from A should fall in (in the join to come).

See the bellow code as skeleton based on your sample data, certainly need adjustments to your real life data:

data B;

input TimeB time5.;

format TimeB time8.;

datalines;

9:30

9:59

10:13

12:33

run;

data TimeInterval(rename=prev_TimeB = TimeB);

length start_time end_time 8;

retain start_time 0 prev_TimeB;

set B(keep=TimeB) end = last;

if _n_ ne 1 then do;

end_time = TimeB - ((TimeB - prev_TimeB) / 2);

output;

start_time = end_time + 0.001;

end;

prev_timeB = TimeB;

if last then do;

end_time = '23:59:59.999't;

output;

end;

format prev_timeB start_time end_time time12.3;

drop TimeB;

run;

Then join with a:

proc sql;

create table A2 as

select a.*, b.TimeB

from A

left join TimeInterval as b

on a.timeA between b.start_time and b.end_time;

Depending on what you want in your fineal output, but inner join is prefer if that fulfill your requirement;

proc sql;

create table want as

select a.*, [b-columns]

from A2 inner join B

on A2.TimeB = B.TimeB

;

quit;

An index on B.TimeB could speed things up, depending on how this table is updated.

Good luck, hope to hear some feed-back...

Data never sleeps

View solution in original post

LinusH · Posted 01-23-2014 02:37 AM

Assuming that A is the huge table, and B is relatively small look-up table?

Is there any other information from B you wish to apply to the result set?

If you want only the time, you create a format, and just calculate the boundaries between each "normalized" timestamp.

In your example:

0:0 - 9:44 = 9:30

9:45 - 10:07 = 9:59

10:08 - 11:23 = 10:13

etc

Data never sleeps

Haoz · Posted 01-23-2014 02:42 AM

A is a big table (10+GB) and B is a huge table (100+GB)

The purpose of the merge is to acquire several columns of additional information from B into A.

I do not quite understand the "normalized' timestamp tho

LinusH · Posted 01-23-2014 04:24 AM

Since that B is huge, the SAS format concept is not applicable.

The time interval created is a pre calculation so that it will be directly clear on each if the A.time is a match or not, and that concept could be used in a SQL as well.

Is B large because of many rows, and/or long record?

Either way, I think doing som pre processing steps must be better than an join with open intervals.

From B, keep just the time column, using a data step with REATAIN, so that you can calculate the appropriate time interval which time from A should fall in (in the join to come).

See the bellow code as skeleton based on your sample data, certainly need adjustments to your real life data:

data B;

input TimeB time5.;

format TimeB time8.;

datalines;

9:30

9:59

10:13

12:33

run;

data TimeInterval(rename=prev_TimeB = TimeB);

length start_time end_time 8;

retain start_time 0 prev_TimeB;

set B(keep=TimeB) end = last;

if _n_ ne 1 then do;

end_time = TimeB - ((TimeB - prev_TimeB) / 2);

output;

start_time = end_time + 0.001;

end;

prev_timeB = TimeB;

if last then do;

end_time = '23:59:59.999't;

output;

end;

format prev_timeB start_time end_time time12.3;

drop TimeB;

run;

Then join with a:

proc sql;

create table A2 as

select a.*, b.TimeB

from A

left join TimeInterval as b

on a.timeA between b.start_time and b.end_time;

Depending on what you want in your fineal output, but inner join is prefer if that fulfill your requirement;

proc sql;

create table want as

select a.*, [b-columns]

from A2 inner join B

on A2.TimeB = B.TimeB

;

quit;

An index on B.TimeB could speed things up, depending on how this table is updated.

Good luck, hope to hear some feed-back...

Data never sleeps

Haoz · Posted 01-23-2014 05:03 AM

What a brilliant idea, I think I understand what you were trying to do in your code. I will try to work out the kinks and let you know.

Haoz · Posted 01-31-2014 05:42 AM

I applied your idea to my data and it worked. Thank you.

Patrick · Posted 01-23-2014 04:46 AM

Is - by any chance - table A and/or B already sorted by time (eg. if this is a transactional table)?

Are these tables in SAS or a relational data base (and if DB: which one)?

When it comes to "coding for performance" then it's really important to understand the physical implementation and the table structure as good as possible. So what are the real keys? Is this really time or are these timestamps? Is this the key or are there more columns involved. Are there indexes? If in a DB: Is the table partitioned and if yes: how? What's the size of all the variables you want to look-up (eg: would they fit into a hash)? How much memory do you have available? What's the SAS version and the OS?

When it then really comes to tweaking things: Is this a one off task or something you must run on a regular basis. And if yes: How often? If you have to run it on a regular basis: Is it an option to change how the table is stored - eg. using the SPDE engine, modify the pagesize, add indexes,....

Haoz · Posted 01-23-2014 05:13 AM

Yes they are sorted by time, the datasets are stock transactions and order book dynamics throughout many days and many stocks.

Yes they are tables in SAS. The most basic kind I suppose since I loaded them from a downloaded CSV file from a commercial (mostly academic) database.

I am really a beginner, therefore I cannot answer those more in-depth questions, my apologies.

Joining two datasets with the closest observation in terms of time stamp

Re: Joining two datasets with the closest observation in terms of time stamp

Re: Joining two datasets with the closest observation in terms of time stamp

Re: Joining two datasets with the closest observation in terms of time stamp

Re: Joining two datasets with the closest observation in terms of time stamp

Re: Joining two datasets with the closest observation in terms of time stamp

Re: Joining two datasets with the closest observation in terms of time stamp

Re: Joining two datasets with the closest observation in terms of time stamp

Re: Joining two datasets with the closest observation in terms of time stamp

SAS Innovate 2025: Call for Content

Classroom Training Available!