BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Haoz
Calcite | Level 5

Hi,

I would like to merge two datasets based on their time stamp:

Dataset A

TimeA 

10:10  

10:13  

10:35  

12:23  

Dataset B

TimeB

9:30

9:59

10:13

12:33

Now I would like to, for every entry in dataset A, find the nearest (earlier in term of time) entry in dataset B, the end result would look like:

Dataset merged

TimeA   TimeB

10:10    9:59

10:13    10:13

10:35    10:13

12:23    10:13

The psedo code would look like:

proc sql;

  create table merged as

  select *

  from A left Join B

  on A.TimeA > B.TimeB

  group by TimeA

  having (TimeA - TimeB) = min(TimeA - TimeB);

quit;

However, the dataset is prohibitively large (100+GB) for using such method (by merging all the observations in B that matches TimeA>TimeB, the size will be monstrous) , is there a way where I can just read the nearest time stamp from dataset B and merge it into A ?

Many thanks.

Hao

1 ACCEPTED SOLUTION

Accepted Solutions
LinusH
Tourmaline | Level 20

Since that B is huge, the SAS format concept is not applicable.

The time interval created is a pre calculation so that it will be directly clear on each if the A.time is a match or not, and that concept could be used in a SQL as well.

Is B large because of many rows, and/or long record?

Either way, I think doing som pre processing steps must be better than an join with open intervals.

From B, keep just the time column, using a data step with REATAIN, so that you can calculate the appropriate time interval which time from A should fall in (in the join to come).

See the bellow code as skeleton based on your sample data, certainly need adjustments to your real life data:

data B;

    input TimeB time5.;

    format TimeB time8.;

    datalines;

9:30

9:59

10:13

12:33

run;

data TimeInterval(rename=prev_TimeB = TimeB);

    length start_time end_time 8;

    retain start_time 0 prev_TimeB;

    set B(keep=TimeB) end = last;

    if _n_ ne 1 then do;

        end_time = TimeB - ((TimeB - prev_TimeB) / 2);

        output;

        start_time = end_time + 0.001;

    end;

    prev_timeB = TimeB;

    if last then do;

        end_time = '23:59:59.999't;

        output;

    end;

    format prev_timeB start_time end_time time12.3;

    drop TimeB;

run;

Then join with a:

proc sql;

create table A2 as

select a.*, b.TimeB

from A

left join TimeInterval as b

on a.timeA between b.start_time and b.end_time;

Depending on what you want in your fineal output, but inner join is prefer if that fulfill your requirement;

proc sql;

create table want as

select a.*, [b-columns]

from A2 inner join B

on A2.TimeB = B.TimeB

;

quit;

An index on B.TimeB could speed things up, depending on how this table is updated.

Good luck, hope to hear some feed-back...

Data never sleeps

View solution in original post

7 REPLIES 7
LinusH
Tourmaline | Level 20

Assuming that A is the huge table, and B is relatively small look-up  table?

Is there any other information from B you wish to apply to the result set?

If you want only the time, you create a format, and just calculate the boundaries between each "normalized" timestamp.

In your example:

0:0 - 9:44 = 9:30

9:45 - 10:07 = 9:59

10:08 - 11:23 = 10:13

etc

Data never sleeps
Haoz
Calcite | Level 5

A is a big table (10+GB) and B is a huge table (100+GB)

The purpose of the merge is to acquire several columns of additional information from B into A.

I do not quite understand the "normalized' timestamp tho Smiley Sad

LinusH
Tourmaline | Level 20

Since that B is huge, the SAS format concept is not applicable.

The time interval created is a pre calculation so that it will be directly clear on each if the A.time is a match or not, and that concept could be used in a SQL as well.

Is B large because of many rows, and/or long record?

Either way, I think doing som pre processing steps must be better than an join with open intervals.

From B, keep just the time column, using a data step with REATAIN, so that you can calculate the appropriate time interval which time from A should fall in (in the join to come).

See the bellow code as skeleton based on your sample data, certainly need adjustments to your real life data:

data B;

    input TimeB time5.;

    format TimeB time8.;

    datalines;

9:30

9:59

10:13

12:33

run;

data TimeInterval(rename=prev_TimeB = TimeB);

    length start_time end_time 8;

    retain start_time 0 prev_TimeB;

    set B(keep=TimeB) end = last;

    if _n_ ne 1 then do;

        end_time = TimeB - ((TimeB - prev_TimeB) / 2);

        output;

        start_time = end_time + 0.001;

    end;

    prev_timeB = TimeB;

    if last then do;

        end_time = '23:59:59.999't;

        output;

    end;

    format prev_timeB start_time end_time time12.3;

    drop TimeB;

run;

Then join with a:

proc sql;

create table A2 as

select a.*, b.TimeB

from A

left join TimeInterval as b

on a.timeA between b.start_time and b.end_time;

Depending on what you want in your fineal output, but inner join is prefer if that fulfill your requirement;

proc sql;

create table want as

select a.*, [b-columns]

from A2 inner join B

on A2.TimeB = B.TimeB

;

quit;

An index on B.TimeB could speed things up, depending on how this table is updated.

Good luck, hope to hear some feed-back...

Data never sleeps
Haoz
Calcite | Level 5

What a brilliant idea, I think I understand what you were trying to do in your code. I will try to work out the kinks and let you know.

Haoz
Calcite | Level 5

I applied your idea to my data and it worked. Thank you.

Patrick
Opal | Level 21

Is - by any chance - table A and/or B already sorted by time (eg. if this is a transactional table)?

Are these tables in SAS or a relational data base (and if DB: which one)?

When it comes to "coding for performance" then it's really important to understand the physical implementation and the table structure as good as possible. So what are the real keys? Is this really time or are these timestamps? Is this the key or are there more columns involved. Are there indexes? If in a DB: Is the table partitioned and if yes: how? What's the size of all the variables you want to look-up (eg: would they fit into a hash)? How much memory do you have available? What's the SAS version and the OS?

When it then really comes to tweaking things: Is this a one off task or something you must run on a regular basis. And if yes: How often? If you have to run it on a regular basis: Is it an option to change how the table is stored - eg. using the SPDE engine, modify the pagesize, add indexes,....

Haoz
Calcite | Level 5

Yes they are sorted by time, the datasets are stock transactions and order book dynamics throughout many days and many stocks.

Yes they are tables in SAS. The most basic kind I suppose since I loaded them from a downloaded CSV file from a commercial (mostly academic) database.

I am really a beginner, therefore I cannot answer those more in-depth questions, my apologies.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2867 views
  • 6 likes
  • 3 in conversation