Solved: Re: Join to one of 4 tables based on a date condition

TheNovice · Posted 10-16-2020 08:11 PM

Hi friends,

I have a table A that I need to join to one of 4 source tables. The weird thing is that the source tables house different months... (same format and structure). So the tables houses months as follows:

Table	months available
A	1,5,9
B	2,6,10
C	3,7,11
D	4,8,12

I have table F like:

ID	START_date	END_date
12	2020-06-01	2020-08-30

Which i need to left join to one of the below tables where the value appears right before the end_Date

Table A

ID	date	cost
12	2020-01-15	5.00
12	2020-05-20	6.00

Table b

ID	date	cost
12	2020-02-15	5.00
12	2020-06-20	6.00

Table c

ID	date	cost
12	2020-03-15	5.00
12	2020-11-20	6.00

Table D

ID	date	cost
12	2020-04-15	5.00
12	2020-08-12	15.00

The join should give me because table D has the date and value right before the end_date in Table F

ID	START_date	END_Date	Date	Cost
12	2020-06-01	2020-08-30	2020-08-12	15.00

A. How to do the join where the query picks the right table? the tables are huge and I read that using the union function may not be ideal even in a pass through. Also the tables will have multiple values in the same month so i need to pick the cost that was recorded right before the end_Date

Any suggestions would be incredibly helpful

Thanks,

PGStats · Posted 10-18-2020 01:00 PM

I pulled this suggestion out (deleted it) after re-reading your original post and the mention about the size of tables a-b-c-d. I thought the union of those tables would create an even greater problem. But it could be that SAS/SQL is cleaver enough to avoid creating a temporary table from the union of tables a-b-c-d.

Do you get decent performance with this approach?

here it is, in a more readable format

data a;
input ID  date :yymmdd10.    cost;
datalines;
12 2020-01-15  5.00
12 2020-05-20                         6.00
;
data b;
input ID  date :yymmdd10.    cost;
datalines;
12 2020-02-15  5.00
12 2020-06-20                         6.00
;
data c;
input ID  date :yymmdd10.    cost;
datalines;
12 2020-03-15  5.00
12 2020-11-20                         6.00
;
data d;
input ID  date :yymmdd10.    cost;
datalines;
12 2020-04-15  5.00
12 2020-08-12                         15.00
;

data f;
input ID (START_date  END_date) (:yymmdd10.);
format START_date  END_date yymmdd10.;
datalines;
12 2020-06-01  2020-08-30
;

proc sql;
create table z as
select
    f.*,
    x.date format=yymmdd10.,
    x.cost
from 
    f left join
    (   select * from a 
        union all 
        select * from b 
        union all
        select * from c 
        union all
        select * from d 
        ) as x on f.id=x.id and x.date between f.START_date and f.END_date
group by f.id
having x.date=max(x.date);
select * from z;
quit;

PG

View solution in original post

PGStats · Posted 10-16-2020 10:39 PM

Please explain further where the date=08-30 and cost=15 come from. They don't appear anywhere in the example data.

PG

TheNovice · Posted 10-16-2020 11:02 PM

Apologies, i have edited the post and made the correction. Does that make sense?

PGStats · Posted 10-16-2020 11:38 PM

If table f is much smaller than a-b-c-d, it might be better to perform 4 left joins and then keep the max date for each id.

PG

TheNovice · Posted 10-16-2020 11:42 PM

Smaller yes but table F is still over a million records... right now i am doing 4 separate left joins and then using data step to consolidate all....

TheNovice · Posted 10-17-2020 10:45 PM

I got an email with your reply and it works thank you

proc sql;

create table z as

select f.*, x.date format=yymmdd10., x.cost

from f left join

( select * from a union all select * from b union all select * from c union all select * from d ) as x on f.id=x.id and x.date

between f.START_date and f.END_dategroup by f.idhaving x.date=max(x.date);

select * from z;quit;

PGStats · Posted 10-18-2020 01:00 PM

I pulled this suggestion out (deleted it) after re-reading your original post and the mention about the size of tables a-b-c-d. I thought the union of those tables would create an even greater problem. But it could be that SAS/SQL is cleaver enough to avoid creating a temporary table from the union of tables a-b-c-d.

Do you get decent performance with this approach?

here it is, in a more readable format

data a;
input ID  date :yymmdd10.    cost;
datalines;
12 2020-01-15  5.00
12 2020-05-20                         6.00
;
data b;
input ID  date :yymmdd10.    cost;
datalines;
12 2020-02-15  5.00
12 2020-06-20                         6.00
;
data c;
input ID  date :yymmdd10.    cost;
datalines;
12 2020-03-15  5.00
12 2020-11-20                         6.00
;
data d;
input ID  date :yymmdd10.    cost;
datalines;
12 2020-04-15  5.00
12 2020-08-12                         15.00
;

data f;
input ID (START_date  END_date) (:yymmdd10.);
format START_date  END_date yymmdd10.;
datalines;
12 2020-06-01  2020-08-30
;

proc sql;
create table z as
select
    f.*,
    x.date format=yymmdd10.,
    x.cost
from 
    f left join
    (   select * from a 
        union all 
        select * from b 
        union all
        select * from c 
        union all
        select * from d 
        ) as x on f.id=x.id and x.date between f.START_date and f.END_date
group by f.id
having x.date=max(x.date);
select * from z;
quit;

PG

TheNovice · Posted 10-18-2020 10:12 PM

Thank you very much. The performance wasn't so bad. Took about 10 minutes. I think it will run longer once the original table gets bigger. Can't think of a way around it but i appreciate your help

PGStats · Posted 10-18-2020 11:25 PM

For your real life task, it might be worth replacing the "*" with only the actual variable names that you really need.

PG

TheNovice · Posted 10-18-2020 11:45 PM

You are correct. the 10 min run time was with me choosing the correct variables. With * it was closer to 30 mins

Registration is open

SAS Training: Just a Click Away