Hi friends,
I have a table A that I need to join to one of 4 source tables. The weird thing is that the source tables house different months... (same format and structure). So the tables houses months as follows:
Table | months available |
A | 1,5,9 |
B | 2,6,10 |
C | 3,7,11 |
D | 4,8,12 |
I have table F like:
ID | START_date | END_date |
12 | 2020-06-01 | 2020-08-30 |
Which i need to left join to one of the below tables where the value appears right before the end_Date
Table A
ID | date | cost |
12 | 2020-01-15 | 5.00 |
12 | 2020-05-20 | 6.00 |
Table b
ID | date | cost |
12 | 2020-02-15 | 5.00 |
12 | 2020-06-20 | 6.00 |
Table c
ID | date | cost |
12 | 2020-03-15 | 5.00 |
12 | 2020-11-20 | 6.00 |
Table D
ID | date | cost |
12 | 2020-04-15 | 5.00 |
12 | 2020-08-12 | 15.00 |
The join should give me because table D has the date and value right before the end_date in Table F
ID | START_date | END_Date | Date | Cost |
12 | 2020-06-01 | 2020-08-30 | 2020-08-12 | 15.00 |
A. How to do the join where the query picks the right table? the tables are huge and I read that using the union function may not be ideal even in a pass through. Also the tables will have multiple values in the same month so i need to pick the cost that was recorded right before the end_Date
Any suggestions would be incredibly helpful
Thanks,
I pulled this suggestion out (deleted it) after re-reading your original post and the mention about the size of tables a-b-c-d. I thought the union of those tables would create an even greater problem. But it could be that SAS/SQL is cleaver enough to avoid creating a temporary table from the union of tables a-b-c-d.
Do you get decent performance with this approach?
here it is, in a more readable format
data a;
input ID date :yymmdd10. cost;
datalines;
12 2020-01-15 5.00
12 2020-05-20 6.00
;
data b;
input ID date :yymmdd10. cost;
datalines;
12 2020-02-15 5.00
12 2020-06-20 6.00
;
data c;
input ID date :yymmdd10. cost;
datalines;
12 2020-03-15 5.00
12 2020-11-20 6.00
;
data d;
input ID date :yymmdd10. cost;
datalines;
12 2020-04-15 5.00
12 2020-08-12 15.00
;
data f;
input ID (START_date END_date) (:yymmdd10.);
format START_date END_date yymmdd10.;
datalines;
12 2020-06-01 2020-08-30
;
proc sql;
create table z as
select
f.*,
x.date format=yymmdd10.,
x.cost
from
f left join
( select * from a
union all
select * from b
union all
select * from c
union all
select * from d
) as x on f.id=x.id and x.date between f.START_date and f.END_date
group by f.id
having x.date=max(x.date);
select * from z;
quit;
Please explain further where the date=08-30 and cost=15 come from. They don't appear anywhere in the example data.
If table f is much smaller than a-b-c-d, it might be better to perform 4 left joins and then keep the max date for each id.
I got an email with your reply and it works thank you
proc sql;
create table z as
select f.*, x.date format=yymmdd10., x.cost
from f left join
( select * from a union all select * from b union all select * from c union all select * from d ) as x on f.id=x.id and x.date
between f.START_date and f.END_dategroup by f.idhaving x.date=max(x.date);
select * from z;quit;
I pulled this suggestion out (deleted it) after re-reading your original post and the mention about the size of tables a-b-c-d. I thought the union of those tables would create an even greater problem. But it could be that SAS/SQL is cleaver enough to avoid creating a temporary table from the union of tables a-b-c-d.
Do you get decent performance with this approach?
here it is, in a more readable format
data a;
input ID date :yymmdd10. cost;
datalines;
12 2020-01-15 5.00
12 2020-05-20 6.00
;
data b;
input ID date :yymmdd10. cost;
datalines;
12 2020-02-15 5.00
12 2020-06-20 6.00
;
data c;
input ID date :yymmdd10. cost;
datalines;
12 2020-03-15 5.00
12 2020-11-20 6.00
;
data d;
input ID date :yymmdd10. cost;
datalines;
12 2020-04-15 5.00
12 2020-08-12 15.00
;
data f;
input ID (START_date END_date) (:yymmdd10.);
format START_date END_date yymmdd10.;
datalines;
12 2020-06-01 2020-08-30
;
proc sql;
create table z as
select
f.*,
x.date format=yymmdd10.,
x.cost
from
f left join
( select * from a
union all
select * from b
union all
select * from c
union all
select * from d
) as x on f.id=x.id and x.date between f.START_date and f.END_date
group by f.id
having x.date=max(x.date);
select * from z;
quit;
Thank you very much. The performance wasn't so bad. Took about 10 minutes. I think it will run longer once the original table gets bigger. Can't think of a way around it but i appreciate your help
For your real life task, it might be worth replacing the "*" with only the actual variable names that you really need.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.