Hi friends,
I have a table A that I need to join to one of 4 source tables. The weird thing is that the source tables house different months... (same format and structure). So the tables houses months as follows:
| Table | months available |
| A | 1,5,9 |
| B | 2,6,10 |
| C | 3,7,11 |
| D | 4,8,12 |
I have table F like:
| ID | START_date | END_date |
| 12 | 2020-06-01 | 2020-08-30 |
Which i need to left join to one of the below tables where the value appears right before the end_Date
Table A
| ID | date | cost |
| 12 | 2020-01-15 | 5.00 |
| 12 | 2020-05-20 | 6.00 |
Table b
| ID | date | cost |
| 12 | 2020-02-15 | 5.00 |
| 12 | 2020-06-20 | 6.00 |
Table c
| ID | date | cost |
| 12 | 2020-03-15 | 5.00 |
| 12 | 2020-11-20 | 6.00 |
Table D
| ID | date | cost |
| 12 | 2020-04-15 | 5.00 |
| 12 | 2020-08-12 | 15.00 |
The join should give me because table D has the date and value right before the end_date in Table F
| ID | START_date | END_Date | Date | Cost |
| 12 | 2020-06-01 | 2020-08-30 | 2020-08-12 | 15.00 |
A. How to do the join where the query picks the right table? the tables are huge and I read that using the union function may not be ideal even in a pass through. Also the tables will have multiple values in the same month so i need to pick the cost that was recorded right before the end_Date
Any suggestions would be incredibly helpful
Thanks,
I pulled this suggestion out (deleted it) after re-reading your original post and the mention about the size of tables a-b-c-d. I thought the union of those tables would create an even greater problem. But it could be that SAS/SQL is cleaver enough to avoid creating a temporary table from the union of tables a-b-c-d.
Do you get decent performance with this approach?
here it is, in a more readable format
data a;
input ID date :yymmdd10. cost;
datalines;
12 2020-01-15 5.00
12 2020-05-20 6.00
;
data b;
input ID date :yymmdd10. cost;
datalines;
12 2020-02-15 5.00
12 2020-06-20 6.00
;
data c;
input ID date :yymmdd10. cost;
datalines;
12 2020-03-15 5.00
12 2020-11-20 6.00
;
data d;
input ID date :yymmdd10. cost;
datalines;
12 2020-04-15 5.00
12 2020-08-12 15.00
;
data f;
input ID (START_date END_date) (:yymmdd10.);
format START_date END_date yymmdd10.;
datalines;
12 2020-06-01 2020-08-30
;
proc sql;
create table z as
select
f.*,
x.date format=yymmdd10.,
x.cost
from
f left join
( select * from a
union all
select * from b
union all
select * from c
union all
select * from d
) as x on f.id=x.id and x.date between f.START_date and f.END_date
group by f.id
having x.date=max(x.date);
select * from z;
quit;
Please explain further where the date=08-30 and cost=15 come from. They don't appear anywhere in the example data.
If table f is much smaller than a-b-c-d, it might be better to perform 4 left joins and then keep the max date for each id.
I got an email with your reply and it works thank you
proc sql;
create table z as
select f.*, x.date format=yymmdd10., x.cost
from f left join
( select * from a union all select * from b union all select * from c union all select * from d ) as x on f.id=x.id and x.date
between f.START_date and f.END_dategroup by f.idhaving x.date=max(x.date);
select * from z;quit;
I pulled this suggestion out (deleted it) after re-reading your original post and the mention about the size of tables a-b-c-d. I thought the union of those tables would create an even greater problem. But it could be that SAS/SQL is cleaver enough to avoid creating a temporary table from the union of tables a-b-c-d.
Do you get decent performance with this approach?
here it is, in a more readable format
data a;
input ID date :yymmdd10. cost;
datalines;
12 2020-01-15 5.00
12 2020-05-20 6.00
;
data b;
input ID date :yymmdd10. cost;
datalines;
12 2020-02-15 5.00
12 2020-06-20 6.00
;
data c;
input ID date :yymmdd10. cost;
datalines;
12 2020-03-15 5.00
12 2020-11-20 6.00
;
data d;
input ID date :yymmdd10. cost;
datalines;
12 2020-04-15 5.00
12 2020-08-12 15.00
;
data f;
input ID (START_date END_date) (:yymmdd10.);
format START_date END_date yymmdd10.;
datalines;
12 2020-06-01 2020-08-30
;
proc sql;
create table z as
select
f.*,
x.date format=yymmdd10.,
x.cost
from
f left join
( select * from a
union all
select * from b
union all
select * from c
union all
select * from d
) as x on f.id=x.id and x.date between f.START_date and f.END_date
group by f.id
having x.date=max(x.date);
select * from z;
quit;
Thank you very much. The performance wasn't so bad. Took about 10 minutes. I think it will run longer once the original table gets bigger. Can't think of a way around it but i appreciate your help
For your real life task, it might be worth replacing the "*" with only the actual variable names that you really need.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.