Hi all,
I have a many-to-many join issue arising using sql trying to join two tables that have a different number of records and both have repeating subjects. My need: combine all three tables without multiplying them.
Scenario:
Table 1 is demographics table and this is a unique list without repeating subjects. Table 2 is an event table that is allowed to have repeating events and they are organized by calendar date. Table 3 is another event table and it is also organized by calendar date. In trying to do left inner joins what happens is that if I have 10 events in Table 2 and 20 events in Table 3, my sql script is generating 20 total events. I believe this is an issue with not using the correct placement of 'select distinct', but I haven't been able my script into the right format. Below is some sample tables that I constructed as well as the desired output - any help would be appreciated. thank you and stay healthy.
Sample script:
data Have1;
input ID $ age1 $ ;
datalines ;
a 20
b 28
c 23
d 31
;
data Have2;
input ID $ Date1 mmddyy10. Type1 $ ;
format date1 mmddyy8.;
datalines;
A 08/01/2009 OP
A 08/09/2009 OP
A 08/27/2009 OP
A 08/31/2009 OP
B 09/15/2009 OP
B 09/16/2009 OP
C 10/02/2009 OP
D 10/15/2009 OP
D 10/24/2009 OP
D 10/31/2009 OP
;
data Have3 ;
input ID $ Date2 mmddyy10. type2 $;
format date2 mmddyy8.;
datalines ;
A 08/05/2009 DH
B 09/03/2009 DH
B 09/16/2009 DH
B 09/24/2009 DH
C 10/24/2009 DH
C 10/31/2009 DH
D 11/02/2009 DH
;
Want (sample below):
A Table that has 16 total events (there were 10 type 1 events and 7 type 2 events, but two events occurred on same day so would be listed together).
ID | Age | Date1 | Type1 | Date2 | Type2 |
A | 20 | 8/1/09 | OP | . | . |
A | 20 | . | . | 8/5/09 | DH |
A | 20 | 8/9/09 | OP | . | . |
A | 20 | 8/27/09 | OP | . | . |
A | 20 | 8/31/09 | OP | . | . |
B | 28 | . | . | 9/3/09 | . |
B | 28 | 9/15/09 | OP | . | . |
B | 28 | 9/16/09 | OP | 9/16/09 | DH |
B | 28 | . | . | 9/24/09 | DH |
C | 23 | 10/2/09 | OP | . | . |
C | 23 | . | . | 10/24/09 | DH |
C | 23 | . | . | 10/31/09 | DH |
D | 31 | 10/15/09 | OP | . | . |
D | 31 | 10/24/09 | OP | . | . |
D | 31 | 10/31/09 | OP | . | . |
D | 31 | . | . | 11/2/09 | DH |
Your example data has some issues.
First the case of your ID variable is different in the first set, so it will not match the others unless you make the case match.
The second is since you want to match on date then don't create two different date variables. A date plus event tells you which occurred. So just use one date and then you can get two (or more) event types on the same days easily.
data Have1; input ID $ age1 $ ; datalines ; A 20 B 28 C 23 D 31 ; data Have2; input ID $ Date mmddyy10. Type1 $ ; format date mmddyy8.; datalines; A 08/01/2009 OP A 08/09/2009 OP A 08/27/2009 OP A 08/31/2009 OP B 09/15/2009 OP B 09/16/2009 OP C 10/02/2009 OP D 10/15/2009 OP D 10/24/2009 OP D 10/31/2009 OP ; data Have3 ; input ID $ Date mmddyy10. type2 $; format date mmddyy8.; datalines ; A 08/05/2009 DH B 09/03/2009 DH B 09/16/2009 DH B 09/24/2009 DH C 10/24/2009 DH C 10/31/2009 DH D 11/02/2009 DH ; proc sort data=have1; by id ; run; proc sort data=have2; by id date; run; proc sort data=have3; by id date; run; data temp; merge have2 have3; by id date; run; data want; merge have1 temp; by id; run;
I used a merge because for the problem example , with two different matching criteria Id plus Id and Date, that made more sense. I'm sure someone can force proc sql to do this but I'm too lazy to write nested from clauses if I don't see any advantage.
You really should show the code you attempted. Some times you might be close and only need a set of parentheses or a different ON (needed in this case for sure) condition.
Your example data has some issues.
First the case of your ID variable is different in the first set, so it will not match the others unless you make the case match.
The second is since you want to match on date then don't create two different date variables. A date plus event tells you which occurred. So just use one date and then you can get two (or more) event types on the same days easily.
data Have1; input ID $ age1 $ ; datalines ; A 20 B 28 C 23 D 31 ; data Have2; input ID $ Date mmddyy10. Type1 $ ; format date mmddyy8.; datalines; A 08/01/2009 OP A 08/09/2009 OP A 08/27/2009 OP A 08/31/2009 OP B 09/15/2009 OP B 09/16/2009 OP C 10/02/2009 OP D 10/15/2009 OP D 10/24/2009 OP D 10/31/2009 OP ; data Have3 ; input ID $ Date mmddyy10. type2 $; format date mmddyy8.; datalines ; A 08/05/2009 DH B 09/03/2009 DH B 09/16/2009 DH B 09/24/2009 DH C 10/24/2009 DH C 10/31/2009 DH D 11/02/2009 DH ; proc sort data=have1; by id ; run; proc sort data=have2; by id date; run; proc sort data=have3; by id date; run; data temp; merge have2 have3; by id date; run; data want; merge have1 temp; by id; run;
I used a merge because for the problem example , with two different matching criteria Id plus Id and Date, that made more sense. I'm sure someone can force proc sql to do this but I'm too lazy to write nested from clauses if I don't see any advantage.
You really should show the code you attempted. Some times you might be close and only need a set of parentheses or a different ON (needed in this case for sure) condition.
Is this what was intended? Does somebody want to simplify this? I still like @ballardw 's better.
proc sql noprint;
create table want1 as
select coalesce(a.id, b.id) as id
, a.age1
, b.date1, b.type1
from have1 as a
full join
have2 as b
on upcase(a.id) = upcase(b.id)
;
create table want2 as
select coalesce(a.id, c.id) as id
, a.age1
, c.date2, c.type2
from have1 as a
full join
have3 as c
on upcase(a.id) = upcase(c.id)
;
create table WANT as
select coalesce(w1.id, w2.id) as id
, coalesce(w1.age1, w2.age1) as age1
, w1.date1, w1.type1
, w2.date2, w2.type2
from want1 as w1
full join
want2 as w2
on upcase(w1.id) = upcase(w2.id)
and w1.age1 = w2.age1
and w1.date1 = w2.date2
; quit;
With SQL I would do:
proc sql;
create table want as
select
coalesce(a.id, b.id) as id,
c.age1,
a.date1,
a.type1,
b.date2,
b.type2
from
have2 as a full join
have3 as b on a.id=b.id and a.date1=b.date2 left join
have1 as c on coalesce(a.id, b.id) = upcase(c.id);
quit;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.