Solved: Re: PROC SQL - select distinct w/out multiplying in a many-to-many joi...

N8 · Posted 03-27-2020 12:36 PM

Hi all,

I have a many-to-many join issue arising using sql trying to join two tables that have a different number of records and both have repeating subjects. My need: combine all three tables without multiplying them.

Scenario:

Table 1 is demographics table and this is a unique list without repeating subjects. Table 2 is an event table that is allowed to have repeating events and they are organized by calendar date. Table 3 is another event table and it is also organized by calendar date. In trying to do left inner joins what happens is that if I have 10 events in Table 2 and 20 events in Table 3, my sql script is generating 20 total events. I believe this is an issue with not using the correct placement of 'select distinct', but I haven't been able my script into the right format. Below is some sample tables that I constructed as well as the desired output - any help would be appreciated. thank you and stay healthy.

Sample script:

data Have1;
input ID $ age1 $ ;
datalines ;
a 20
b 28
c 23
d 31
;

data Have2;
input ID $ Date1 mmddyy10. Type1 $ ;
format date1 mmddyy8.;
datalines;
A 08/01/2009 OP
A 08/09/2009 OP
A 08/27/2009 OP
A 08/31/2009 OP
B 09/15/2009 OP
B 09/16/2009 OP
C 10/02/2009 OP
D 10/15/2009 OP
D 10/24/2009 OP
D 10/31/2009 OP
;

data Have3 ;
input ID $ Date2 mmddyy10. type2 $;
format date2 mmddyy8.;
datalines ;
A 08/05/2009 DH
B 09/03/2009 DH
B 09/16/2009 DH
B 09/24/2009 DH
C 10/24/2009 DH
C 10/31/2009 DH
D 11/02/2009 DH
;

Want (sample below):

A Table that has 16 total events (there were 10 type 1 events and 7 type 2 events, but two events occurred on same day so would be listed together).

ID	Age	Date1	Type1	Date2	Type2
A	20	8/1/09	OP	.	.
A	20	.	.	8/5/09	DH
A	20	8/9/09	OP	.	.
A	20	8/27/09	OP	.	.
A	20	8/31/09	OP	.	.
B	28	.	.	9/3/09	.
B	28	9/15/09	OP	.	.
B	28	9/16/09	OP	9/16/09	DH
B	28	.	.	9/24/09	DH
C	23	10/2/09	OP	.	.
C	23	.	.	10/24/09	DH
C	23	.	.	10/31/09	DH
D	31	10/15/09	OP	.	.
D	31	10/24/09	OP	.	.
D	31	10/31/09	OP	.	.
D	31	.	.	11/2/09	DH

ballardw · Posted 03-27-2020 04:43 PM

Your example data has some issues.

First the case of your ID variable is different in the first set, so it will not match the others unless you make the case match.

The second is since you want to match on date then don't create two different date variables. A date plus event tells you which occurred. So just use one date and then you can get two (or more) event types on the same days easily.

data Have1;
input ID $ age1 $ ;
datalines ;
A 20
B 28
C 23
D 31
;

data Have2;
input ID $ Date mmddyy10. Type1 $ ;
format date mmddyy8.;
datalines;
A 08/01/2009 OP
A 08/09/2009 OP
A 08/27/2009 OP
A 08/31/2009 OP
B 09/15/2009 OP
B 09/16/2009 OP
C 10/02/2009 OP
D 10/15/2009 OP
D 10/24/2009 OP
D 10/31/2009 OP
;

data Have3 ;
input ID $ Date mmddyy10. type2 $;
format date mmddyy8.;
datalines ;
A 08/05/2009 DH
B 09/03/2009 DH
B 09/16/2009 DH
B 09/24/2009 DH
C 10/24/2009 DH
C 10/31/2009 DH
D 11/02/2009 DH
;

proc sort data=have1;
    by id ;
run;
proc sort data=have2;
   by id date;
run;
proc sort data=have3;
	by id date;
run;
data temp;
   merge have2 have3;
	by id date;
run;

data want;
   merge have1 temp;
	by id;
run;

I used a merge because for the problem example , with two different matching criteria Id plus Id and Date, that made more sense. I'm sure someone can force proc sql to do this but I'm too lazy to write nested from clauses if I don't see any advantage.

You really should show the code you attempted. Some times you might be close and only need a set of parentheses or a different ON (needed in this case for sure) condition.

View solution in original post

ballardw · Posted 03-27-2020 04:43 PM

Your example data has some issues.

First the case of your ID variable is different in the first set, so it will not match the others unless you make the case match.

The second is since you want to match on date then don't create two different date variables. A date plus event tells you which occurred. So just use one date and then you can get two (or more) event types on the same days easily.

data Have1;
input ID $ age1 $ ;
datalines ;
A 20
B 28
C 23
D 31
;

data Have2;
input ID $ Date mmddyy10. Type1 $ ;
format date mmddyy8.;
datalines;
A 08/01/2009 OP
A 08/09/2009 OP
A 08/27/2009 OP
A 08/31/2009 OP
B 09/15/2009 OP
B 09/16/2009 OP
C 10/02/2009 OP
D 10/15/2009 OP
D 10/24/2009 OP
D 10/31/2009 OP
;

data Have3 ;
input ID $ Date mmddyy10. type2 $;
format date mmddyy8.;
datalines ;
A 08/05/2009 DH
B 09/03/2009 DH
B 09/16/2009 DH
B 09/24/2009 DH
C 10/24/2009 DH
C 10/31/2009 DH
D 11/02/2009 DH
;

proc sort data=have1;
    by id ;
run;
proc sort data=have2;
   by id date;
run;
proc sort data=have3;
	by id date;
run;
data temp;
   merge have2 have3;
	by id date;
run;

data want;
   merge have1 temp;
	by id;
run;

I used a merge because for the problem example , with two different matching criteria Id plus Id and Date, that made more sense. I'm sure someone can force proc sql to do this but I'm too lazy to write nested from clauses if I don't see any advantage.

You really should show the code you attempted. Some times you might be close and only need a set of parentheses or a different ON (needed in this case for sure) condition.

JerryV · Posted 03-27-2020 05:04 PM

Is this what was intended? Does somebody want to simplify this? I still like @ballardw 's better.

proc sql noprint;
create table want1 as 
select coalesce(a.id, b.id) as id
	     , a.age1
	     , b.date1, b.type1 
	from have1 as a
	full join 
	     have2 as b
	on upcase(a.id) = upcase(b.id)
;
create table want2 as 
select coalesce(a.id, c.id) as id
     , a.age1
     , c.date2, c.type2
from have1 as a
full join 
     have3 as c
on upcase(a.id) = upcase(c.id)
; 
create table WANT as 
select coalesce(w1.id, w2.id) as id
     , coalesce(w1.age1, w2.age1) as age1
     , w1.date1, w1.type1
	 , w2.date2, w2.type2
from want1 as w1
full join 
     want2 as w2
on upcase(w1.id) = upcase(w2.id) 
   and w1.age1 = w2.age1
   and w1.date1 = w2.date2
; quit;

PGStats · Posted 03-28-2020 06:13 PM

With SQL I would do:

proc sql;
create table want as
select 
	coalesce(a.id, b.id) as id,
	c.age1,
	a.date1,
	a.type1,
	b.date2,
	b.type2
from 
	have2 as a full join
	have3 as b on a.id=b.id and a.date1=b.date2 left join
	have1 as c on coalesce(a.id, b.id) = upcase(c.id);
quit;

PG

PROC SQL - select distinct w/out multiplying in a many-to-many join

Re: PROC SQL - select distinct w/out multiplying in a many-to-many join

Re: PROC SQL - select distinct w/out multiplying in a many-to-many join

Re: PROC SQL - select distinct w/out multiplying in a many-to-many join

Re: PROC SQL - select distinct w/out multiplying in a many-to-many join

PROC SQL - select distinct w/out multiplying in a many-to-many join

Re: PROC SQL - select distinct w/out multiplying in a many-to-many join

Re: PROC SQL - select distinct w/out multiplying in a many-to-many join

Re: PROC SQL - select distinct w/out multiplying in a many-to-many join

Re: PROC SQL - select distinct w/out multiplying in a many-to-many join

Registration is open

SAS Training: Just a Click Away