Re: joining two tables

pacman94 · Posted 07-16-2021 12:00 PM

Hello . I need help with combining two datasets with all time points for each ID even though they are not in dataset #2:

dataset #1: 'event' table

event

Month 1

Month 2

Month 3

dataset#2 'id_all'

ID	event	value	value2
5656	Month 1	13	324
6565	Month 2	31	465
4663	Month 1	14	697
4242
1234	Month 3	14	656
5421	Month 1	12	647

want:

ID	event	value	value2
5656	Month 1	13	324
5656	Month 2
5656	Month 3
6565	Month 1
6565	Month 2	31	465
6565	Month 3
4663	Month 1	14	697
4663	Month 2
4663	Month 3
4242
4242	Month 1
4242	Month 2
4242	Month 3
5421	Month 1	12	647
5421	Month 2
5421	Month 3
1234	Month 1
1234	Month 2
1234	Month 3	14	656

here's my attempt so far:

proc sql;

select a.*, b.event

from id_all a right join event b on a.event = b.event;

quit;

doesn't work:(

Reeza · Posted 07-16-2021 12:23 PM

Change your code from a right join, which only keeps records in the right table, to a full join, where you keep records from both tables.

maguiremq · Posted 07-16-2021 12:38 PM

In your 'want' table, do you need that null value for event for id 4242?

pacman94 · Posted 07-16-2021 12:40 PM

Yes

PGStats · Posted 07-16-2021 12:56 PM

If every event is represented at least once in dataset2, you don't even need dataset1:

proc sql;
select  	
	a.ID, 
	b.event,
	c.value,
	c.value2
from	
	(select distinct ID from dataset2) as a cross join
	(select distinct event from dataset2) as b left join
	dataset2 as c on a.ID=c.ID and b.event=c.event;
quit;

Edit: Or

proc sql;
select      
    a.ID, 
    b.event,
    c.value,
    c.value2
from    
    (select distinct ID from dataset2) as a cross join
    (select distinct event from dataset2 where event is not missing) as b left join
    dataset2 as c on a.ID=c.ID and b.event=c.event
union select * from dataset2 where event is missing;
quit;

to accomodate missing events.

PG

maguiremq · Posted 07-16-2021 01:14 PM

When I ran @PGStats code (pardon me if I mistyped something), I encountered the same issue that I had before I asked you about the null value for event. It gets replicated across each ID in the output.

id event value value2 
1234   . . 
1234 Month 1 . . 
1234 Month 2 . . 
1234 Month 3 14 656 
4242   . . 
4242 Month 1 . . 
4242 Month 2 . . 
4242 Month 3 . . 
4663   . . 
4663 Month 1 14 697 
4663 Month 2 . . 
4663 Month 3 . . 
5421   . . 
5421 Month 1 12 647 
5421 Month 2 . . 
5421 Month 3 . . 
5656   . . 
5656 Month 1 13 324 
5656 Month 2 . . 
5656 Month 3 . . 
6565   . . 
6565 Month 1 . . 
6565 Month 2 31 465 
6565 Month 3 . .

It's a bit odd that you have to retain that null event, but if that is what you need, here is one solution. If you need more info about the SPARSE option in PROC FREQ, look here.

I'm not the best with PROC SQL, so @PGStats (thanks for showing me how to do something like this with PROC SQL) may have a better solution after accounting for that weird requirement with your data.

data have;
infile datalines delimiter = "," missover;
input id $ event :$ value  value2;
datalines;
5656,Month 1,13,324
6565,Month 2,31,465
4663,Month 1,14,697
4242,,.,.
1234,Month 3,14,656
5421,Month 1,12,647
;
run;

proc freq data = have;
	tables event*id / 	out = have_2 (where = (event ^= "" or (event = "" and count = 1)))	  
						sparse;
run;

proc sql;
	select
				a.id,
				a.event,
				b.value,
				b.value2
	from
				have_2 as a
					left join
				have  as b
						on	a.id 	= b.id 		and
							a.event = b.event;
quit;

id event value value2 
1234 Month 1 . . 
1234 Month 2 . . 
1234 Month 3 14 656 
4242   . . 
4242 Month 1 . . 
4242 Month 2 . . 
4242 Month 3 . . 
4663 Month 1 14 697 
4663 Month 2 . . 
4663 Month 3 . . 
5421 Month 1 12 647 
5421 Month 2 . . 
5421 Month 3 . . 
5656 Month 1 13 324 
5656 Month 2 . . 
5656 Month 3 . . 
6565 Month 1 . . 
6565 Month 2 31 465 
6565 Month 3 . .

Rydhm · Posted 07-16-2021 02:50 PM

You can get the final table by following proc sql few steps.

1. Get the cartesian Product of just Id and event.

2. Merge the step1 result with the original id_all table and get the common rows.

3. Get the subset from step1 which is excluded from step2 and add it back

data event;
input event $10.;
datalines;
Month1
Month2
Month3
run;

data id_all;
infile datalines missover;
input ID event :$10. value value2;
datalines;
5656 Month1 13 324
6565 Month2 31 465
4663 Month1 14 697
4242
1234 Month3 14 656
5421 Month1 12 647
run;

proc sql;
create table subset
as
select  a.id, b.event
from id_all a,event b;
quit;

proc sql;
create table match
as
select a.id, b.event, a.value, a.value2
from id_all a
left join subset b
on a.id=b.id
and a.event=b.event
;
quit;

proc sql;
create table all
as
select * from match
union
(select id, event, . as value, . as value2 from subset
 except
 select id, event, . as value, . as value2 from match);
quit;

proc sort data=all; by id, event; run;

proc print data=all; run;

joining two tables