Solved: Re: How to combine ten data sets using PROC SQL efficiently

shihabur · Posted 10-16-2017 07:33 PM

I have ten data sets and all of them has customer_id variable in common. I know if we have three data sets we can do something like this

proc sql;
create table example as
select a.ID, b.*, c.* from
temp a full join temp2 b
on a.id = b.id
full join temp3 c
on a.id = c.id;
quit;

But for ten or more dataset it will get really complex. So is there any other efficient way to do this ?

LinusH · Posted 10-17-2017 02:08 PM

17, 18, 19, 20 & 22 is missing just in the idno column, because you are omitting the i.id and j.id in your coalesce argument.

SQL doen't necessarily output in sorted order, even on the join column.

Add an ORDER BY clause to have in the order you want.

Data never sleeps

View solution in original post

Reeza · Posted 10-16-2017 07:54 PM

Try a data step merge instead?

shihabur · Posted 10-17-2017 01:58 PM

Despite having to sort every data set, merging is providing the desired result for me. Below is the code I tried to compare results between proc sql and data step merging.

/*created 10 data sets*/

data temp1;
input id x1 x2;
cards;
1 25 37
2 35 .
3 44 97
;
run;

data temp2;
input id var1 var2;
cards;
2 65 37
3 .  47
5 34 97
;
run;

data temp3;
input id xx1 xx2;
cards;
3 55 37
5 25 47
4 .  97
;
run;

data temp4;
input id var3 var4;
cards;
6 25 37
7 35 47
9 .  .
;
run;

data temp5;
input id var5 var6;
cards;
5 65 37
6 85 47
9 34 97
;
run;

data temp6;
input id xx3 xx4;
cards;
11 55 37
13 25 47
14 64 97
;
run;

data temp7;
input id xx5 xx6;
cards;
11 25 .
15 35 47
16 44 97
;
run;

data temp8;
input id x3 x4;
cards;
12 65 .
15 85 47
16 .  97
;
run;

data temp9;
input id x5 x6;
cards;
17 55 37
18 25 47
19 64 97
;
run;

data temp10;
input id x7 x8;
cards;
17 .  .
20 25 37
22 24 97
;
run;

/*not including the proc sort here */

data new;
merge temp1 temp2 temp3 temp4 temp5 temp6 temp7 temp8 temp9 temp10;
by id;
run;

proc print data = new;
title 'merged 10 data sets';
run;

The above code produces below output, which seems like what I want.

But the PROC SQL output looks messed up. I ran the same code for joining 3 tables and then the output was fine. But for 10 tables the output looks very weird. I am just experimenting with proc sql and want to know how to produce similar output like the data step merge.

title "proc sql full join";
proc sql;
select coalesce(a.id, b.id, c.id, d.id, e.id, f.id, g.id, h.id) as idno, a.*, b.*, c.*, d.*, e.*, f.*, g.*, h.* , i.*, j.*
from temp1 a Full Join temp2 b on a.id = b.id
Full Join temp3 c on b.id = c.id
Full Join temp4 d on c.id = d.id
Full Join temp5 e on d.id = e.id
Full Join temp6 f on e.id = f.id
Full Join temp7 g on f.id = g.id
Full Join temp8 h on g.id = h.id
Full Join temp9 i on h.id = i.id
Full Join temp10 j on i.id = j.id;
quit;

and the output is below.The idno column is not sorted in ascending order and it is also omitting some of the id numbers from 17,18,19,20,22 . Also idno 5 is repeated twice. Can you please tell me what's wrong here ?

LinusH · Posted 10-17-2017 02:08 PM

17, 18, 19, 20 & 22 is missing just in the idno column, because you are omitting the i.id and j.id in your coalesce argument.

SQL doen't necessarily output in sorted order, even on the join column.

Add an ORDER BY clause to have in the order you want.

Data never sleeps

shihabur · Posted 10-17-2017 02:31 PM

Thanks a lot. Made those changes (adding i.id and j.id in the coalesce and adding order by) and the output looks much better. Only one question, why idno 5 is coming up twice ?

LinusH · Posted 10-17-2017 05:31 AM

Please define "efficient"? It sounds like you mean less complex. And yes, it might be easier with a data step merge as @Reeza suggests. Bare in mind you might need to pre-sort the data sets - all need to be sorted on the BY variable which is not a requirement for SQL joins).

Data never sleeps

shihabur · Posted 10-17-2017 12:32 PM

@LinusHyou are right. I meant less complex.

Registration is open

SAS Training: Just a Click Away