I have a dataset that is solely two columns, each line represents a link between two customers. I need to create a dataset that essentially groups all customers that are linked together. A customer can be linked to multiple customers, and I need all linked customers to appear on one row. The order of customers does not matter, in the end (1,2,3) is the same as (3,2,1), and I don't want duplicate groups. data WORK.linkedcustomers; input Cust:8. LinkedCust:8.; datalines; 1 2 1 3 2 3 3 4 5 6 6 7 8 9 ;;;; In the above example, I need one row with customers 1,2,3,4 Another with customers 5, 6, 7 and another with customers 8 and 9 My current approach starts with dataset with 2,500 rows about, including duplicates (1,2) and (2,1) and my approach has been to join the dataset back on itself iteratively based on column 2 = column 1 to create column 3, then again. This approach leads to a ton of duplicates and my dataset on the 8th round of finding links has ballooned to 70 million, yet there are more links left to explore. Here is the code I am using for the second round of this (to add the 2nd customer link): proc sql;
create table rnd2end as
select rnd1.flngcustomerkey, rnd1.linkedcust, rj.linkedcust2
from firordlinks rnd1
left join
(select r1.flngcustomerkey, r1.linkedcust, r1e.linkedcust as linkedcust2
from firordlinks r1
left join firordlinks r1e
on r1.linkedcust = r1e.flngcustomerkey
where r1.flngcustomerkey ne r1e.linkedcust) rj
on rnd1.flngcustomerkey = rj.flngcustomerkey and rnd1.linkedcust = rj.linkedcust
order by rnd1.flngcustomerkey;
quit; and the third round looks like this, I didn't write this in a macro as it seemed very complicated, and this won't work as it is ballooning to unreasonable size quickly: proc sql;
create table rnd3 as
select rnd2.flngcustomerkey, rnd2.linkedcust, rnd2.linkedcust2, rj.linkedcust3
from rnd2end rnd2
left join
(select r2.flngcustomerkey, r2.linkedcust, r2.linkedcust2, r3.linkedcust as linkedcust3
from rnd2end r2
left join firordlinks r3
on r2.linkedcust2 = r3.flngcustomerkey
where (r2.flngcustomerkey ne r3.linkedcust) and (r2.linkedcust ne r3.linkedcust)) rj
on rnd2.flngcustomerkey = rj.flngcustomerkey and rnd2.linkedcust = rj.linkedcust and rnd2.linkedcust2 = rj.linkedcust2
order by rnd2.flngcustomerkey;
quit; Is there a better approach to take here?
... View more