I have a dataset that is solely two columns, each line represents a link between two customers. I need to create a dataset that essentially groups all customers that are linked together. A customer can be linked to multiple customers, and I need all linked customers to appear on one row. The order of customers does not matter, in the end (1,2,3) is the same as (3,2,1), and I don't want duplicate groups.
data WORK.linkedcustomers;
input Cust:8. LinkedCust:8.;
datalines;
1 2
1 3
2 3
3 4
5 6
6 7
8 9
;;;;
In the above example, I need one row with customers 1,2,3,4
Another with customers 5, 6, 7
and another with customers 8 and 9
My current approach starts with dataset with 2,500 rows about, including duplicates (1,2) and (2,1) and my approach has been to join the dataset back on itself iteratively based on column 2 = column 1 to create column 3, then again. This approach leads to a ton of duplicates and my dataset on the 8th round of finding links has ballooned to 70 million, yet there are more links left to explore.
Here is the code I am using for the second round of this (to add the 2nd customer link):
proc sql;
create table rnd2end as
select rnd1.flngcustomerkey, rnd1.linkedcust, rj.linkedcust2
from firordlinks rnd1
left join
(select r1.flngcustomerkey, r1.linkedcust, r1e.linkedcust as linkedcust2
from firordlinks r1
left join firordlinks r1e
on r1.linkedcust = r1e.flngcustomerkey
where r1.flngcustomerkey ne r1e.linkedcust) rj
on rnd1.flngcustomerkey = rj.flngcustomerkey and rnd1.linkedcust = rj.linkedcust
order by rnd1.flngcustomerkey;
quit;
and the third round looks like this, I didn't write this in a macro as it seemed very complicated, and this won't work as it is ballooning to unreasonable size quickly:
proc sql;
create table rnd3 as
select rnd2.flngcustomerkey, rnd2.linkedcust, rnd2.linkedcust2, rj.linkedcust3
from rnd2end rnd2
left join
(select r2.flngcustomerkey, r2.linkedcust, r2.linkedcust2, r3.linkedcust as linkedcust3
from rnd2end r2
left join firordlinks r3
on r2.linkedcust2 = r3.flngcustomerkey
where (r2.flngcustomerkey ne r3.linkedcust) and (r2.linkedcust ne r3.linkedcust)) rj
on rnd2.flngcustomerkey = rj.flngcustomerkey and rnd2.linkedcust = rj.linkedcust and rnd2.linkedcust2 = rj.linkedcust2
order by rnd2.flngcustomerkey;
quit;
Is there a better approach to take here?
If not, you can do this. I stole the code from the great @Ksharp .
data have;
input from :8. to:8.;
datalines;
1 2
1 3
2 3
3 4
5 6
6 7
8 9
;;;;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node; output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Assuming you want to do some sort of link analysis may-be have a look into Proc Optgraph given that your source dataset is already in the structure suitable as input into this procedure.
https://go.documentation.sas.com/doc/en/procgralgcdc/14.2/procgralg/titlepage.htm
Proc Optgraph doesn't come with a base license so first check if it's available/licensed at your site.
If not, you can do this. I stole the code from the great @Ksharp .
data have;
input from :8. to:8.;
datalines;
1 2
1 3
2 3
3 4
5 6
6 7
8 9
;;;;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node; output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Take a look at Proc Optnet if you have the SAS/OR license. Something like this will do
proc optnet data_links = have
out_nodes = want;
concomp;
run;
Note that, statistically, everyone is connected to everyone by (on average) only 6 (six) jumps. Given enough data, you'll end up with a single observation containing all people.
Oh yeah, super-clusters are fun! Linkage runs forever and if you visualize them they either break the UI or render after a long wait to a big black blob.
In this case these customers won't be linked in that way. I do believe certain clusters will be large but there is no chance of these all clustering into one group due to the nature of the data.
Yeah. If you have SAS/OR ,try the following code ( I stole it from great @PGStats ).
data have;
input from to;
cards;
1 2
2 3
;
run;
/* Same code as SAS/OR */
proc optnet data_links=have out_nodes=want GRAPH_DIRECTION=UNDIRECTED;
data_links_var from=from to=to;
concomp;
run;
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.