BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
econtax
Calcite | Level 5

I have a dataset that is solely two columns, each line represents a link between two customers. I need to create a dataset that essentially groups all customers that are linked together. A customer can be linked to multiple customers, and I need all linked customers to appear on one row. The order of customers does not matter, in the end (1,2,3) is the same as (3,2,1), and I don't want duplicate groups.

 

data WORK.linkedcustomers;
input Cust:8. LinkedCust:8.;
datalines;
1 2
1 3
2 3
3 4
5 6
6 7
8 9
;;;;

In the above example, I need one row with customers 1,2,3,4

Another with customers 5, 6, 7

and another with customers 8 and 9

My current approach starts with dataset with 2,500 rows about, including duplicates (1,2) and (2,1) and my approach has been to join the dataset back on itself iteratively based on column 2 = column 1 to create column 3, then again. This approach leads to a ton of duplicates and my dataset on the 8th round of finding links has ballooned to 70 million, yet there are more links left to explore. 

 

Here is the code I am using for the second round of this (to add the 2nd customer link):

proc sql;
create table rnd2end as
select rnd1.flngcustomerkey, rnd1.linkedcust, rj.linkedcust2
from firordlinks rnd1
left join
(select r1.flngcustomerkey, r1.linkedcust, r1e.linkedcust as linkedcust2
from firordlinks r1
left join firordlinks r1e
on r1.linkedcust = r1e.flngcustomerkey
where r1.flngcustomerkey ne r1e.linkedcust) rj
on rnd1.flngcustomerkey = rj.flngcustomerkey and rnd1.linkedcust = rj.linkedcust
order by rnd1.flngcustomerkey;
quit;

 

and the third round looks like this, I didn't write this in a macro as it seemed very complicated, and this won't work as it is ballooning to unreasonable size quickly:

proc sql;
create table rnd3 as
select rnd2.flngcustomerkey, rnd2.linkedcust, rnd2.linkedcust2, rj.linkedcust3
from rnd2end rnd2
left join
(select r2.flngcustomerkey, r2.linkedcust, r2.linkedcust2, r3.linkedcust as linkedcust3
from rnd2end r2
left join firordlinks r3
on r2.linkedcust2 = r3.flngcustomerkey
where (r2.flngcustomerkey ne r3.linkedcust) and (r2.linkedcust ne r3.linkedcust)) rj
on rnd2.flngcustomerkey = rj.flngcustomerkey and rnd2.linkedcust = rj.linkedcust and rnd2.linkedcust2 = rj.linkedcust2
order by rnd2.flngcustomerkey;
quit;

 

Is there a better approach to take here?

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

If not, you can do this. I stole the code from the great @Ksharp .

 

data have;
input from :8. to:8.;
datalines;
1 2 
1 3 
2 3 
3 4 
5 6
6 7
8 9
;;;;


data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

Assuming you want to do some sort of link analysis may-be have a look into Proc Optgraph given that your source dataset is already in the structure suitable as input into this procedure. 

https://go.documentation.sas.com/doc/en/procgralgcdc/14.2/procgralg/titlepage.htm 

 

Proc Optgraph doesn't come with a base license so first check if it's available/licensed at your site.

PeterClemmensen
Tourmaline | Level 20

If not, you can do this. I stole the code from the great @Ksharp .

 

data have;
input from :8. to:8.;
datalines;
1 2 
1 3 
2 3 
3 4 
5 6
6 7
8 9
;;;;


data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
PeterClemmensen
Tourmaline | Level 20

Take a look at Proc Optnet if you have the SAS/OR license. Something like this will do

 

proc optnet data_links = have
            out_nodes  = want;
   concomp;
run;
econtax
Calcite | Level 5
Unfortunately I only have access to base SAS and changing that would involve significant bureaucracy.
Kurt_Bremser
Super User

Note that, statistically, everyone is connected to everyone by (on average) only 6 (six) jumps. Given enough data, you'll end up with a single observation containing all people.

Patrick
Opal | Level 21

Oh yeah, super-clusters are fun! Linkage runs forever and if you visualize them they either break the UI or render after a long wait to a big black blob.

econtax
Calcite | Level 5

In this case these customers won't be linked in that way. I do believe certain clusters will be large but there is no chance of these all clustering into one group due to the nature of the data. 

Ksharp
Super User

Yeah. If you have SAS/OR ,try the following code ( I stole it from great @PGStats ).

 

data have;
input from to;
cards;
1 2
2 3
;
run;


/* Same code as SAS/OR */
proc optnet data_links=have out_nodes=want GRAPH_DIRECTION=UNDIRECTED;
data_links_var from=from to=to;
concomp;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 852 views
  • 5 likes
  • 5 in conversation