Combine rows of data with various connections across two columns

Reply
New Contributor
Posts: 3

Combine rows of data with various connections across two columns

Hi all,

 

I'm at a loss on how to merge/combine rows effectively when there are multiple overlapping connections. 

Essentially explained verbally:  what I am trying to do is to take A and B and rename them by the first of the A's in the group. The problem is, A1= B1 and B2 but B2= A1 and A2. Thus, the result should just say A1= the names of B1 B2 and A2. These connections occur over multiple rows. 

 

In an example: 

I have: 

ID1_AID2_BName
11alpha
12bravo
13charlie
22bravo
23delta
34echo

 

I want:  

ID_combinedName_combined
1alpha bravo charlie bravo delta
3echo

 

thanks! I'm on SAS university edition

Esteemed Advisor
Posts: 6,651

Re: Combine rows of data with various connections across two columns

What constitutes a group? At the moment I have no idea why you combine the first 5 observations, but start a new group with observation 6.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Grand Advisor
Posts: 9,576

Re: Combine rows of data with various connections across two columns


I don't know why A3 wouldn't be include ,since

1	3	charlie
3	4	echo


1->3->4
New Contributor
Posts: 3

Re: Combine rows of data with various connections across two columns

A3 is not included because the two IDs are seperate from each other. If A3 had equaled B1, B2, or B3 then it would have been included in the grouping. However, since B4 is a new ID, it is not included. 

 

Another example would be: 

ID1_AID2_BName
redapplealpha
redstrawberrybravo
greenapplecharlie
greenstrawberrybravo
greenbananadelta
purpleeggplantecho

 

becomes: 

ID_combinedName_combined
redalpha bravo charlie bravo delta
purpleecho
Esteemed Advisor
Posts: 7,293

Re: Combine rows of data with various connections across two columns

[ Edited ]

There may be a more direct way, but the following appears to meet your specifications:

 

data have;
  infile cards dlm='09'x;
  input ID1_A	ID2_B	Name $;
  cards;
1	1	alpha
1	2	bravo
1	3	charlie
2	2	bravo
2	3	delta
3	4	echo
;

data need1;
  set have;
  recnum=_n_;
run;

proc sql;
  create table need2 as
    select recnum,count(*) as count
      from need1
        group by id2_b
          order by recnum
  ;
quit;

data need2 (rename=(id1_a=ID_combined));
  merge need1 need2;
  by recnum;
  retain hold_id1_a;
  if _n_ eq 1 then hold_id1_a=id1_a;
  else if count gt 1 then id1_a=hold_id1_a;
run;

proc transpose data=need2 out=want (drop=_:);
  by ID_combined;
  var Name;
run;

data want (drop=col:);
  set want;
  name_combined=catx(' ',of col:);
run;

Art, CEO, AnalystFinder.com

New Contributor
Posts: 3

Re: Combine rows of data with various connections across two columns

That does work! Thank you!

Do you know how you could make it into kind of a decoder key, so that you would know that IDA equals the following IDBs and IDAs that were clustered into that group? For instance with your data input it would become: 

Grouped ID AB
111
122
1 3
   
Esteemed Advisor
Posts: 7,293

Re: Combine rows of data with various connections across two columns

I think you have to more clearly describe what you want. Might just need to create a format. However, I'd pose the question as a new thread to increase the number of people responding.

 

Art, CEO, AnalystFinder.com

Super User
Posts: 789

Re: Combine rows of data with various connections across two columns

This is a good task to take advantage of the default attributes of hash object (hash table).

 

The data step below declares a hash table with 2 variables (ID1_A and ID2_B), keyed on ID1_A.  Because the default mode of hash tables is to keep only one item (i.e. one row) per key (one per ID2_B in your case), and that one item defaults to the first one encountered, you can just output the hash object to a sas dataset to get what you want, namely each ID2_B and the earliest associated ID1A.

 

data _null_;

  declare hash h (dataset:'have (keep=id1_a id2_b)');

    h.definekey('id2_b');

    h.definedata(all:'Y');

  rc=h.output(dataset:'decoder');

run;

 

proc sort data=decoder;

  by id1_a id2_b;

run;

Super User
Posts: 789

Re: Combine rows of data with various connections across two columns

I think you want

 

  1. To form a group, identified by ID1_A, of all the ID2_B's that are linked to that ID1_A, but only if the ID2_B has not been already linked with an earlier ID1_A.
  2. For that group of ID2_B's, get a list of all unique names to assoicate with the group identifier (ID1_A):

 

data have;
   input ID1_A ID2_B Name :$8.;
datalines;
1 1 alpha 
1 2 bravo 
1 3 charlie 
2 2 bravo 
2 3 delta 
3 4 echo 
run;

data need (keep=id1_a name);
  set have;

  if _n_=1 then do;
    declare hash names();  /* Running table of names encountered*/
      names.definekey('name');
      names.definedone();

    declare hash b_to_a(); /* Map from ID2_B to first linked ID1_A*/
      b_to_a.definekey('id2_b');
      b_to_a.definedata('id1_a');
      b_to_a.definedone();
  end;

  rc=names.find();
  if rc=0 then delete;

  /* Only new names get to this part*/
  rc=names.add();                /* Add it to the table of names */
  rc=b_to_a.find();              /* Find previously associated ID1_a if any*/
  if rc^=0 then rc=b_to_a.add(); /*Add new ID2_B link if not already there */
run;

proc sort; by id_1a;run;

data want;
  set need;
  by id1_a;

  length name_list $200;
  retain name_list;

  if  first.id1_a then name_list=name;
  else name_list=catx(' ',name_list,name);
  if last.id1_a ;
run;

 

The idea here is to maintain a hash lookup table linking ID2_B to it's first linked ID1_A, and only that ID1_A.  Then it's easy to make a dataset of ID1_A and each corresponding name.  Sort them and compile into a list.

Grand Advisor
Posts: 9,576

Re: Combine rows of data with various connections across two columns

It is really really not easy. After got the result (final_want table), it is easy to get the ID1_A you want .

 

data x;
   input ID1_A ID2_B Name $;
datalines;
1 1 alpha 
1 2 bravo 
1 3 charlie 
2 2 bravo 
2 3 delta 
2 5 xx
2 6 zz
3 4 echo 
4 6 yy
5 11 ee
5 22 rr
6 22 gg
6 44 uu
7 88 ii
7 99 oo
;
run;
data key;
array x{99999} _temporary_;
do i=1 by 1 until(last.id1_a);
 set x;
 by id1_a;
 x{i}=id2_b;
end;

do m=1 to i;
 from=x{m};
 do n=m+1 to i;
  to=x{n};output;
 end;
end;
keep from to;
run;

proc sort data=key out=have nodupkey;
 by from to;
run;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
call symputx('n',household);
stop;
run;
%put &n;
data want1;
 if _n_=1 then do;
  if 0 then set want;
  declare hash h(dataset:'want');
  h.definekey('node');
  h.definedata('household');
  h.definedone();
 end;
retain n &n;
set x;
call missing(household);
if h.find(key:id2_b) ne 0 then do;n+1;household=n;end;
keep name household;
run;
proc sort data=want1 out=want2;
 by household;
run;
data final_want;
length want $ 200;
do until(last.household);
 set want2;
 by household;
 want=catx(' ',want,name);
end;
drop household name;
run;
Ask a Question
Discussion stats
  • 9 replies
  • 198 views
  • 1 like
  • 5 in conversation