Re: Combine rows of data with various connections across two columns

jhzsquared · Posted 02-21-2017 12:53 AM

Hi all,

I'm at a loss on how to merge/combine rows effectively when there are multiple overlapping connections.

Essentially explained verbally: what I am trying to do is to take A and B and rename them by the first of the A's in the group. The problem is, A1= B1 and B2 but B2= A1 and A2. Thus, the result should just say A1= the names of B1 B2 and A2. These connections occur over multiple rows.

In an example:

I have:

ID1_A	ID2_B	Name
1	1	alpha
1	2	bravo
1	3	charlie
2	2	bravo
2	3	delta
3	4	echo

I want:

ID_combined	Name_combined
1	alpha bravo charlie bravo delta
3	echo

thanks! I'm on SAS university edition

Kurt_Bremser · Posted 02-21-2017 02:03 AM

What constitutes a group? At the moment I have no idea why you combine the first 5 observations, but start a new group with observation 6.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 02-21-2017 03:42 AM


I don't know why A3 wouldn't be include ,since

1	3	charlie
3	4	echo


1->3->4

jhzsquared · Posted 02-21-2017 08:08 AM

A3 is not included because the two IDs are seperate from each other. If A3 had equaled B1, B2, or B3 then it would have been included in the grouping. However, since B4 is a new ID, it is not included.

Another example would be:

ID1_A	ID2_B	Name
red	apple	alpha
red	strawberry	bravo
green	apple	charlie
green	strawberry	bravo
green	banana	delta
purple	eggplant	echo

becomes:

ID_combined	Name_combined
red	alpha bravo charlie bravo delta
purple	echo

art297 · Posted 02-21-2017 08:32 AM

There may be a more direct way, but the following appears to meet your specifications:

data have;
  infile cards dlm='09'x;
  input ID1_A	ID2_B	Name $;
  cards;
1	1	alpha
1	2	bravo
1	3	charlie
2	2	bravo
2	3	delta
3	4	echo
;

data need1;
  set have;
  recnum=_n_;
run;

proc sql;
  create table need2 as
    select recnum,count(*) as count
      from need1
        group by id2_b
          order by recnum
  ;
quit;

data need2 (rename=(id1_a=ID_combined));
  merge need1 need2;
  by recnum;
  retain hold_id1_a;
  if _n_ eq 1 then hold_id1_a=id1_a;
  else if count gt 1 then id1_a=hold_id1_a;
run;

proc transpose data=need2 out=want (drop=_:);
  by ID_combined;
  var Name;
run;

data want (drop=col:);
  set want;
  name_combined=catx(' ',of col:);
run;

Art, CEO, AnalystFinder.com

jhzsquared · Posted 02-22-2017 01:56 PM

That does work! Thank you!

Do you know how you could make it into kind of a decoder key, so that you would know that IDA equals the following IDBs and IDAs that were clustered into that group? For instance with your data input it would become:

Grouped ID	A	B
1	1	1
1	2	2
1		3

art297 · Posted 02-22-2017 02:03 PM

I think you have to more clearly describe what you want. Might just need to create a format. However, I'd pose the question as a new thread to increase the number of people responding.

Art, CEO, AnalystFinder.com

mkeintz · Posted 02-22-2017 02:15 PM

This is a good task to take advantage of the default attributes of hash object (hash table).

The data step below declares a hash table with 2 variables (ID1_A and ID2_B), keyed on ID1_A. Because the default mode of hash tables is to keep only one item (i.e. one row) per key (one per ID2_B in your case), and that one item defaults to the first one encountered, you can just output the hash object to a sas dataset to get what you want, namely each ID2_B and the earliest associated ID1A.

data _null_;

declare hash h (dataset:'have (keep=id1_a id2_b)');

h.definekey('id2_b');

h.definedata(all:'Y');

rc=h.output(dataset:'decoder');

run;

proc sort data=decoder;

by id1_a id2_b;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 02-21-2017 05:23 PM

I think you want

To form a group, identified by ID1_A, of all the ID2_B's that are linked to that ID1_A, but only if the ID2_B has not been already linked with an earlier ID1_A.
For that group of ID2_B's, get a list of all unique names to assoicate with the group identifier (ID1_A):

data have;
   input ID1_A ID2_B Name :$8.;
datalines;
1 1 alpha 
1 2 bravo 
1 3 charlie 
2 2 bravo 
2 3 delta 
3 4 echo 
run;

data need (keep=id1_a name);
  set have;

  if _n_=1 then do;
    declare hash names();  /* Running table of names encountered*/
      names.definekey('name');
      names.definedone();

    declare hash b_to_a(); /* Map from ID2_B to first linked ID1_A*/
      b_to_a.definekey('id2_b');
      b_to_a.definedata('id1_a');
      b_to_a.definedone();
  end;

  rc=names.find();
  if rc=0 then delete;

  /* Only new names get to this part*/
  rc=names.add();                /* Add it to the table of names */
  rc=b_to_a.find();              /* Find previously associated ID1_a if any*/
  if rc^=0 then rc=b_to_a.add(); /*Add new ID2_B link if not already there */
run;

proc sort; by id_1a;run;

data want;
  set need;
  by id1_a;

  length name_list $200;
  retain name_list;

  if  first.id1_a then name_list=name;
  else name_list=catx(' ',name_list,name);
  if last.id1_a ;
run;

The idea here is to maintain a hash lookup table linking ID2_B to it's first linked ID1_A, and only that ID1_A. Then it's easy to make a dataset of ID1_A and each corresponding name. Sort them and compile into a list.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 02-22-2017 12:13 AM

It is really really not easy. After got the result (final_want table), it is easy to get the ID1_A you want .

data x;
   input ID1_A ID2_B Name $;
datalines;
1 1 alpha 
1 2 bravo 
1 3 charlie 
2 2 bravo 
2 3 delta 
2 5 xx
2 6 zz
3 4 echo 
4 6 yy
5 11 ee
5 22 rr
6 22 gg
6 44 uu
7 88 ii
7 99 oo
;
run;
data key;
array x{99999} _temporary_;
do i=1 by 1 until(last.id1_a);
 set x;
 by id1_a;
 x{i}=id2_b;
end;

do m=1 to i;
 from=x{m};
 do n=m+1 to i;
  to=x{n};output;
 end;
end;
keep from to;
run;

proc sort data=key out=have nodupkey;
 by from to;
run;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
call symputx('n',household);
stop;
run;
%put &n;
data want1;
 if _n_=1 then do;
  if 0 then set want;
  declare hash h(dataset:'want');
  h.definekey('node');
  h.definedata('household');
  h.definedone();
 end;
retain n &n;
set x;
call missing(household);
if h.find(key:id2_b) ne 0 then do;n+1;household=n;end;
keep name household;
run;
proc sort data=want1 out=want2;
 by household;
run;
data final_want;
length want $ 200;
do until(last.household);
 set want2;
 by household;
 want=catx(' ',want,name);
end;
drop household name;
run;

Registration is open

SAS Training: Just a Click Away