Solved: Re: Creating a grouper column from multiple columns, diagonal records

paulsonalec · Posted 02-06-2020 12:05 PM

Hi! Given these data:

data have;
infile datalines delimiter=',';
input id1 $ id2 $;
datalines;
M,A
A,B
C,D
B,C
V,W
W,X
X,Y
Y,Z
;

I am hoping to create a grouper ID that will associate M to A, A to B, B to C, and C to D. Same thing with V to W, W to X, X to Y and Y to Z. Please see below for the data I want:

data want;
infile datalines delimiter=',';
input id1 $ id2 $ grouper;
datalines;
M,A,1
A,B,1
B,C,1
C,D,1
V,W,2
W,X,2
X,Y,2
Y,Z,2
;

So, in the output, there are 2 groups that are established. Is this possible using SAS 9.4? Please note that I cannot guarantee sort order so ideally the solution does not use first/last/lag commands.

TIA!

ed_sas_member · Posted 02-06-2020 12:08 PM

Hi @paulsonalec

How would you describe the rule here?

For example, two records belong to the same group if the id2 value of the previous record equals the id1 value of the current record.

Is that right?

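data want;
	set have;
	/* id2 from the previous row (missing on the first row) */
	_lag = lag(ID2);
	/* start a new group whenever the chain breaks */
	if _lag ne ID1 then count+1;
	/* redundant: the sum statement above already retains count */
	retain count;
	drop _lag;
run;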

paulsonalec · Posted 02-06-2020 12:10 PM

Thanks! I should have included this in my post: the sort order is not guaranteed, so I don't think first/last/lag logic will work, if that makes sense. I will edit my post.

ed_sas_member · Posted 02-06-2020 12:14 PM

Hi @paulsonalec

Thanks for the prompt answer.

I have updated my post too 🙂

paulsonalec · Posted 02-06-2020 12:18 PM

Thanks! Again, I really wish I had included this detail in the initial post: sort order cannot be guaranteed. Please see the updated post, which includes the value 'M' related to 'A'. Further, I think this will create 3 groups with my first set of data (groups 0, 1 and 2), right? Thanks!

ed_sas_member · Posted 02-06-2020 12:20 PM

Hi @paulsonalec

Here is the output with the same code:

-> 2 groups are created as the condition is based on the lag value of ID2.

Best,

paulsonalec · Posted 02-06-2020 12:23 PM

Right, sorry for the confusion. I guess I'm interested in this example, since sort order cannot be guaranteed here (see how I've changed row order of 'C,D' and 'B,C'). Again, I'm sorry for being annoying/not including this detail in the beginning:

data have;
infile datalines delimiter=',';
input id1 $ id2 $;
datalines;
M,A
A,B
C,D
B,C
V,W
W,X
X,Y
Y,Z
;
data want;
	set have;
	_lag = lag(ID2);
	if _lag ne ID1 then count+1;
	retain count;
	drop _lag;
run;

ed_sas_member · Posted 02-06-2020 12:34 PM

Hi @paulsonalec

It seems to work better if you run PROC SORT before the DATA step:

data have;
infile datalines delimiter=',';
input id1 $ id2 $;
datalines;
M,A
X,Y
A,B
C,D
B,C
V,W
W,X
Y,Z
;

proc sort data=have out=have_sorted;
	by id2 id1;
run;
data want;
	set have_sorted;
	_lag = lag(ID2);
	if _lag ne ID1 then count+1;
	retain count;
	drop _lag;
run;

paulsonalec · Posted 02-06-2020 12:42 PM

Thanks. Again, lag/sort isn't an option here, unfortunately. Consider this possibility (I've added row Z,S):

data have;
infile datalines delimiter=',';
input id1 $ id2 $;
datalines;
M,A
X,Y
A,B
C,D
B,C
Z,S
V,W
W,X
Y,Z
;

proc sort data=have out=have_sorted;
	by id2 id1;
run;

data want;
	set have_sorted;
	_lag = lag(ID2);
	if _lag ne ID1 then count+1;
	retain count;
	drop _lag;
run;

Reeza · Posted 02-06-2020 12:56 PM

You need a recursive hash solution to find your groups.

I believe this solution works for what you need; run the example first to see whether I'm understanding your problem correctly. It helps if your sample data resembles your real data as closely as possible, i.e. if the sort order isn't guaranteed, don't show the data as sorted. The hash solution doesn't depend on any order.
https://gist.github.com/statgeek/14e3aa2a9f718f551cd98134e9ceed30
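
For anyone who can't open the gist, here is a minimal, untested sketch of one order-independent approach. It is not the code from the gist: it swaps in a union-find (disjoint-set) structure kept in a hash object instead of a recursive lookup. The names parent_of, grp_of, findroot and ngroups are made up for illustration, and it assumes id1 and id2 are character values that fit in $8, as in the sample data.

```sas
/* Sketch only: union-find over a hash object; not the gist's code. */
data want(keep=id1 id2 grouper);
  length node parent r1 r2 $8 grouper rc 8;   /* assumes ids fit in $8 */

  declare hash parent_of();        /* node -> parent in the union-find forest */
  parent_of.defineKey('node');
  parent_of.defineData('parent');
  parent_of.defineDone();

  declare hash grp_of();           /* root node -> group number */
  grp_of.defineKey('node');
  grp_of.defineData('grouper');
  grp_of.defineDone();

  /* Pass 1: merge the two ids on every row, in whatever order rows arrive */
  do until (eof1);
    set have end=eof1;
    node = id1; link findroot; r1 = node;
    node = id2; link findroot; r2 = node;
    if r1 ne r2 then do;
      node = r1; parent = r2;
      rc = parent_of.replace();    /* attach one root under the other */
    end;
  end;

  /* Pass 2: number the roots in order of first appearance */
  do until (eof2);
    set have end=eof2;
    node = id1; link findroot;
    if grp_of.find() ne 0 then do;
      ngroups + 1;
      grouper = ngroups;
      rc = grp_of.add();
    end;
    output;
  end;
  stop;

findroot:                          /* on return, node holds the root */
  if parent_of.find() ne 0 then do;
    parent = node;                 /* unseen node starts as its own root */
    rc = parent_of.add();
  end;
  do while (parent ne node);
    node = parent;
    rc = parent_of.find();
  end;
return;
run;
```

Rows come out in their original order, and because pass 1 only ever merges sets, shuffling the rows of have should change at most which group gets which number, not which rows end up together.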

There are a few users on here who can develop a solution related to this problem, I'm not one of them 😞

paulsonalec · Posted 02-06-2020 01:49 PM

Thanks, this appears responsive. I really appreciate the help!

novinosrin · Posted 02-06-2020 03:44 PM

data have;
infile datalines delimiter=',';
input id1 $ id2 $;
datalines;
M,A
X,Y
A,B
C,D
B,C
Z,S
V,W
W,X
Y,Z
;
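data want ;
 if _n_=1 then do;
  if 0 then set have; 
  /* H: look up a row by id2, to walk a chain backwards */
  dcl hash H (dataset:'have') ;
  h.definekey  ("id2") ;
  h.definedata ("id2","id1") ;
  h.definedone () ;
  /* H1: look up a row by id1, to walk a chain forwards */
  dcl hash H1 (dataset:'have') ;
  h1.definekey  ("id1") ;
  h1.definedata ("id1","id2") ;
  h1.definedone () ;
 end;
 set have( rename=(id1=v1 id2=v2)) ;
 /* t holds ids already visited; _iorc_ is reused as its fill counter */
 array t(999) $ _temporary_;
 /* continue only for rows whose ids have not been visited yet */
 if v1 not in t and v2 not in t;
 grp+1;
 id1=v1;
 id2=v2;
 output;
 /* forward: keep finding the row whose id1 equals the current id2 */
 do while(h1.find(key:v2)=0);
  output;
  if v2 not in t then do;
   _iorc_+1;
   t(_iorc_)=v2;
  end;
  v2=id2;
 end;
 /* backward: keep finding the row whose id2 equals the current id1 */
 do while(h.find(key:v1)=0);
  output;
  if v1 not in t then do;
   _iorc_+1;
   t(_iorc_)=v1;
  end;
  v1=id1;
 end;
 drop v:;
run;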

paulsonalec · Posted 02-06-2020 03:52 PM

This is unbelievable, thank you!

Ksharp · Posted 02-06-2020 11:47 PM

Is it a one-to-one match or a one-to-many match?


data have;
infile cards ;
input from $  to $ ;
cards;
1     2
1     3
4     5
5     2
9     4
6     7
8     7
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
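
As a side note, the grouping asked for in this thread is a connected-components problem. If SAS/OR happens to be licensed at your site (an assumption; it is not part of Base SAS), PROC OPTNET's CONCOMP statement may be a shorter route than a hand-written hash search. The sketch below is untested against this thread's data; it runs on the id1/id2 version of have, and node_groups is a made-up output name.

```sas
/* Sketch only: requires a SAS/OR license; each row is treated as an undirected link. */
proc optnet data_links=have out_nodes=node_groups;
  data_links_var from=id1 to=id2;
  concomp;                       /* label connected components */
run;

/* node_groups has one row per id, with its component number in concomp. */
/* Attach that number back to the original rows via id1.                 */
proc sql;
  create table want as
  select h.id1, h.id2, n.concomp as grouper
  from have as h
       left join node_groups as n
       on h.id1 = n.node;
quit;
```

Because both ids on a row always fall in the same component, joining on id1 alone is enough. Since links are undirected here, one-to-many data like the from/to example above should not need special handling.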

novinosrin · Posted 02-07-2020 07:38 AM

Very good curiosity and thinking @Ksharp
