BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Sejin
Obsidian | Level 7

Hello. I have a trouble in assiging same id by group.

 

I have a dataset that looks like this.

 

obsvar1var2
1xa
2xb
3yc
4yd
5ye
6yb
7zf
8zg
9zh
10zi

 

I want to assign codes based on the var1. So obs 1, 2 should be assigned to the same id because they share "x" value on var1.

However, I also have another rule which is assigning same id if observations share same value of var2. For example, following this rule, obs 1-6 should be assigned same id because obs 6 has "b" on var2 which is same as that of obs2.

 

Could anybody help me find a code to solve the problem?

Thanks for reading the thread.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

If you do not have SAS/OR. Try my code .

 

data have ;
infile cards expandtabs truncover;
input obs	from $ to $;
drop obs;
cards;
1	a	x
2	a	y
3	b	x
4	b	 
5	b	z
6	c	u
7	c	v
8	c	w
9	d	 
10	d	m
;
run;




data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

proc sql;
create table key as
 select * from want
  where node in (select from from have);
create table final_want as
 select a.*,b.household from have as a left join key as b
  on a.from=key.node;
quit;

proc print;run;

View solution in original post

8 REPLIES 8
mkeintz
PROC Star

You have stated an example of a rule that says:

  1. obs 1 and 2 get ID=1 since they both have var1=X
  2. obs 6 gets ID=1 since it shares var2=b with obs 2

then does the rule also imply that obs 3,4, and 5 also get ID=1 since they share var1=Y with obs 6?

 

How recursive is your linking rule?  Is an ID group comprised of all obs that share either var1 or var2 with at least one other group member?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Sejin
Obsidian | Level 7

The way you understood is absoulely true. Rule 1 and 2 imply that obs 1 through 6 should be assigned the same code.

Also, you can summarize that members of a group that has been assigned the same code share either same var1 or var2 wih at least one of the members.

mkeintz
PROC Star

Do you have SAS/OR  /ETS?  PROC OPTNET, which is in sas/or, has a solution for this, using the CONCOMP (connected components) statement.

 

data have;
  input var1 :$1. var2 :$1. @@;
datalines;
x a  x b  y c  y d  y e  y b  z f  z g  z h  z i
run;

data need;
  set have;
  from=cats('_',var1);
  to=cats('-',var2);
run;

proc optnet data_links=need out_nodes=ON (index=(node));
  concomp;
run;

data want ;
  set have ;
  node=cats('_',var1);
  set on key=node / unique;
  drop node;
run;

 

Data set NEED is created to prefix '_' to all var1 values, and prefix '-' to all var2 values, so that otherwise identical "node" values in var1 and var2 are not seen as the same node.  The CONCOMP statement in proc optnet assign a connected-component id - take a look at the ON dataset.

 

But I suspect you need to work through this problem using the data step.  What have you tried so far?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PGStats
Opal | Level 21

Your problem amounts to finding all connected components in a network (or a graph). This problem is addressed by proc optnet. If you don't have a SAS/OR licence, you can use my subgraphs macro available here to find the components. Here's how to do it:

 

data have;
input obs var1 $ var2 $; 
datalines;
1 x a 
2 x b 
3 y c 
4 y d 
5 y e 
6 y b 
7 z f 
8 z g 
9 z h 
10 z i 
11 u v
;

proc sql;
create table arcs as
select 
    a.obs as from, 
    b.obs as to
from 
    have as a inner join 
    have as b on a.obs <= b.obs and (a.var1=b.var1 or a.var2=b.var2);
quit; 

/* get the subgraphs macro from 

https://communities.sas.com/t5/SAS-Communities-Library/How-to-find-all-connected-components-in-a-graph/ta-p/231539

then include the macro definition */

%include "&sasforum.\subgraphsmacro.sas";

/* Call the macro */
%subgraphs(arcs, from=from, to=to, out=sets);

/* Rename and reorder output */
proc sql;
create table newIds as
select 
    clust as id,
    node as obs
from sets
order by id, obs;
select * from newIds;
quit;

 

 

PG
Sejin
Obsidian | Level 7

I can't thank you enough.

After few adjustment, I could get the result that I wanted almost perfectly.

I have some missing values in var2 and I don't want those missing values to be regarded as the same value which will make them to be assiigned one signle id according to the rule nubmer 2. Instead, I want those cases to be an exception for the rule number 2.

For example, if I have a dataset like this.

 

obsvar1var2
1ax
2ay
3bx
4b 
5bz
6cu
7cv
8cw
9d 
10dm

I want a group id for obs 1-5 and another id for obs 6-8 and the other for obs 9-10.

Thank you again for the great reply you gave me before.

I tried to understand the code you gave me and adjust it as I wanted but I failed to do so.

It will be wonderful if you give me an extra help.

Thank you.

Ksharp
Super User

If you have SAS/OR . Base on @mkeintz

 

data have missing;
infile cards expandtabs truncover;
input obs	var1 $ var2 $;
if not missing(var2) then output have;
 else output missing;
drop obs;
cards;
1	a	x
2	a	y
3	b	x
4	b	 
5	b	z
6	c	u
7	c	v
8	c	w
9	d	 
10	d	m
;
run;

data need;
  set have;
  from=cats('_',var1);
  to=cats('-',var2);
run;

proc optnet data_links=need out_nodes=ON (index=(node));
  concomp;
run;

data temp ;
  set have ;
  node=cats('_',var1);
  set on key=node / unique;
  drop node;
run;
data temp1;
 set temp missing;
 by var1;
run;
proc sort data=temp1 out=temp2;
 by var1 descending concomp;
run;
data want;
 set temp2;
 by var1;
 retain new;
 if first.var1 then call missing(new);
 if not missing(concomp) then new=concomp;
 drop concomp;
run;
 
proc print;run;
Ksharp
Super User

If you do not have SAS/OR. Try my code .

 

data have ;
infile cards expandtabs truncover;
input obs	from $ to $;
drop obs;
cards;
1	a	x
2	a	y
3	b	x
4	b	 
5	b	z
6	c	u
7	c	v
8	c	w
9	d	 
10	d	m
;
run;




data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

proc sql;
create table key as
 select * from want
  where node in (select from from have);
create table final_want as
 select a.*,b.household from have as a left join key as b
  on a.from=key.node;
quit;

proc print;run;
Sejin
Obsidian | Level 7

It worked surprisingly well.

I am extremely grateful for your help.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1416 views
  • 3 likes
  • 4 in conversation