Hello. I have a trouble in assiging same id by group.
I have a dataset that looks like this.
obs | var1 | var2 |
1 | x | a |
2 | x | b |
3 | y | c |
4 | y | d |
5 | y | e |
6 | y | b |
7 | z | f |
8 | z | g |
9 | z | h |
10 | z | i |
I want to assign codes based on the var1. So obs 1, 2 should be assigned to the same id because they share "x" value on var1.
However, I also have another rule which is assigning same id if observations share same value of var2. For example, following this rule, obs 1-6 should be assigned same id because obs 6 has "b" on var2 which is same as that of obs2.
Could anybody help me find a code to solve the problem?
Thanks for reading the thread.
If you do not have SAS/OR. Try my code .
data have ;
infile cards expandtabs truncover;
input obs from $ to $;
drop obs;
cards;
1 a x
2 a y
3 b x
4 b
5 b z
6 c u
7 c v
8 c w
9 d
10 d m
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
proc sql;
create table key as
select * from want
where node in (select from from have);
create table final_want as
select a.*,b.household from have as a left join key as b
on a.from=key.node;
quit;
proc print;run;
You have stated an example of a rule that says:
then does the rule also imply that obs 3,4, and 5 also get ID=1 since they share var1=Y with obs 6?
How recursive is your linking rule? Is an ID group comprised of all obs that share either var1 or var2 with at least one other group member?
The way you understood is absoulely true. Rule 1 and 2 imply that obs 1 through 6 should be assigned the same code.
Also, you can summarize that members of a group that has been assigned the same code share either same var1 or var2 wih at least one of the members.
Do you have SAS/OR /ETS? PROC OPTNET, which is in sas/or, has a solution for this, using the CONCOMP (connected components) statement.
data have;
input var1 :$1. var2 :$1. @@;
datalines;
x a x b y c y d y e y b z f z g z h z i
run;
data need;
set have;
from=cats('_',var1);
to=cats('-',var2);
run;
proc optnet data_links=need out_nodes=ON (index=(node));
concomp;
run;
data want ;
set have ;
node=cats('_',var1);
set on key=node / unique;
drop node;
run;
Data set NEED is created to prefix '_' to all var1 values, and prefix '-' to all var2 values, so that otherwise identical "node" values in var1 and var2 are not seen as the same node. The CONCOMP statement in proc optnet assign a connected-component id - take a look at the ON dataset.
But I suspect you need to work through this problem using the data step. What have you tried so far?
Your problem amounts to finding all connected components in a network (or a graph). This problem is addressed by proc optnet. If you don't have a SAS/OR licence, you can use my subgraphs macro available here to find the components. Here's how to do it:
data have;
input obs var1 $ var2 $;
datalines;
1 x a
2 x b
3 y c
4 y d
5 y e
6 y b
7 z f
8 z g
9 z h
10 z i
11 u v
;
proc sql;
create table arcs as
select
a.obs as from,
b.obs as to
from
have as a inner join
have as b on a.obs <= b.obs and (a.var1=b.var1 or a.var2=b.var2);
quit;
/* get the subgraphs macro from
https://communities.sas.com/t5/SAS-Communities-Library/How-to-find-all-connected-components-in-a-graph/ta-p/231539
then include the macro definition */
%include "&sasforum.\subgraphsmacro.sas";
/* Call the macro */
%subgraphs(arcs, from=from, to=to, out=sets);
/* Rename and reorder output */
proc sql;
create table newIds as
select
clust as id,
node as obs
from sets
order by id, obs;
select * from newIds;
quit;
I can't thank you enough.
After few adjustment, I could get the result that I wanted almost perfectly.
I have some missing values in var2 and I don't want those missing values to be regarded as the same value which will make them to be assiigned one signle id according to the rule nubmer 2. Instead, I want those cases to be an exception for the rule number 2.
For example, if I have a dataset like this.
obs | var1 | var2 |
1 | a | x |
2 | a | y |
3 | b | x |
4 | b | |
5 | b | z |
6 | c | u |
7 | c | v |
8 | c | w |
9 | d | |
10 | d | m |
I want a group id for obs 1-5 and another id for obs 6-8 and the other for obs 9-10.
Thank you again for the great reply you gave me before.
I tried to understand the code you gave me and adjust it as I wanted but I failed to do so.
It will be wonderful if you give me an extra help.
Thank you.
If you have SAS/OR . Base on @mkeintz
data have missing;
infile cards expandtabs truncover;
input obs var1 $ var2 $;
if not missing(var2) then output have;
else output missing;
drop obs;
cards;
1 a x
2 a y
3 b x
4 b
5 b z
6 c u
7 c v
8 c w
9 d
10 d m
;
run;
data need;
set have;
from=cats('_',var1);
to=cats('-',var2);
run;
proc optnet data_links=need out_nodes=ON (index=(node));
concomp;
run;
data temp ;
set have ;
node=cats('_',var1);
set on key=node / unique;
drop node;
run;
data temp1;
set temp missing;
by var1;
run;
proc sort data=temp1 out=temp2;
by var1 descending concomp;
run;
data want;
set temp2;
by var1;
retain new;
if first.var1 then call missing(new);
if not missing(concomp) then new=concomp;
drop concomp;
run;
proc print;run;
If you do not have SAS/OR. Try my code .
data have ;
infile cards expandtabs truncover;
input obs from $ to $;
drop obs;
cards;
1 a x
2 a y
3 b x
4 b
5 b z
6 c u
7 c v
8 c w
9 d
10 d m
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
proc sql;
create table key as
select * from want
where node in (select from from have);
create table final_want as
select a.*,b.household from have as a left join key as b
on a.from=key.node;
quit;
proc print;run;
It worked surprisingly well.
I am extremely grateful for your help.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.