I have a dataset in sas like this:
ID1 ID2
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
I want to check for ID1 vs ID2 assignent, and find duplicate IDs, e.g. 11 is assigned to 22 and then 11>>34,22 >>35, 35>>9,9>>65 and 34 >> 43 . So all these IDs should be mapped to single ID, say 11(we can choose any of them). I want to create a 3rd variable, where I will assign all values to a single ID say ID_11
so Final output will be like:
ID1 ID2 ID3
11 22 ID_11
11 34 ID_11
22 35 ID_11
35 9 ID_11
41 10 ID_10
52 87 ID_87
9 65 ID_11
34 43 ID_11
I found out that, its a disjoint set problem. how to do this using sas?
If you know that all your id values are integers in some range (say from 1 through 100), you actually can do this with an array:
data have;
input ID1 ID2;
datalines;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
run;
data want;
array src {100} _temporary_;
if _n_=1 then do until (eod);
set have end=eod;
src{id2}=id1;
end;
set have ;
source=id1;
do Nlinks=1 by 1 while (src{source}^=.);
source=src{source};
end;
run;
Caveat: This depends on each sequence of links going to a final source. I.e. no collections of links forms a repeating cycle. The program would require some minor changes to deal with cycles.
Notes:
This can be solved using Hash, however, make sure you don't have infinite loops, if you do, set a loop limit (say the total number of your obs), when reached, exit the loop regardless.
data have;
input ID1 ID2
;
cards;
11 22
22 35
35 9
41 10
52 87
9 65
;
run;
data want;
if _n_=1 then do;
dcl hash h(dataset:'work.have');
h.definekey('id1');
h.definedata(all:'y');
h.definedone();
call missing(id1, id2);
end;
set have(rename=(id1=_id1 id2=_id2));
length id3 $20.;
rc=0;
id1=_id1;
do while (rc=0);
rc=h.find();
id1=id2;
end;
id3=catx('_', 'ID', PUT(id1,BEST.));
KEEP ID3 _id:;
run;
sorry, There could be duplicate mapping in ID1, I updated the example case
Can we make these assumptions?
If that's the case, the programming is easy enough:
data temp;
set have;
fmtname='$test';
rename id1=start id2=label;
run;
proc format cntlin=temp;
run;
data want;
set have;
length id3 id4 $ 5;
id3 = put(id1, $test.);
id4 = id3;
do i=1 to 500 until (id3 = id4);
if id3 ne id4 then id3 = id4;
id4 = put(id3, $test.);
end;
id3 = 'ID_' || id3;
drop i id4;
run;
EDITED to add the missing lines.
data have;
infile cards ;
input from $ to $ ;
cards;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Sorry. I am afraid you have to use Hash Table to search tree.
Or you could write some SAS/OR code, but you need SAS/OR.
And post your question at OR forum, @RobPratt is there .
No, we don;t have SAS/OR. Can it be done using arrays? I saw a solution, but not sure how that works:
data unique_id;
set x end = last;
/*Create an array to store mapping and intitialize array*/
array mapping{&n_obs,2} _temporary_;
if _N_ = 1 then
do;
do i = 1 to &n_obs.;
mapping[i,1] = i;
mapping[i,2] = i;
end;
end;
/*Create an array to store pr_row_num and row_num*/
array lookup{2} row_num pr_row_num;
orig_val1 = mapping[lookup[1],2];
orig_val2 = mapping[lookup[2],2];
max_val = max(row_num,pr_row_num);
replaced_val = min(lookup[1],lookup[2],mapping[lookup[1],2],mapping[lookup[2],2]);
/*Loop through the mapping file and change orig_val1 or orig_val2 in the mapping array to replaced_val*/
do i = 1 to &n_obs.;
if mapping[i,2] = orig_val1 or mapping[i,2] = orig_val2 then
mapping[i,2] = replaced_val;
end;
if last;
do i = 1 to &n_obs.;
key1 = mapping[i,1];
key2 = mapping[i,2];
output;
end;
run;
Sorry. As far as I know, must Hash Table, no choice.
Or you could switch into SAS/OR ,just as @RobPratt said.
This functionality is called "connected components" and is available in:
If you know that all your id values are integers in some range (say from 1 through 100), you actually can do this with an array:
data have;
input ID1 ID2;
datalines;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
run;
data want;
array src {100} _temporary_;
if _n_=1 then do until (eod);
set have end=eod;
src{id2}=id1;
end;
set have ;
source=id1;
do Nlinks=1 by 1 while (src{source}^=.);
source=src{source};
end;
run;
Caveat: This depends on each sequence of links going to a final source. I.e. no collections of links forms a repeating cycle. The program would require some minor changes to deal with cycles.
Notes:
Edited on 12/15. Thanks to @Matthew_Galati, I've revised the program and my comments. Please ignore all the text is strikethrough font.
If you have SAS/OR, you can get this directly by asking PROC OPTNET for "connected components". It becomes a trivial task:
data have;
input ID1 ID2;
retain w 1;
datalines;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
run;
proc optnet data_links=have (rename=(id1=from id2=to w=weight))
graph_direction=undirected out_nodes=nodes ;
concomp;
shortpath out_weights=shdist;
run;
proc sql;
create table want as
select h.*, n.concomp from
have as h left join nodes as n
on h.id1=n.node;
quit;
Notes:
path_
Obs source sink weight
1 11 22 1
2 11 34 1
3 11 35 2
4 11 9 3
5 11 65 4
6 11 43 2
7 22 11 1
8 22 34 2
9 22 35 1
10 22 9 2
11 22 65 3
12 22 43 3
13 34 11 1
14 34 22 2
15 34 35 3
16 34 9 4
17 34 65 5
18 34 43 1
19 35 11 2
20 35 22 1
Obs node concomp
1 11 1
2 22 1
3 34 1
4 35 1
5 9 1
6 41 2
7 10 2
8 52 3
9 87 3
10 65 1
11 43 1
Obs ID1 ID2 w concomp
1 9 65 1 1
2 11 22 1 1
3 11 34 1 1
4 22 35 1 1
5 34 43 1 1
6 35 9 1 1
7 41 10 1 2
8 52 87 1 3
You do not need to add the weight variable for optnet shortest path to get the 'level'. Unweighted graphs assume unit weights on all edges.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.