I have a dataset in sas like this:
ID1 ID2
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
I want to check for ID1 vs ID2 assignent, and find duplicate IDs, e.g. 11 is assigned to 22 and then 11>>34,22 >>35, 35>>9,9>>65 and 34 >> 43 . So all these IDs should be mapped to single ID, say 11(we can choose any of them). I want to create a 3rd variable, where I will assign all values to a single ID say ID_11
so Final output will be like:
ID1 ID2 ID3
11 22 ID_11
11 34 ID_11
22 35 ID_11
35 9 ID_11
41 10 ID_10
52 87 ID_87
9 65 ID_11
34 43 ID_11
I found out that, its a disjoint set problem. how to do this using sas?
If you know that all your id values are integers in some range (say from 1 through 100), you actually can do this with an array:
data have;
input ID1 ID2;
datalines;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
run;
data want;
array src {100} _temporary_;
if _n_=1 then do until (eod);
set have end=eod;
src{id2}=id1;
end;
set have ;
source=id1;
do Nlinks=1 by 1 while (src{source}^=.);
source=src{source};
end;
run;
Caveat: This depends on each sequence of links going to a final source. I.e. no collections of links forms a repeating cycle. The program would require some minor changes to deal with cycles.
Notes:
This can be solved using Hash, however, make sure you don't have infinite loops, if you do, set a loop limit (say the total number of your obs), when reached, exit the loop regardless.
data have;
input ID1 ID2
;
cards;
11 22
22 35
35 9
41 10
52 87
9 65
;
run;
data want;
if _n_=1 then do;
dcl hash h(dataset:'work.have');
h.definekey('id1');
h.definedata(all:'y');
h.definedone();
call missing(id1, id2);
end;
set have(rename=(id1=_id1 id2=_id2));
length id3 $20.;
rc=0;
id1=_id1;
do while (rc=0);
rc=h.find();
id1=id2;
end;
id3=catx('_', 'ID', PUT(id1,BEST.));
KEEP ID3 _id:;
run;
sorry, There could be duplicate mapping in ID1, I updated the example case
Can we make these assumptions?
If that's the case, the programming is easy enough:
data temp;
set have;
fmtname='$test';
rename id1=start id2=label;
run;
proc format cntlin=temp;
run;
data want;
set have;
length id3 id4 $ 5;
id3 = put(id1, $test.);
id4 = id3;
do i=1 to 500 until (id3 = id4);
if id3 ne id4 then id3 = id4;
id4 = put(id3, $test.);
end;
id3 = 'ID_' || id3;
drop i id4;
run;
EDITED to add the missing lines.
data have;
infile cards ;
input from $ to $ ;
cards;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Sorry. I am afraid you have to use Hash Table to search tree.
Or you could write some SAS/OR code, but you need SAS/OR.
And post your question at OR forum, @RobPratt is there .
No, we don;t have SAS/OR. Can it be done using arrays? I saw a solution, but not sure how that works:
data unique_id;
set x end = last;
/*Create an array to store mapping and intitialize array*/
array mapping{&n_obs,2} _temporary_;
if _N_ = 1 then
do;
do i = 1 to &n_obs.;
mapping[i,1] = i;
mapping[i,2] = i;
end;
end;
/*Create an array to store pr_row_num and row_num*/
array lookup{2} row_num pr_row_num;
orig_val1 = mapping[lookup[1],2];
orig_val2 = mapping[lookup[2],2];
max_val = max(row_num,pr_row_num);
replaced_val = min(lookup[1],lookup[2],mapping[lookup[1],2],mapping[lookup[2],2]);
/*Loop through the mapping file and change orig_val1 or orig_val2 in the mapping array to replaced_val*/
do i = 1 to &n_obs.;
if mapping[i,2] = orig_val1 or mapping[i,2] = orig_val2 then
mapping[i,2] = replaced_val;
end;
if last;
do i = 1 to &n_obs.;
key1 = mapping[i,1];
key2 = mapping[i,2];
output;
end;
run;
Sorry. As far as I know, must Hash Table, no choice.
Or you could switch into SAS/OR ,just as @RobPratt said.
This functionality is called "connected components" and is available in:
If you know that all your id values are integers in some range (say from 1 through 100), you actually can do this with an array:
data have;
input ID1 ID2;
datalines;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
run;
data want;
array src {100} _temporary_;
if _n_=1 then do until (eod);
set have end=eod;
src{id2}=id1;
end;
set have ;
source=id1;
do Nlinks=1 by 1 while (src{source}^=.);
source=src{source};
end;
run;
Caveat: This depends on each sequence of links going to a final source. I.e. no collections of links forms a repeating cycle. The program would require some minor changes to deal with cycles.
Notes:
Edited on 12/15. Thanks to @Matthew_Galati, I've revised the program and my comments. Please ignore all the text is strikethrough font.
If you have SAS/OR, you can get this directly by asking PROC OPTNET for "connected components". It becomes a trivial task:
data have;
input ID1 ID2;
retain w 1;
datalines;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
run;
proc optnet data_links=have (rename=(id1=from id2=to w=weight))
graph_direction=undirected out_nodes=nodes ;
concomp;
shortpath out_weights=shdist;
run;
proc sql;
create table want as
select h.*, n.concomp from
have as h left join nodes as n
on h.id1=n.node;
quit;
Notes:
path_
Obs source sink weight
1 11 22 1
2 11 34 1
3 11 35 2
4 11 9 3
5 11 65 4
6 11 43 2
7 22 11 1
8 22 34 2
9 22 35 1
10 22 9 2
11 22 65 3
12 22 43 3
13 34 11 1
14 34 22 2
15 34 35 3
16 34 9 4
17 34 65 5
18 34 43 1
19 35 11 2
20 35 22 1
Obs node concomp
1 11 1
2 22 1
3 34 1
4 35 1
5 9 1
6 41 2
7 10 2
8 52 3
9 87 3
10 65 1
11 43 1
Obs ID1 ID2 w concomp
1 9 65 1 1
2 11 22 1 1
3 11 34 1 1
4 22 35 1 1
5 34 43 1 1
6 35 9 1 1
7 41 10 1 2
8 52 87 1 3
You do not need to add the weight variable for optnet shortest path to get the 'level'. Unweighted graphs assume unit weights on all edges.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.