BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
munitech4u
Quartz | Level 8

I have a dataset in sas like this:

 

ID1 ID2

11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43

 

I want to check for ID1 vs ID2 assignent, and find duplicate IDs, e.g. 11 is assigned to 22 and then 11>>34,22 >>35, 35>>9,9>>65 and 34 >> 43 . So all these IDs should be mapped to single ID, say 11(we can choose any of them). I want to create a 3rd variable, where I will assign all values to a single ID say ID_11

 

so Final output will be like:

ID1   ID2    ID3

11     22     ID_11

11     34     ID_11

22     35     ID_11

35     9       ID_11

41     10     ID_10

52     87     ID_87

9       65     ID_11

34     43     ID_11

 

I found out that, its a disjoint set problem. how to do this using sas?

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

If you know that all your id values are integers in some range (say from 1 through 100), you actually can do this with an array:

 

data have;
  input ID1 ID2;
datalines;
11 22
11 34
22 35
35  9
41 10
52 87
9  65
34 43
run;
data want;
  array src {100} _temporary_;
  if _n_=1 then do until (eod);
    set have end=eod;
    src{id2}=id1;
  end;

  set have ;
  source=id1;
  do Nlinks=1 by 1 while (src{source}^=.);
    source=src{source};
  end;
run;

 

Caveat:  This depends on each sequence of links going to a final source.  I.e. no collections of links forms a repeating cycle.  The program would require some minor changes to deal with cycles.

 

Notes:

  1. The 1-dimensional array SRC has the value of ID1 corresponding to the ID2 element of the array. 

  2. For each pair ID1 is treated as the source, and ID2 as the sink.  The program iteratively finds the source of a source, until an id without a source is encountered.  For example, SRC{11} is a missing value, even though SRC{22}=11  (as well as others).
  3. The variable Nlinks is the number of links required to go from ID2 to the ultimate source.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

14 REPLIES 14
Haikuo
Onyx | Level 15

This can be solved using Hash, however, make sure you don't have infinite loops, if you do, set a loop limit (say the total number of your obs), when reached, exit the loop regardless.

 

data have;
input ID1 ID2
;
cards;
11     22
22     35
35     9
41     10
52     87
9       65
;

run;

data want;
if _n_=1 then do;
dcl hash h(dataset:'work.have');
h.definekey('id1');
h.definedata(all:'y');
h.definedone();
call missing(id1, id2);
end;

set have(rename=(id1=_id1 id2=_id2));
length id3 $20.;
rc=0;
id1=_id1;
do while (rc=0);
rc=h.find();
id1=id2;
end;
id3=catx('_', 'ID', PUT(id1,BEST.));
KEEP ID3 _id:;
run;
munitech4u
Quartz | Level 8

sorry, There could be duplicate mapping in ID1, I updated the example case

Astounding
PROC Star

Can we make these assumptions?

 

  1. ID1 and ID2 are character
  2. ID1 maps to a single ID2 value only, never to two different ID2 values

If that's the case, the programming is easy enough:

 

data temp;

set have;

fmtname='$test';

rename id1=start id2=label;

run;

 

proc format cntlin=temp;

run;

 

data want;

set have;

length id3 id4 $ 5;

id3 = put(id1, $test.);

id4 = id3;

do i=1 to 500 until (id3 = id4);

   if id3 ne id4 then id3 = id4;

   id4 = put(id3, $test.);

end;

id3 = 'ID_' || id3;

drop i  id4;

run;

 

EDITED to add the missing lines.

munitech4u
Quartz | Level 8
sorry, There could be duplicate mapping in ID1, I updated the example case,
Ksharp
Super User

data have;
infile cards ;
input from $  to $ ;
cards;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
munitech4u
Quartz | Level 8
Thanks for your effort. Is there any alternative where we can avoid declaring hash?
Ksharp
Super User

Sorry. I am afraid you have to use Hash Table to search tree.

Or you could write some SAS/OR code, but you need SAS/OR.

And post your question at OR forum, @RobPratt is there .

munitech4u
Quartz | Level 8

No, we don;t have SAS/OR. Can it be done using arrays? I saw a solution, but not sure how that works:

 

 

data unique_id;
      set x end = last;
      /*Create an array to store mapping and intitialize array*/
      array mapping{&n_obs,2} _temporary_;
      if _N_ = 1 then
      do;
      do i = 1 to &n_obs.;
                  
            mapping[i,1] = i;
            mapping[i,2] = i;
      end;
      end;
      /*Create an array to store pr_row_num and row_num*/
      array lookup{2} row_num pr_row_num;
        orig_val1 = mapping[lookup[1],2];
      orig_val2 = mapping[lookup[2],2];
      max_val = max(row_num,pr_row_num);
      replaced_val = min(lookup[1],lookup[2],mapping[lookup[1],2],mapping[lookup[2],2]);
      /*Loop through the mapping file and change orig_val1 or orig_val2 in the mapping array to replaced_val*/
      do i = 1 to &n_obs.;
            if mapping[i,2] = orig_val1 or mapping[i,2] = orig_val2 then
            mapping[i,2] = replaced_val;
      end;

      if last;
      do i = 1 to &n_obs.;
            key1 = mapping[i,1];
            key2 = mapping[i,2];
            output;
      end;
run;

 

Ksharp
Super User

Sorry. As far as I know, must Hash Table, no choice.

Or you could switch into SAS/OR ,just as @RobPratt said.

RobPratt
SAS Super FREQ

This functionality is called "connected components" and is available in:

  • PROC OPTNET or PROC OPTMODEL in SAS/OR
  • PROC OPTGRAPH in SAS Social Network Analysis
  • PROC NETWORK in SAS Visual Data Mining and Machine Learning (Viya)
  • PROC OPTNETWORK or PROC OPTMODEL in SAS Optimization (Viya)
mkeintz
PROC Star

If you know that all your id values are integers in some range (say from 1 through 100), you actually can do this with an array:

 

data have;
  input ID1 ID2;
datalines;
11 22
11 34
22 35
35  9
41 10
52 87
9  65
34 43
run;
data want;
  array src {100} _temporary_;
  if _n_=1 then do until (eod);
    set have end=eod;
    src{id2}=id1;
  end;

  set have ;
  source=id1;
  do Nlinks=1 by 1 while (src{source}^=.);
    source=src{source};
  end;
run;

 

Caveat:  This depends on each sequence of links going to a final source.  I.e. no collections of links forms a repeating cycle.  The program would require some minor changes to deal with cycles.

 

Notes:

  1. The 1-dimensional array SRC has the value of ID1 corresponding to the ID2 element of the array. 

  2. For each pair ID1 is treated as the source, and ID2 as the sink.  The program iteratively finds the source of a source, until an id without a source is encountered.  For example, SRC{11} is a missing value, even though SRC{22}=11  (as well as others).
  3. The variable Nlinks is the number of links required to go from ID2 to the ultimate source.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
munitech4u
Quartz | Level 8
Thanks. The solution is simply elegant and simple
mkeintz
PROC Star

Edited on 12/15.  Thanks to @Matthew_Galati, I've revised the program and my comments.  Please ignore all the text is strikethrough font.

 

If you have SAS/OR, you can get this directly by asking PROC OPTNET for "connected components".  It becomes a trivial task:

 

data have;
  input ID1 ID2;
  retain w 1;
datalines;
11 22
11 34
22 35
35 9
41 10
52 87
9 65
34 43
run;


proc optnet data_links=have (rename=(id1=from id2=to w=weight))
      graph_direction=undirected out_nodes=nodes ;
  concomp;
shortpath out_weights=shdist; run; proc sql; create table want as select h.*, n.concomp from have as h left join nodes as n on h.id1=n.node; quit;

 

Notes:

  1. I had to add a "weight" variable (var w) to dataset HAVE, which proc optnet uses as the "distance" between ID1 and ID2 (optnet is commonly used to generate the shortest distances between all pairs of connected nodes (id's in your case).  As long as W is a non-missing numeric value, it works for your problem, because you care about connectivity, not distance.  Of course if your constant W value is 1, then the "distance" is the number of hopes from ID2 to ID2.

    Because, in the absence of a weight variable (thanks @Matthew_Galati), the default "distance" from ID2 to ID1 is taken as 1.  As a result the variable path_weight in SHDIST is just the number of hops from ID2 to SOURCE.  Here's what the 1st 20 obs in SHDIST data set looks like:

                     path_

    Obs source sink weight

     1    11    22    1

     2    11    34    1

     3    11    35    2

     4    11     9    3

     5    11    65    4

     6    11    43    2

     7    22    11    1

     8    22    34    2

     9    22    35    1

    10    22     9    2

    11    22    65    3

    12    22    43    3

    13    34    11    1

    14    34    22    2

    15    34    35    3

    16    34     9    4

    17    34    65    5

    18    34    43    1

    19    35    11    2

    20    35    22    1

  2. The "concomp" statement asks proc optnet to assign each node (id) to a connected-component identifier.  In your case here's, what the "nodes" dataset looks like:

    Obs node concomp

     1   11    1

     2   22    1

     3   34    1

     4   35    1

     5    9    1

     6   41    2

     7   10    2

     8   52    3

     9   87    3

    10   65    1

    11   43    1

  3. And the resulting dataset WANT:

    Obs    ID1    ID2    w    concomp
    
     1       9     65    1       1
     2      11     22    1       1
     3      11     34    1       1
     4      22     35    1       1
     5      34     43    1       1
     6      35      9    1       1
     7      41     10    1       2
     8      52     87    1       3
    
    
    
    
    
    
    
    
    
    
    
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Matthew_Galati
SAS Employee

You do not need to add the weight variable for optnet shortest path to get the 'level'. Unweighted graphs assume unit weights on all edges.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 2331 views
  • 4 likes
  • 7 in conversation