BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Reeza
Super User

I posted the macro here as well if you wanted to try it:

https://gist.github.com/statgeek/14e3aa2a9f718f551cd98134e9ceed30

 

 

FreelanceReinh
Jade | Level 19

I reckon your output is based on a different version of T1, because when I run the following steps, I obtain the result shown further below. Please note that datasets T1 and T2 are overwritten again and again in the iteration (%DO %UNTIL loop).

data have;
input ID1 $ ID2 $;
cards;
A1 A2
A2 A3
A3 A1
A4 A5
A6 A7
A7 A8
A1 A9
;

data have2;
set have
    have(rename=(ID1=ID2 ID2=ID1));
run;

proc sort data=have2;
by ID1;
run;

data t1;
set have2;
by ID1;
g+first.ID1;
run;

proc print data=t1 noobs;
run;

Result:

ID1    ID2    g

A1     A2     1
A1     A9     1
A1     A3     1
A2     A3     2
A2     A1     2
A3     A1     3
A3     A2     3
A4     A5     4
A5     A4     5
A6     A7     6
A7     A8     7
A7     A6     7
A8     A7     8
A9     A1     9
munitech4u
Quartz | Level 8
oh, maybe thats why. It was so much confusing me
munitech4u
Quartz | Level 8
%do %until(&n1=&n2);

Now this is sum of g in 2 tables. and we want to run the loop, until it gets equal. Don't you think, there are chances that these numbers may match prior to optimum?

e.g. 1+2+5+10 and 1+3+4+10 are same, especially when the observations get large
FreelanceReinh
Jade | Level 19

@munitech4u wrote:
%do %until(&n1=&n2);

Now this is sum of g in 2 tables. and we want to run the loop, until it gets equal. Don't you think, there are chances that these numbers may match prior to optimum?

e.g. 1+2+5+10 and 1+3+4+10 are same, especially when the observations get large

I don't think so.

 

My reasoning behind this criterion was: After the initial assignment of group numbers on one "side" (i.e. based on ID1) in dataset HAVE2, the same procedure of reassignments of group numbers is applied alternately based on ID2, ID1, ID2, ID1 and so on.

As the new group numbers are always less than or equal to the previous group numbers (they are replaced by the minimum of their respective BY group), the sum of all group numbers is monotonically decreasing from one iteration step to the next. If it stays the same in one step (&n1=&n2), not a single group number may have changed in this step (because otherwise the decrease of this group number would have reduced the sum of all group numbers as well). This means, after the group numbers have been made consistent with regard to one of the two IDs in the previous iteration step, they now turn out to be already consistent also with regard to the other ID. Hence, the algorithm terminates.

 

A situation where the two sequences of group numbers from two consecutive steps are (1, 2, 5, 10) and (1, 3, 4, 10) is impossible, because group numbers can never increase (from 2 to 3 or from 4 to 5). The same argument holds for any two different sequences with equal sums (&n1=&n2).

 

Another question would be: Is it possible that the algorithm ends up in a sort of "local" optimum which is not the "global" optimum?

I say no. Because: In this case there would be at least two different IDs x, y such that their cluster numbers were equal (say, 1) in the "global optimum" solution and unequal (say, 1 and 2) in the "local optimum" solution. The former implies that there is a (finite) chain of linked IDs (linked by pairs in dataset HAVE2) from x to y all of which have cluster number 1. Following this same chain from both ends (x in cluster 1 and y in cluster 2) in the "local optimum" solution necessarily leads to a contradiction (an ID with two different cluster numbers).

 

Thanks for analyzing the algorithm so thoroughly. It would indeed be interesting to prove its correctness mathematically.

 

Since other solutions have now been suggested, you can also compare the results for validation purposes.

FreelanceReinh
Jade | Level 19

@munitech4u wrote:
Mine, is even more complex, as the relationship is not just one way, it can be two way. So like in the example give in other solution, the player id can occur in assignment id and vice versa.

I don't think that this makes it significantly more complex. In order to use the solutions in the other recent thread as much as possible, one could regard your ID1 and ID2 as completely separate sets of IDs (like the assignment and player IDs in the other thread), then assign the cluster numbers (these correspond to the group numbers g in my solution in the other thread) and finally merge the ID1 and ID2 values within each cluster. But intuitively I guess that the solution can be even simpler, given that there is only one type of ID.

munitech4u
Quartz | Level 8

I have an idea, but not sure how to execute this:

 

1. Get a dataset with all distinct IDs. 

2. Now Pick the first ID, and look for all the possible IDs in ID1 or ID2 that occurs with this ID, put them in cluster 1.

3. Now Get all other distinct IDs that occur above and repeat the step above. Keep updating the cluster.

4. Loop until we reach at a point, when we have iterated through all the IDs in cluster 1.

5. Eliminate the cluster 1 IDs from the dataset with distinct IDs and the dataset we are looking into.

6. Repeat step 2-5 until we are finished with all IDs. 

 

Example:

 

Distinct IDs

A1

A2

A3

A4

A5

A6

A7

A8

A9

 

First ID: A1

 

All possible combination from dataset:

A1 A2 

A1 A9

Cluster 1: A1,A2,A9

 

Loop Through cluster 1 except A1: For A2, possible combination:

A2 A3

Update cluster 1: A1,A2,A3,A9

 

Loop Through cluster 1 except A1,A2: For A9, possible combination:

A9 A1

Update cluster 1: A1,A2,A3,A9

 

Eliminate A1,A2,A3,A9 from distinct IDs and observations wherever they occur in dataset

 

Repeat above steps, keep incrementing the cluster.

 

 

 

Ksharp
Super User
Here is :





data have;
infile cards ;
input from $  to $ ;
cards;
A1 A2
A2 A3
A4 A5
A6 A7
A7 A8
A1 A9
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;



munitech4u
Quartz | Level 8
Thanks for your effort Xia. But this hash thing goes over my head. 😄

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 24 replies
  • 5683 views
  • 5 likes
  • 5 in conversation