Solved: Re: Pair two ID's based on the smallest distance between them and use ...

JelmerB · Posted 11-24-2020 12:18 PM

Hello everyone,

I have a table with the calculated distance (var1) between customers (ID_a and ID_b). Var2 and var3 contain some other variables that are not relevant for the connection but I want to keep them. The data is sorted on ID_a and var1.

data have;
input (ID_a ID_b var1 var2 var3)($);
cards;
	A B 12 xxx yyy
	A C 36 xxx yyy
	C D 17 xxx yyy
	D A 18 xxx yyy
	D F 80 xxx yyy
	G B 20 xxx yyy
	G K 32 xxx yyy

run;

Now I want to pair ID's with the smallest distance between them, and the ID may only be used once in the output. So the output should look like this:

data want;
input (ID_a ID_b var1 var2 var3)($);
cards;
   A B 12 xxx yyy
   C D 17 xxx yyy
   G K 32 xxx yyy

run;

The goal is to get a list where every ID is paired to one other ID, with the closest possible distance. Of course there are multiple combinations, because A can be the closest to B, while at the same time B can be more close to C. If possible it would be nice to get the combinations with the lowest total distance.

The data set is ~2600 rows with 224 unique ID's. So the output should be 112 rows with a ID_a - ID_b combination.

What would be a solution to do this? I have been working on PROC SORT with NODUPKEY, but that is not going to work because each row consists of a unique combination. I am thinking about using a macro to select every first possibility of ID_a, while checking if the ID already has been used in ID_a or ID_b. If no, write the line to the output file and proceed to the next ID_a. If yes, then proceed to the second possibility/connection of ID_a and run the macro again. I can't find any example of such a routine, and I’m not sure this is the best solution. Maybe there are other ways to do this?

Thank you in advance for your suggestions!

PeterClemmensen · Posted 11-25-2020 05:29 AM

A small correction in the code below (changed the sample data to numeric IDs)

Let me know if it works 🙂

data have;
input ID_a ID_b var1 var2 $ var3 $;
cards;
1 2  12 xxx yyy
1 3  36 xxx yyy
3 4  17 xxx yyy
4 1  18 xxx yyy
4 6  80 xxx yyy
7 2  20 xxx yyy
7 11 32 xxx yyy
;

data want(drop = id);
   dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
   h.definekey('var1', 'ID_a', 'ID_b');
   h.definedata(all : 'Y');
   h.definedone();
   dcl hiter i('h');

   dcl hash u();
   u.definekey('id');
   u.definedone();

   if 0 then set have;
   id = .;

   do while (i.next()=0);
      if u.check(key : ID_a) and u.check(key : ID_b) then do;
         u.ref(key : ID_a, data : ID_a);
         u.ref(key : ID_b, data : ID_b);
         output;
      end;
   end;
run;

Result:

ID_a  ID_b  var1  var2  var3 
1     2     12    xxx   yyy 
3     4     17    xxx   yyy 
7     11    32    xxx   yyy

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

Reeza · Posted 11-24-2020 01:16 PM

This seems like a linear optimization problem - perhaps the traveling salesmen problem?
Do you have a license for SAS/OR? It is included in On Demand for Academics - seems like you want the traveling salesman problem with a directed graph since you know which nodes you can travel along.
https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=ornoaug&docsetTarget=orno...

Reeza · Posted 11-24-2020 01:17 PM

If you're not familiar with the traveling salesman problem here's the Wikipedia link:
https://en.wikipedia.org/wiki/Travelling_salesman_problem

PeterClemmensen · Posted 11-25-2020 04:16 AM

Here is a solution along the lines of the logic you outline yourself

Read the entire data set into an ordered hash object h.
Create an additional hash object u to keep track of the ids already encountered.
Traverse h one item at the time.
If both IDs already exist in U, they have already been encountered. Therefore nothing happens. If not, we add them to U and output the observation to the desired data set.

First, verify that the approach works as desired on your sample data set. Then test if it works on your actual data. 2600 Obs is no problem for the hash to handle. Let me know if it works for you.

data have;
input ID_a $ ID_b $ var1 var2 $ var3 $;
cards;
A B 12 xxx yyy
A C 36 xxx yyy
C D 17 xxx yyy
D A 18 xxx yyy
D F 80 xxx yyy
G B 20 xxx yyy
G K 32 xxx yyy
;

data want(drop = id);
   dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
   h.definekey('var1', 'ID_a', 'ID_b');
   h.definedata(all : 'Y');
   h.definedone();
   dcl hiter i('h');

   dcl hash u();
   u.definekey('id');
   u.definedone();

   if 0 then set have;
   id = '        ';

   do while (i.next()=0);
      if u.check(key : ID_a) and u.check(key : ID_b) then do;
         u.ref(key : ID_a, data : ID_a);
         u.ref(key : ID_b, data : ID_b);
         output;
      end;
   end;
run;

Result:

ID_a  ID_b  var1  var2  var3 
A     B     12    xxx   yyy 
C     D     17    xxx   yyy 
G     K     32    xxx   yyy

The DATA to DATA Step Macro
Blog: SASnrd

JelmerB · Posted 11-25-2020 05:17 AM

Thanks for sharing this solution! The code runs smoothly on the sample data set. However, in my real date set the ID_a and ID_b are numeric. I think this is the reason I get the following error: Variable ID_a has been defined as both character and numeric (and the same error for ID_b). What do I need to change to work around this?

PeterClemmensen · Posted 11-25-2020 05:29 AM

A small correction in the code below (changed the sample data to numeric IDs)

Let me know if it works 🙂

data have;
input ID_a ID_b var1 var2 $ var3 $;
cards;
1 2  12 xxx yyy
1 3  36 xxx yyy
3 4  17 xxx yyy
4 1  18 xxx yyy
4 6  80 xxx yyy
7 2  20 xxx yyy
7 11 32 xxx yyy
;

data want(drop = id);
   dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
   h.definekey('var1', 'ID_a', 'ID_b');
   h.definedata(all : 'Y');
   h.definedone();
   dcl hiter i('h');

   dcl hash u();
   u.definekey('id');
   u.definedone();

   if 0 then set have;
   id = .;

   do while (i.next()=0);
      if u.check(key : ID_a) and u.check(key : ID_b) then do;
         u.ref(key : ID_a, data : ID_a);
         u.ref(key : ID_b, data : ID_b);
         output;
      end;
   end;
run;

Result:

ID_a  ID_b  var1  var2  var3 
1     2     12    xxx   yyy 
3     4     17    xxx   yyy 
7     11    32    xxx   yyy

The DATA to DATA Step Macro
Blog: SASnrd

JelmerB · Posted 11-25-2020 11:07 AM

This is working for me. Thanks for helping me out!

PeterClemmensen · Posted 11-25-2020 11:15 AM

Anytime, glad to help 🙂

The DATA to DATA Step Macro
Blog: SASnrd

Pair two ID's based on the smallest distance between them and use every ID only 1 time

Re: Pair two ID's based on the smallest distance between them and use every ID only 1 time

Re: Pair two ID's based on the smallest distance between them and use every ID only 1 time

Re: Pair two ID's based on the smallest distance between them and use every ID only 1 time

Re: Pair two ID's based on the smallest distance between them and use every ID only 1 time

Re: Pair two ID's based on the smallest distance between them and use every ID only 1 time

Re: Pair two ID's based on the smallest distance between them and use every ID only 1 time

Re: Pair two ID's based on the smallest distance between them and use every ID only 1 time

Re: Pair two ID's based on the smallest distance between them and use every ID only 1 time

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!