BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JelmerB
Calcite | Level 5

Hello everyone,

 

have a table with the calculated distance (var1) between customers (ID_a and ID_b). Var2 and var3 contain some other variables that are not relevant for the connection but I want to keep them. The data is sorted on ID_a and var1.

 

 

data have;
input (ID_a ID_b var1 var2 var3)($);
cards;
	A B 12 xxx yyy
	A C 36 xxx yyy
	C D 17 xxx yyy
	D A 18 xxx yyy
	D F 80 xxx yyy
	G B 20 xxx yyy
	G K 32 xxx yyy

run;

 

Now I want to pair ID's with the smallest distance between them, and the ID may only be used once in the output. So the output should look like this:

data want;
input (ID_a ID_b var1 var2 var3)($);
cards;
   A B 12 xxx yyy
   C D 17 xxx yyy
   G K 32 xxx yyy

run;

 

The goal is to get a list where every ID is paired to one other ID, with the closest possible distance. Of course there are multiple combinations, because A can be the closest to B, while at the same time B can be more close to C. If possible it would be nice to get the combinations with the lowest total distance.

 

The data set is ~2600 rows with 224 unique ID's. So the output should be 112 rows with a ID_a - ID_b combination.

 

What would be a solution to do this? I have been working on PROC SORT with NODUPKEY, but that is not going to work because each row consists of a unique combination. I am thinking about using a macro to select every first possibility of ID_a, while checking if the ID already has been used in ID_a or ID_b. If no, write the line to the output file and proceed to the next ID_a. If yes, then proceed to the second possibility/connection of ID_a and run the macro again. I can't find any example of such a routine, and I’m not sure this is the best solution. Maybe there are other ways to do this?

 

Thank you in advance for your suggestions!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

A small correction in the code below (changed the sample data to numeric IDs)

 

Let me know if it works 🙂

 

data have;
input ID_a ID_b var1 var2 $ var3 $;
cards;
1 2  12 xxx yyy
1 3  36 xxx yyy
3 4  17 xxx yyy
4 1  18 xxx yyy
4 6  80 xxx yyy
7 2  20 xxx yyy
7 11 32 xxx yyy
;

data want(drop = id);
   dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
   h.definekey('var1', 'ID_a', 'ID_b');
   h.definedata(all : 'Y');
   h.definedone();
   dcl hiter i('h');

   dcl hash u();
   u.definekey('id');
   u.definedone();

   if 0 then set have;
   id = .;

   do while (i.next()=0);
      if u.check(key : ID_a) and u.check(key : ID_b) then do;
         u.ref(key : ID_a, data : ID_a);
         u.ref(key : ID_b, data : ID_b);
         output;
      end;
   end;
run;

Result:

 

ID_a  ID_b  var1  var2  var3 
1     2     12    xxx   yyy 
3     4     17    xxx   yyy 
7     11    32    xxx   yyy 

View solution in original post

7 REPLIES 7
Reeza
Super User
This seems like a linear optimization problem - perhaps the traveling salesmen problem?
Do you have a license for SAS/OR? It is included in On Demand for Academics - seems like you want the traveling salesman problem with a directed graph since you know which nodes you can travel along.
https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=ornoaug&docsetTarget=orno...

Reeza
Super User
If you're not familiar with the traveling salesman problem here's the Wikipedia link:
https://en.wikipedia.org/wiki/Travelling_salesman_problem
PeterClemmensen
Tourmaline | Level 20

Here is a solution along the lines of the logic you outline yourself

 

  • Read the entire data set into an ordered hash object h.
  • Create an additional hash object to keep track of the ids already encountered.
  • Traverse one item at the time.
  • If both IDs already exist in U, they have already been encountered. Therefore nothing happens. If not, we add them to U and output the observation to the desired data set.

First, verify that the approach works as desired on your sample data set. Then test if it works on your actual data. 2600 Obs is no problem for the hash to handle. Let me know if it works for you.

 

data have;
input ID_a $ ID_b $ var1 var2 $ var3 $;
cards;
A B 12 xxx yyy
A C 36 xxx yyy
C D 17 xxx yyy
D A 18 xxx yyy
D F 80 xxx yyy
G B 20 xxx yyy
G K 32 xxx yyy
;

data want(drop = id);
   dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
   h.definekey('var1', 'ID_a', 'ID_b');
   h.definedata(all : 'Y');
   h.definedone();
   dcl hiter i('h');

   dcl hash u();
   u.definekey('id');
   u.definedone();

   if 0 then set have;
   id = '        ';

   do while (i.next()=0);
      if u.check(key : ID_a) and u.check(key : ID_b) then do;
         u.ref(key : ID_a, data : ID_a);
         u.ref(key : ID_b, data : ID_b);
         output;
      end;
   end;
run;

 

Result:

 

ID_a  ID_b  var1  var2  var3 
A     B     12    xxx   yyy 
C     D     17    xxx   yyy 
G     K     32    xxx   yyy 

 

JelmerB
Calcite | Level 5

Thanks for sharing this solution! The code runs smoothly on the sample data set. However, in my real date set the ID_a and ID_b are numeric. I think this is the reason I get the following error: Variable ID_a has been defined as both character and numeric (and the same error for ID_b). What do I need to change to work around this?

PeterClemmensen
Tourmaline | Level 20

A small correction in the code below (changed the sample data to numeric IDs)

 

Let me know if it works 🙂

 

data have;
input ID_a ID_b var1 var2 $ var3 $;
cards;
1 2  12 xxx yyy
1 3  36 xxx yyy
3 4  17 xxx yyy
4 1  18 xxx yyy
4 6  80 xxx yyy
7 2  20 xxx yyy
7 11 32 xxx yyy
;

data want(drop = id);
   dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
   h.definekey('var1', 'ID_a', 'ID_b');
   h.definedata(all : 'Y');
   h.definedone();
   dcl hiter i('h');

   dcl hash u();
   u.definekey('id');
   u.definedone();

   if 0 then set have;
   id = .;

   do while (i.next()=0);
      if u.check(key : ID_a) and u.check(key : ID_b) then do;
         u.ref(key : ID_a, data : ID_a);
         u.ref(key : ID_b, data : ID_b);
         output;
      end;
   end;
run;

Result:

 

ID_a  ID_b  var1  var2  var3 
1     2     12    xxx   yyy 
3     4     17    xxx   yyy 
7     11    32    xxx   yyy 
JelmerB
Calcite | Level 5
This is working for me. Thanks for helping me out!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 815 views
  • 0 likes
  • 3 in conversation