Hello everyone,
I have a table with the calculated distance (var1) between customers (ID_a and ID_b). Var2 and var3 contain some other variables that are not relevant for the connection but I want to keep them. The data is sorted on ID_a and var1.
data have;
input (ID_a ID_b var1 var2 var3)($);
cards;
A B 12 xxx yyy
A C 36 xxx yyy
C D 17 xxx yyy
D A 18 xxx yyy
D F 80 xxx yyy
G B 20 xxx yyy
G K 32 xxx yyy
run;
Now I want to pair ID's with the smallest distance between them, and the ID may only be used once in the output. So the output should look like this:
data want;
input (ID_a ID_b var1 var2 var3)($);
cards;
A B 12 xxx yyy
C D 17 xxx yyy
G K 32 xxx yyy
run;
The goal is to get a list where every ID is paired to one other ID, with the closest possible distance. Of course there are multiple combinations, because A can be the closest to B, while at the same time B can be more close to C. If possible it would be nice to get the combinations with the lowest total distance.
The data set is ~2600 rows with 224 unique ID's. So the output should be 112 rows with a ID_a - ID_b combination.
What would be a solution to do this? I have been working on PROC SORT with NODUPKEY, but that is not going to work because each row consists of a unique combination. I am thinking about using a macro to select every first possibility of ID_a, while checking if the ID already has been used in ID_a or ID_b. If no, write the line to the output file and proceed to the next ID_a. If yes, then proceed to the second possibility/connection of ID_a and run the macro again. I can't find any example of such a routine, and I’m not sure this is the best solution. Maybe there are other ways to do this?
Thank you in advance for your suggestions!
A small correction in the code below (changed the sample data to numeric IDs)
Let me know if it works 🙂
data have;
input ID_a ID_b var1 var2 $ var3 $;
cards;
1 2 12 xxx yyy
1 3 36 xxx yyy
3 4 17 xxx yyy
4 1 18 xxx yyy
4 6 80 xxx yyy
7 2 20 xxx yyy
7 11 32 xxx yyy
;
data want(drop = id);
dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
h.definekey('var1', 'ID_a', 'ID_b');
h.definedata(all : 'Y');
h.definedone();
dcl hiter i('h');
dcl hash u();
u.definekey('id');
u.definedone();
if 0 then set have;
id = .;
do while (i.next()=0);
if u.check(key : ID_a) and u.check(key : ID_b) then do;
u.ref(key : ID_a, data : ID_a);
u.ref(key : ID_b, data : ID_b);
output;
end;
end;
run;
Result:
ID_a ID_b var1 var2 var3 1 2 12 xxx yyy 3 4 17 xxx yyy 7 11 32 xxx yyy
Here is a solution along the lines of the logic you outline yourself
First, verify that the approach works as desired on your sample data set. Then test if it works on your actual data. 2600 Obs is no problem for the hash to handle. Let me know if it works for you.
data have;
input ID_a $ ID_b $ var1 var2 $ var3 $;
cards;
A B 12 xxx yyy
A C 36 xxx yyy
C D 17 xxx yyy
D A 18 xxx yyy
D F 80 xxx yyy
G B 20 xxx yyy
G K 32 xxx yyy
;
data want(drop = id);
dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
h.definekey('var1', 'ID_a', 'ID_b');
h.definedata(all : 'Y');
h.definedone();
dcl hiter i('h');
dcl hash u();
u.definekey('id');
u.definedone();
if 0 then set have;
id = ' ';
do while (i.next()=0);
if u.check(key : ID_a) and u.check(key : ID_b) then do;
u.ref(key : ID_a, data : ID_a);
u.ref(key : ID_b, data : ID_b);
output;
end;
end;
run;
Result:
ID_a ID_b var1 var2 var3 A B 12 xxx yyy C D 17 xxx yyy G K 32 xxx yyy
Thanks for sharing this solution! The code runs smoothly on the sample data set. However, in my real date set the ID_a and ID_b are numeric. I think this is the reason I get the following error: Variable ID_a has been defined as both character and numeric (and the same error for ID_b). What do I need to change to work around this?
A small correction in the code below (changed the sample data to numeric IDs)
Let me know if it works 🙂
data have;
input ID_a ID_b var1 var2 $ var3 $;
cards;
1 2 12 xxx yyy
1 3 36 xxx yyy
3 4 17 xxx yyy
4 1 18 xxx yyy
4 6 80 xxx yyy
7 2 20 xxx yyy
7 11 32 xxx yyy
;
data want(drop = id);
dcl hash h(dataset : 'have', multidata : 'Y', ordered : 'A');
h.definekey('var1', 'ID_a', 'ID_b');
h.definedata(all : 'Y');
h.definedone();
dcl hiter i('h');
dcl hash u();
u.definekey('id');
u.definedone();
if 0 then set have;
id = .;
do while (i.next()=0);
if u.check(key : ID_a) and u.check(key : ID_b) then do;
u.ref(key : ID_a, data : ID_a);
u.ref(key : ID_b, data : ID_b);
output;
end;
end;
run;
Result:
ID_a ID_b var1 var2 var3 1 2 12 xxx yyy 3 4 17 xxx yyy 7 11 32 xxx yyy
Anytime, glad to help 🙂
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.