Hello, I am trying to create the best unique match between old customer accounts (X) and new customer accounts (Y). The "best" match is defined as the fewest number of days observed between account opening (Z) between X and Y. There should be a one-to-one relationship between X and Y, so once a given pair has been matched, the accounts in that pair should not be used to evaluate the best match for future pairs. This seems like a shortest-path problem - creating the best set of pairs which minimizes the observed days difference. I was looking at hash objects as a solution, but I'm not sure if that is a viable because I need to reference both keys (X and Y) when evaluating if a match is acceptable. data have;
input x y z;
datalines;
1 4 10
1 5 25
2 4 100
2 5 125
3 4 700
3 5 150
3 6 180
7 6 180
7 8 200
; In the sample data above: The best match for the old account (X) = 1 is the new account (Y) = 4, because it has by far the fewest number of days observed (Z) between them. The best match for X2 would have been Y4, but I want to remove Y4 from matching consideration because it matched better with X1. X2 and Y5 is a better match than X3 and Y5, so I want to match X2 and Y5 and remove them from consideration. X3 could have matched with Y5, but since we have already matched Y4 and Y5, the next available option is Y6. X7 is an interesting example because it could have "tied" with X3 when matching with Y6. There is no way to otherwise rank this pair, so assuming the algorithm is moving sequentially through a dataset sorted by X, it would be acceptable for X7 to receive the match despite the tie. Assuming X3 and Y6 are excluded from pairing, the only possible match for X7 is now Y8. The output dataset would look something like this. I just want a clean set of pairs so that I can join the X value back to the source dataset and flag/list the match pair: data want;
input x y;
datalines;
1 4 10
2 5 125
3 6 180
7 8 200
; Any comments or suggestions are greatly appreciated!
... View more