BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
reagov
Calcite | Level 5

Hello,

 

I am trying to create the best unique match between old customer accounts (X) and new customer accounts (Y). The "best" match is defined as the fewest number of days observed between account opening (Z) between X and Y. There should be a one-to-one relationship between X and Y, so once a given pair has been matched, the accounts in that pair should not be used to evaluate the best match for future pairs. 

 

This seems like a shortest-path problem - creating the best set of pairs which minimizes the observed days difference. I was looking at hash objects as a solution, but I'm not sure if that is a viable because I need to reference both keys (X and Y) when evaluating if a match is acceptable.

 

 

data have;
input x y z;
datalines;
1 4 10
1 5 25
2 4 100
2 5 125
3 4 700
3 5 150
3 6 180
7 6 180
7 8 200
;

 

In the sample data above:

  1. The best match for the old account (X) = 1 is the new account (Y) = 4, because it has by far the fewest number of days observed (Z) between them. 
  2. The best match for X2 would have been Y4, but I want to remove Y4 from matching consideration because it matched better with X1.
    • X2 and Y5 is a better match than X3 and Y5, so I want to match X2 and Y5 and remove them from consideration.
  3. X3 could have matched with Y5, but since we have already matched Y4 and Y5, the next available option is Y6.
  4. X7 is an interesting example because it could have "tied" with X3 when matching with Y6. There is no way to otherwise rank this pair, so assuming the algorithm is moving sequentially through a dataset sorted by X, it would be acceptable for X7 to receive the match despite the tie.
  5. Assuming X3 and Y6 are excluded from pairing, the only possible match for X7 is now Y8.

 

The output dataset would look something like this. I just want a clean set of pairs so that I can join the X value back to the source dataset and flag/list the match pair:

 

data want;
input x y;
datalines;
1 4 10
2 5 125
3 6 180
7 8 200
;

 

 

Any comments or suggestions are greatly appreciated!

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

If I understood your question.

 

 

data have;
input x y z;
datalines;
1 4 10
1 5 25
2 4 100
2 5 125
3 4 700
3 5 150
3 6 180
7 6 180
7 8 200
;



proc sort data=have out=temp; by z;run;

data want;
 if _n_=1 then do;
  if 0 then set have;
  declare hash h1(dataset:'have');
  h1.definekey('x');
  h1.definedone();

  declare hash h2(dataset:'have');
  h2.definekey('y');
  h2.definedone();
 end;
set temp;
if h1.check() = 0 and h2.check() = 0 then do;
  output;
  h1.remove();
  h2.remove();
end;
run;

View solution in original post

3 REPLIES 3
ballardw
Super User

Since you are using "days" then I suggest using actual dates instead of this "number of days" which you haven't really identified where it might come from, possible two dates, a start date and and end date. Then compare the end of one to the start of another, only where the start is AFTER the end of the first. Then perhaps a lot of this:

 

 

  • The best match for X2 would have been Y4, but I want to remove Y4 from matching consideration because it matched better with X1.
    • X2 and Y5 is a better match than X3 and Y5, so I want to match X2 and Y5 and remove them from consideration.
  • X3 could have matched with Y5, but since we have already matched Y4 and Y5, the next available option is Y6.

 

might not be needed.

 

And any such match is going to require a lot of manual intervention because I will bet that with any reasonable sized data set you will have clusters of values on certain dates such as Monday (processing stuff from the last week/previous days), first or last of a month (because that is when some step is actually performed or the way the "accounts" are set up such that they end on a month boundary). So I hope that you some additional data to use a tie breaker.

 

Do not be surprised if your actual data has overlaps where one account is started before another ends, start1<start2<end1<end2, or one duration is completely a subset of another  Start1< start2<end2<end1 or even overlaps between 3 or more.

 

reagov
Calcite | Level 5

Hello,

 

The days are calculated as the absolute value of the difference between the end date of X (old account) and the start date of Y (new account). It is absolutely possible for the start date of Y to occur either before or after the end date of X due to the nature of how the accounts are created, which is why the absolute value is used. I already have my own set of rules for what is acceptable as # of days difference, which I would apply outside of the context of this example.

 

I have already determined that the accounts in the example are associated with each other through connected components analysis (using proc optnet -  they are inherently linked by related application information). Given my confidence in that association, my primary concern is with optimally matching the pairs given their relative closeness in days. Any additional data not captured in my current consideration would render this optimization redundant because it would provide sufficient information to directly link pairs.

 

I am looking for help to develop some type of tree methodology that finds the best combination of pairs with a minimum total sum of "days", without any re-use or duplication of prior matched X and Y accounts. I included the duplicate situation as a possible rare outcome that the methodology would need to work around for determining the minimum total sum.

 

Ksharp
Super User

If I understood your question.

 

 

data have;
input x y z;
datalines;
1 4 10
1 5 25
2 4 100
2 5 125
3 4 700
3 5 150
3 6 180
7 6 180
7 8 200
;



proc sort data=have out=temp; by z;run;

data want;
 if _n_=1 then do;
  if 0 then set have;
  declare hash h1(dataset:'have');
  h1.definekey('x');
  h1.definedone();

  declare hash h2(dataset:'have');
  h2.definekey('y');
  h2.definedone();
 end;
set temp;
if h1.check() = 0 and h2.check() = 0 then do;
  output;
  h1.remove();
  h2.remove();
end;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 897 views
  • 1 like
  • 3 in conversation