Hello,
I am trying to create the best unique match between old customer accounts (X) and new customer accounts (Y). The "best" match is defined as the fewest number of days observed between account opening (Z) between X and Y. There should be a one-to-one relationship between X and Y, so once a given pair has been matched, the accounts in that pair should not be used to evaluate the best match for future pairs.
This seems like a shortest-path problem - creating the best set of pairs which minimizes the observed days difference. I was looking at hash objects as a solution, but I'm not sure if that is a viable because I need to reference both keys (X and Y) when evaluating if a match is acceptable.
data have;
input x y z;
datalines;
1 4 10
1 5 25
2 4 100
2 5 125
3 4 700
3 5 150
3 6 180
7 6 180
7 8 200
;
In the sample data above:
The output dataset would look something like this. I just want a clean set of pairs so that I can join the X value back to the source dataset and flag/list the match pair:
data want;
input x y;
datalines;
1 4 10
2 5 125
3 6 180
7 8 200
;
Any comments or suggestions are greatly appreciated!
If I understood your question.
data have;
input x y z;
datalines;
1 4 10
1 5 25
2 4 100
2 5 125
3 4 700
3 5 150
3 6 180
7 6 180
7 8 200
;
proc sort data=have out=temp; by z;run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h1(dataset:'have');
h1.definekey('x');
h1.definedone();
declare hash h2(dataset:'have');
h2.definekey('y');
h2.definedone();
end;
set temp;
if h1.check() = 0 and h2.check() = 0 then do;
output;
h1.remove();
h2.remove();
end;
run;
Since you are using "days" then I suggest using actual dates instead of this "number of days" which you haven't really identified where it might come from, possible two dates, a start date and and end date. Then compare the end of one to the start of another, only where the start is AFTER the end of the first. Then perhaps a lot of this:
- The best match for X2 would have been Y4, but I want to remove Y4 from matching consideration because it matched better with X1.
- X2 and Y5 is a better match than X3 and Y5, so I want to match X2 and Y5 and remove them from consideration.
- X3 could have matched with Y5, but since we have already matched Y4 and Y5, the next available option is Y6.
might not be needed.
And any such match is going to require a lot of manual intervention because I will bet that with any reasonable sized data set you will have clusters of values on certain dates such as Monday (processing stuff from the last week/previous days), first or last of a month (because that is when some step is actually performed or the way the "accounts" are set up such that they end on a month boundary). So I hope that you some additional data to use a tie breaker.
Do not be surprised if your actual data has overlaps where one account is started before another ends, start1<start2<end1<end2, or one duration is completely a subset of another Start1< start2<end2<end1 or even overlaps between 3 or more.
Hello,
The days are calculated as the absolute value of the difference between the end date of X (old account) and the start date of Y (new account). It is absolutely possible for the start date of Y to occur either before or after the end date of X due to the nature of how the accounts are created, which is why the absolute value is used. I already have my own set of rules for what is acceptable as # of days difference, which I would apply outside of the context of this example.
I have already determined that the accounts in the example are associated with each other through connected components analysis (using proc optnet - they are inherently linked by related application information). Given my confidence in that association, my primary concern is with optimally matching the pairs given their relative closeness in days. Any additional data not captured in my current consideration would render this optimization redundant because it would provide sufficient information to directly link pairs.
I am looking for help to develop some type of tree methodology that finds the best combination of pairs with a minimum total sum of "days", without any re-use or duplication of prior matched X and Y accounts. I included the duplicate situation as a possible rare outcome that the methodology would need to work around for determining the minimum total sum.
If I understood your question.
data have;
input x y z;
datalines;
1 4 10
1 5 25
2 4 100
2 5 125
3 4 700
3 5 150
3 6 180
7 6 180
7 8 200
;
proc sort data=have out=temp; by z;run;
data want;
if _n_=1 then do;
if 0 then set have;
declare hash h1(dataset:'have');
h1.definekey('x');
h1.definedone();
declare hash h2(dataset:'have');
h2.definekey('y');
h2.definedone();
end;
set temp;
if h1.check() = 0 and h2.check() = 0 then do;
output;
h1.remove();
h2.remove();
end;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.