About reagov

reagov · ‎01-26-2022

Hello, The days are calculated as the absolute value of the difference between the end date of X (old account) and the start date of Y (new account). It is absolutely possible for the start date of Y to occur either before or after the end date of X due to the nature of how the accounts are created, which is why the absolute value is used. I already have my own set of rules for what is acceptable as # of days difference, which I would apply outside of the context of this example. I have already determined that the accounts in the example are associated with each other through connected components analysis (using proc optnet - they are inherently linked by related application information). Given my confidence in that association, my primary concern is with optimally matching the pairs given their relative closeness in days. Any additional data not captured in my current consideration would render this optimization redundant because it would provide sufficient information to directly link pairs. I am looking for help to develop some type of tree methodology that finds the best combination of pairs with a minimum total sum of "days", without any re-use or duplication of prior matched X and Y accounts. I included the duplicate situation as a possible rare outcome that the methodology would need to work around for determining the minimum total sum.

reagov · ‎01-26-2022

Hello, I am trying to create the best unique match between old customer accounts (X) and new customer accounts (Y). The "best" match is defined as the fewest number of days observed between account opening (Z) between X and Y. There should be a one-to-one relationship between X and Y, so once a given pair has been matched, the accounts in that pair should not be used to evaluate the best match for future pairs. This seems like a shortest-path problem - creating the best set of pairs which minimizes the observed days difference. I was looking at hash objects as a solution, but I'm not sure if that is a viable because I need to reference both keys (X and Y) when evaluating if a match is acceptable. data have; input x y z; datalines; 1 4 10 1 5 25 2 4 100 2 5 125 3 4 700 3 5 150 3 6 180 7 6 180 7 8 200 ; In the sample data above: The best match for the old account (X) = 1 is the new account (Y) = 4, because it has by far the fewest number of days observed (Z) between them. The best match for X2 would have been Y4, but I want to remove Y4 from matching consideration because it matched better with X1. X2 and Y5 is a better match than X3 and Y5, so I want to match X2 and Y5 and remove them from consideration. X3 could have matched with Y5, but since we have already matched Y4 and Y5, the next available option is Y6. X7 is an interesting example because it could have "tied" with X3 when matching with Y6. There is no way to otherwise rank this pair, so assuming the algorithm is moving sequentially through a dataset sorted by X, it would be acceptable for X7 to receive the match despite the tie. Assuming X3 and Y6 are excluded from pairing, the only possible match for X7 is now Y8. The output dataset would look something like this. I just want a clean set of pairs so that I can join the X value back to the source dataset and flag/list the match pair: data want; input x y; datalines; 1 4 10 2 5 125 3 6 180 7 8 200 ; Any comments or suggestions are greatly appreciated!

Online Status	Offline
Date Last Visited	‎01-27-2022 04:28 PM

Re: Unique account matching based on date distance

Unique account matching based on date distance

Re: Unique account matching based on date distance

Unique account matching based on date distance