Dear All,
Hope you are well.
I have a question regarding match validation. Our objective is to match dataset 1 to dataset 2 using variables such as Name, Surname, etc. Dataset 2 is our master dataset.
Our process works as follows:
1. Data processing (clean the variables and get rid of what we aren’t using).
2. Determine which keys are available on the input to use as match keys (16).
3. Create every combination of the available keys.
4. Calculate the uniqueness of the keys to decide which are taken through for matching. (The uniqueness is the total number of unique values for each key divided by the size of the master dataset.) If a match key's uniqueness is over the set threshold (95%), that key is taken through to the matching process. Because the master dataset is very large, we use a 20% sample of it at this stage but match to the whole master dataset in the matching process.
5. Matching process, which uses 3 tables:
   - Keys table (all keys we have used for matching)
   - Matched dataset (all records we have matched)
   - Unmatched dataset (all records we haven’t matched)
We take each key and loop round the master dataset to get all unique matches on that key. When we find a match we exclude it from the unmatched dataset and insert into the matched dataset. We keep iterating until all keys are exhausted or we run out of input data. On the first iteration all records in the unmatched dataset are records on the input file.
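To make the process above concrete, here is a minimal Python sketch of the key selection and the iterative matching. It is only an illustration of the steps as described, not our production code. The function names (`key_value`, `uniqueness`, `select_keys`, `match`) and the dict-based record layout are hypothetical. Two things are assumptions on my part: keys are tried from most unique to least unique, and a "unique match" means exactly one master record shares the key value.

```python
from itertools import combinations


def key_value(record, fields):
    """Build a composite key value from the named fields of a record (a dict)."""
    return tuple(record[f] for f in fields)


def uniqueness(master, fields):
    """Number of distinct key values divided by the number of master records."""
    if not master:
        return 0.0
    return len({key_value(r, fields) for r in master}) / len(master)


def select_keys(master_sample, fields, threshold=0.95):
    """Score every combination of fields and keep those over the threshold.

    Returns (uniqueness, fields) pairs, most unique first (ordering is an assumption).
    """
    scored = []
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            u = uniqueness(master_sample, combo)
            if u > threshold:
                scored.append((u, combo))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def match(input_records, master, keys):
    """Iterate over the keys, moving uniquely matched records out of the unmatched set."""
    unmatched = list(input_records)  # first iteration: every input record
    matched = []
    for _, fields in keys:
        if not unmatched:
            break  # ran out of input data
        # Index the whole master dataset by this key.
        index = {}
        for pos, rec in enumerate(master):
            index.setdefault(key_value(rec, fields), []).append(pos)
        still_unmatched = []
        for rec in unmatched:
            hits = index.get(key_value(rec, fields), [])
            if len(hits) == 1:  # assumption: exactly one master record counts as a unique match
                matched.append((rec, master[hits[0]], fields))
            else:
                still_unmatched.append(rec)
        unmatched = still_unmatched
    return matched, unmatched
```

In our setup, `select_keys` would be run on the 20% sample of the master dataset, and `match` on the full master dataset. Recording the key used for each match (the third element of each matched tuple) makes it possible to review matches key by key later.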
The problem with deterministic matching is validating the matches. We expect that if we set a threshold of 95%, then there is a 5% chance of error (when the last match key used is 95% unique).
Does anyone have any ideas on the best approach to validate the results, to confirm they are correct, and to provide an estimate of what percentage of matches are correct?
Your help would be much appreciated.
Many Thanks