12-12-2014 12:48 AM
Need advice on the below problem.
Got two sets of data(more than 10 Million records) from two different sources.
|XYZ Fin LTD|
|XYZ Ave Ltd|
|1/XYZ PTE LTD|
These fields contain company/organization names.
What I need to do is
1) To create a group field for each of the source field which will be the common names for different rows. Example below
|1/XYZ PTE LTD||XYZ|
2) And to match Source1 to Source 2.
My thought process is if I am able to create a group for the 2 tables/sources it will be good way to match both the sources. .
I am looking for a very optimized way to process this query.
All suggestions are welcome.
Thanks in advance!
12-12-2014 10:26 AM
Good luck in your quest. I had a project were out of about 1500 records there were 18 ways to spell IBM when the people collecting the data were told to enter IBM. The most interesting one was I>B>M>.
My approach would be:
1) get subsets of each data set containing the unique values. ADD a field to manipulate copying this data. The manipulated value is the one to compare with (try NEVER to lose the original value, just causes headaches)
2) decide on standards for relatively common abbreviations: LTD Ltd LTD. Ltd. ,LTD ,Ltd and choose a standard them
3) clean for obvious problems. Your 1/XYX PTE LTD looks like a candidate
4) reduce similar items to a single one
after preparing the data then worry about matching
5) Look for exact matches
after identifying the exact matches start looking at similar starting strings (COMPARE function)
6) the use some of the string comparison functions on remaining not matched such as COMPGED or COMPLEV
I merge two sets of values, SQL Cartesian product, with compged value and sort by that and mark matches manually but only after going through the other steps to reduce the values.
And after identifying the matches I want, I might be tempted to build a format from the RAW values to the cleaned match in one direction and then use that as a base the next time...
12-12-2014 09:48 PM
What you're looking for is Master Data Management (MDM) and creation of a Golden Record SAS MDM Solution
I you haven't this solution licensed then next best would be DataFlux (MDM is a solutions built on DataFlux). Have you a licence for this one (it comes for example as part of the Data Management Standard bundle). You then could use functions like dqmatch() or dqstandardize(). Someone would still need to maintain/amend the DataFlux QKB (Quality Knowledge Base) but you would have quite a good starting point as it already contains organisation names "out-of-the-box".
12-15-2014 09:16 PM
Thanks DBailey for looking into the problem.
So basically what I am looking for is a function similar to DQMATCH() available in Dataflux server.
I want to parse the names and group them and give them a GroupID. like below
1/XYZ PTE LTD