Hello everybody,
I need advice on the problem below.
I have two sets of data (more than 10 million records) from two different sources:
Source1:
XYZ LTD
XYZ Fin LTD
XYZ Ave Ltd

Source2:
1/XYZ PTE LTD
XYZPTE LTD
ABCDE LTD.
These fields contain company/organization names.
What I need to do is:
1) Create a group field for each source field that holds the common name for different rows. Example below:
Source2        | Source2_Group
1/XYZ PTE LTD  | XYZ
XYZPTE LTD     | XYZ
ABCDE LTD.     | ABCDE
2) Match Source1 to Source2.
My thinking is that if I can create a group for the two tables/sources, it will give me a good way to match them.
I am looking for a very optimized way to process this.
All suggestions are welcome.
Thanks in advance!
Regards,
Suvi
I think your first problem will be to describe the matching logic you need.
Good luck in your quest. I had a project where, out of about 1500 records, there were 18 ways to spell IBM, even though the people collecting the data were told to enter IBM. The most interesting one was I>B>M>.
My approach would be:
1) Get subsets of each data set containing the unique values. Add a field and copy the data into it for manipulation; the manipulated value is the one to compare with (try NEVER to lose the original value, as that just causes headaches).
2) Decide on standards for relatively common abbreviations (LTD, Ltd, LTD., Ltd., ,LTD, ,Ltd) and pick one standard form for them.
3) Clean up obvious problems. Your 1/XYZ PTE LTD looks like a candidate.
4) Reduce similar items to a single one.
Only after preparing the data should you worry about matching.
5) Look for exact matches.
After identifying the exact matches, start looking at similar starting strings (the COMPARE function).
6) Then use string comparison functions such as COMPGED or COMPLEV on the remaining unmatched values.
I merge the two sets of values with an SQL Cartesian product, compute the COMPGED value, sort by it, and mark matches manually, but only after going through the other steps to reduce the number of values.
And after identifying the matches I want, I might be tempted to build a format from the raw values to the cleaned match in one direction and then use that as a base next time...
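The steps above can be sketched roughly as follows. This is only an illustration, not tested code: it assumes the names live in WORK.SOURCE1 and WORK.SOURCE2 in a variable called NAME, that a table S2 has been prepared the same way as S1, and that the TRANWRD substitutions and the COMPGED cutoff would need tuning against the real data.

```
/* 1) Unique values, keeping the original alongside a working copy */
proc sql;
  create table s1 as
  select distinct name, name as clean_name length=200
  from source1;
quit;

/* 2+3) Standardize: upcase, strip punctuation, pick one LTD spelling */
data s1;
  set s1;
  clean_name = upcase(clean_name);
  clean_name = compress(clean_name, './,');             /* drop obvious noise */
  clean_name = tranwrd(clean_name, ' LIMITED', ' LTD'); /* one standard form  */
  clean_name = compbl(strip(clean_name));               /* collapse blanks    */
run;

/* 5) Exact matches first (not shown), then ...                */
/* 6) ... fuzzy-score the remainder with a Cartesian product.  */
proc sql outobs=1000000;  /* cap output rows while experimenting */
  create table scored as
  select a.name as name1, b.name as name2,
         compged(a.clean_name, b.clean_name) as ged
  from s1 a, s2 b
  where calculated ged < 200   /* cutoff is data-dependent */
  order by ged;
quit;
```

Sorting by the GED score puts the most plausible pairs at the top, which is what makes the manual marking pass feasible.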
What you're looking for is Master Data Management (MDM) and the creation of a Golden Record: the SAS MDM solution.
If you don't have that solution licensed, then the next best thing would be DataFlux (SAS MDM is a solution built on DataFlux). Do you have a licence for it? It comes, for example, as part of the Data Management Standard bundle. You could then use functions like dqmatch() or dqstandardize(). Someone would still need to maintain/amend the DataFlux QKB (Quality Knowledge Base), but you would have quite a good starting point, as it already contains organisation names out of the box.
Otherwise: everything already suggested. It will be a big task with 10M source rows.
Thanks, DBailey, for looking into the problem.
So basically what I am looking for is a function similar to DQMATCH(), which is available on a DataFlux server.
I want to parse the names, group them, and give them a GroupID, like below:
Source2        | GroupID
1/XYZ PTE LTD  | 1
XYZPTE LTD     | 1
ABCDE LTD.     | 2
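Without DataFlux, one crude way to get a GroupID is to derive a match key from the cleaned name and number the keys. The sketch below is hypothetical and untested: it assumes the data is in WORK.SOURCE2 with the name in a variable called NAME, and the suffix pattern (PTE, LTD) is specific to this sample and would need extending for real data. Note the GroupID values depend on sort order, so they may not come out in the order shown above.

```
data keyed;
  set source2;
  length key $100;
  key = compress(upcase(name), , 'ka');      /* keep letters only        */
  key = prxchange('s/(PTE)?LTD$//', 1, key); /* drop trailing PTE LTD    */
run;

proc sort data=keyed;
  by key;
run;

data grouped;
  set keyed;
  by key;
  if first.key then groupid + 1;  /* sequential GroupID per key */
run;
```

With this key, both 1/XYZ PTE LTD and XYZPTE LTD reduce to XYZ and land in the same group, while ABCDE LTD. reduces to ABCDE and gets its own.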
Have you looked at the COMPGED function (which, I believe, stands for COMPute Generalized Edit Distance)?
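To get a quick feel for COMPGED, you can score a few of the sample pairs by hand (0 means identical; larger scores mean more editing is needed, and usable cutoffs are data-dependent):

```
data _null_;
  g1 = compged('XYZ PTE LTD', 'XYZPTE LTD'); /* near match: small score  */
  g2 = compged('XYZ PTE LTD', 'ABCDE LTD');  /* non-match: larger score  */
  put g1= g2=;
run;
```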