Help using Base SAS procedures

String Grouping & Matching

Reply
Occasional Contributor
Posts: 13

String Grouping & Matching

Hello Everybody,

Need advice on the below problem.

Got two sets of data(more than 10 Million records) from two different sources.

                  Source1
XYZ LTD
XYZ Fin LTD
XYZ Ave Ltd

                 Source2
1/XYZ PTE LTD
XYZPTE LTD
ABCDE LTD.

These fields contain company/organization names.

What I need to do is

1) To create a group field for each of the source field which will be the common names for different rows. Example below

Source2Source2_Group
1/XYZ PTE LTDXYZ
XYZPTE LTDXYZ
ABCDE LTD.ABCDE

2) And to match Source1 to Source 2.

My thought process is if I am able to create a group for the 2 tables/sources it will be good way to match both the sources. .

I am looking for a very optimized way to process this query.

All suggestions are welcome.

Thanks in advance!

Regards,

Suvi

Super Contributor
Posts: 578

Re: String Grouping & Matching

I think your first problem would be to describe the logic you will need. 

Super User
Posts: 11,343

Re: String Grouping & Matching

Good luck in your quest. I had a project were out of about 1500 records there were 18 ways to spell IBM when the people collecting the data were told to enter IBM. The most interesting one was I>B>M>.

My approach would be:

1) get subsets of each data set containing the unique values. ADD a field to manipulate copying this data. The manipulated value is the one to compare with (try NEVER to lose the original value, just causes headaches)

2) decide on standards for relatively common abbreviations: LTD Ltd LTD. Ltd. ,LTD ,Ltd and choose a standard them

3) clean for obvious problems. Your 1/XYX PTE LTD looks like a candidate

4) reduce similar items to a single one

after preparing the data then worry about matching

5) Look for exact matches

after identifying the exact matches start looking at similar starting strings (COMPARE function)

6) the use some of the string comparison functions on remaining not matched such as  COMPGED or COMPLEV

I  merge two sets of values, SQL Cartesian product, with compged value and sort by that and mark matches manually but only after going through the other steps to reduce the values.

And after identifying the matches I want, I might be tempted to build a format from the RAW values to the cleaned match in one direction and then use that as a base the next time...

Respected Advisor
Posts: 4,173

Re: String Grouping & Matching

What you're looking for is Master Data Management (MDM) and creation of a Golden Record SAS MDM Solution

I you haven't this solution licensed then next best would be DataFlux (MDM is a solutions built on DataFlux). Have you a licence for this one (it comes for example as part of the Data Management Standard bundle). You then could use functions like dqmatch() or dqstandardize(). Someone would still need to maintain/amend the DataFlux QKB (Quality Knowledge Base) but you would have quite a good starting point as it already contains organisation names "out-of-the-box".

Else: Everything suggested. It will be a big task with 10M source rows.

Occasional Contributor
Posts: 13

Re: String Grouping & Matching

Thanks DBailey for looking into the problem.

So basically what I am looking for is a function similar to DQMATCH() available in Dataflux server.

I want to parse the names and group them and give them a GroupID. like below

Source2GroupID
 
1/XYZ PTE LTD
1
XYZPTE LTD1
ABCDE LTD.   2
Super Contributor
Posts: 578

Re: String Grouping & Matching

Have you looked at the function compged (which I believe stands for compute generalized edit distance)?

Ask a Question
Discussion stats
  • 5 replies
  • 322 views
  • 6 likes
  • 4 in conversation