BookmarkSubscribeRSS Feed
suvi107
Calcite | Level 5

Hello Everybody,

Need advice on the below problem.

Got two sets of data(more than 10 Million records) from two different sources.

                  Source1
XYZ LTD
XYZ Fin LTD
XYZ Ave Ltd

                 Source2
1/XYZ PTE LTD
XYZPTE LTD
ABCDE LTD.

These fields contain company/organization names.

What I need to do is

1) To create a group field for each of the source field which will be the common names for different rows. Example below

Source2Source2_Group
1/XYZ PTE LTDXYZ
XYZPTE LTDXYZ
ABCDE LTD.ABCDE

2) And to match Source1 to Source 2.

My thought process is if I am able to create a group for the 2 tables/sources it will be good way to match both the sources. .

I am looking for a very optimized way to process this query.

All suggestions are welcome.

Thanks in advance!

Regards,

Suvi

5 REPLIES 5
DBailey
Lapis Lazuli | Level 10

I think your first problem would be to describe the logic you will need. 

ballardw
Super User

Good luck in your quest. I had a project were out of about 1500 records there were 18 ways to spell IBM when the people collecting the data were told to enter IBM. The most interesting one was I>B>M>.

My approach would be:

1) get subsets of each data set containing the unique values. ADD a field to manipulate copying this data. The manipulated value is the one to compare with (try NEVER to lose the original value, just causes headaches)

2) decide on standards for relatively common abbreviations: LTD Ltd LTD. Ltd. ,LTD ,Ltd and choose a standard them

3) clean for obvious problems. Your 1/XYX PTE LTD looks like a candidate

4) reduce similar items to a single one

after preparing the data then worry about matching

5) Look for exact matches

after identifying the exact matches start looking at similar starting strings (COMPARE function)

6) the use some of the string comparison functions on remaining not matched such as  COMPGED or COMPLEV

I  merge two sets of values, SQL Cartesian product, with compged value and sort by that and mark matches manually but only after going through the other steps to reduce the values.

And after identifying the matches I want, I might be tempted to build a format from the RAW values to the cleaned match in one direction and then use that as a base the next time...

Patrick
Opal | Level 21

What you're looking for is Master Data Management (MDM) and creation of a Golden Record SAS MDM Solution

I you haven't this solution licensed then next best would be DataFlux (MDM is a solutions built on DataFlux). Have you a licence for this one (it comes for example as part of the Data Management Standard bundle). You then could use functions like dqmatch() or dqstandardize(). Someone would still need to maintain/amend the DataFlux QKB (Quality Knowledge Base) but you would have quite a good starting point as it already contains organisation names "out-of-the-box".

Else: Everything suggested. It will be a big task with 10M source rows.

suvi107
Calcite | Level 5

Thanks DBailey for looking into the problem.

So basically what I am looking for is a function similar to DQMATCH() available in Dataflux server.

I want to parse the names and group them and give them a GroupID. like below

Source2GroupID
 
1/XYZ PTE LTD
1
XYZPTE LTD1
ABCDE LTD.   2
DBailey
Lapis Lazuli | Level 10

Have you looked at the function compged (which I believe stands for compute generalized edit distance)?

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 954 views
  • 6 likes
  • 4 in conversation