Hello everybody,
I need advice on the problem below.
I have two sets of data (more than 10 million records) from two different sources:
Source1:
XYZ LTD
XYZ Fin LTD
XYZ Ave Ltd

Source2:
1/XYZ PTE LTD
XYZPTE LTD
ABCDE LTD.
These fields contain company/organization names.
What I need to do is:
1) Create a group field for each source field that holds the common name for different rows. Example below:
Source2        | Source2_Group
1/XYZ PTE LTD  | XYZ
XYZPTE LTD     | XYZ
ABCDE LTD.     | ABCDE
2) Match Source1 to Source2.
My thinking is that if I can create a group for the two tables/sources, it will give me a good way to match them.
I am looking for a very optimized way to process this.
All suggestions are welcome.
Thanks in advance!
Regards,
Suvi
I think your first problem will be to describe the matching logic you need.
Good luck in your quest. I had a project where, out of about 1500 records, there were 18 ways to spell IBM, even though the people collecting the data were told to enter IBM. The most interesting one was I>B>M>.
My approach would be:
1) Get subsets of each data set containing the unique values. Add a field and copy the data into it for manipulation; the manipulated value is the one to compare with (try NEVER to lose the original value, as that just causes headaches).
2) Decide on standards for relatively common abbreviations (LTD, Ltd, LTD., Ltd., ,LTD, ,Ltd) and pick one standard form for them.
3) Clean up obvious problems. Your 1/XYZ PTE LTD looks like a candidate.
4) Reduce similar items to a single one.
Only after preparing the data should you worry about matching.
5) Look for exact matches.
After identifying the exact matches, start looking at similar starting strings (the COMPARE function).
6) Then use string comparison functions such as COMPGED or COMPLEV on the remaining unmatched values.
I merge the two sets of values with an SQL Cartesian product, compute the COMPGED value, sort by it, and mark matches manually, but only after going through the other steps to reduce the number of values.
And after identifying the matches I want, I might be tempted to build a format from the raw values to the cleaned match in one direction and then use that as a base next time...
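The steps above can be sketched roughly as follows. This is only an illustration, not tested code: it assumes the names live in WORK.SOURCE1 and WORK.SOURCE2 in a variable called NAME, that a table S2 has been prepared the same way as S1, and that the TRANWRD substitutions and the COMPGED cutoff would need tuning against the real data.

```
/* 1) Unique values, keeping the original alongside a working copy */
proc sql;
  create table s1 as
  select distinct name, name as clean_name length=200
  from source1;
quit;

/* 2+3) Standardize: upcase, strip punctuation, pick one LTD spelling */
data s1;
  set s1;
  clean_name = upcase(clean_name);
  clean_name = compress(clean_name, './,');             /* drop obvious noise */
  clean_name = tranwrd(clean_name, ' LIMITED', ' LTD'); /* one standard form  */
  clean_name = compbl(strip(clean_name));               /* collapse blanks    */
run;

/* 5) Exact matches first (not shown), then ...                */
/* 6) ... fuzzy-score the remainder with a Cartesian product.  */
proc sql outobs=1000000;  /* cap output rows while experimenting */
  create table scored as
  select a.name as name1, b.name as name2,
         compged(a.clean_name, b.clean_name) as ged
  from s1 a, s2 b
  where calculated ged < 200   /* cutoff is data-dependent */
  order by ged;
quit;
```

Sorting by the GED score puts the most plausible pairs at the top, which is what makes the manual marking pass feasible.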
What you're looking for is Master Data Management (MDM) and the creation of a Golden Record: the SAS MDM solution.
If you don't have that solution licensed, then the next best thing would be DataFlux (SAS MDM is a solution built on DataFlux). Do you have a licence for it? It comes, for example, as part of the Data Management Standard bundle. You could then use functions like dqmatch() or dqstandardize(). Someone would still need to maintain/amend the DataFlux QKB (Quality Knowledge Base), but you would have quite a good starting point, as it already contains organisation names out of the box.
Otherwise: everything already suggested. It will be a big task with 10M source rows.
Thanks, DBailey, for looking into the problem.
So basically what I am looking for is a function similar to DQMATCH(), which is available on a DataFlux server.
I want to parse the names, group them, and give them a GroupID, like below:
Source2        | GroupID
1/XYZ PTE LTD  | 1
XYZPTE LTD     | 1
ABCDE LTD.     | 2
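Without DataFlux, one crude way to get a GroupID is to derive a match key from the cleaned name and number the keys. The sketch below is hypothetical and untested: it assumes the data is in WORK.SOURCE2 with the name in a variable called NAME, and the suffix pattern (PTE, LTD) is specific to this sample and would need extending for real data. Note the GroupID values depend on sort order, so they may not come out in the order shown above.

```
data keyed;
  set source2;
  length key $100;
  key = compress(upcase(name), , 'ka');      /* keep letters only        */
  key = prxchange('s/(PTE)?LTD$//', 1, key); /* drop trailing PTE LTD    */
run;

proc sort data=keyed;
  by key;
run;

data grouped;
  set keyed;
  by key;
  if first.key then groupid + 1;  /* sequential GroupID per key */
run;
```

With this key, both 1/XYZ PTE LTD and XYZPTE LTD reduce to XYZ and land in the same group, while ABCDE LTD. reduces to ABCDE and gets its own.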
Have you looked at the COMPGED function (which, I believe, stands for COMPute Generalized Edit Distance)?
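To get a quick feel for COMPGED, you can score a few of the sample pairs by hand (0 means identical; larger scores mean more editing is needed, and usable cutoffs are data-dependent):

```
data _null_;
  g1 = compged('XYZ PTE LTD', 'XYZPTE LTD'); /* near match: small score  */
  g2 = compged('XYZ PTE LTD', 'ABCDE LTD');  /* non-match: larger score  */
  put g1= g2=;
run;
```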