02-10-2017 03:04 PM
I have a dataset with a list of organization, company names (dataset A) such as walmart, AT & T, Children's Place, Target, a dataset with popular people names (dataset B), and a dataset consisting of text like "Harry met Sally at Walmart." (dataset C)
How can I match text string "Walmart" from dataset C with company name in dataset A and replace it with "comp_name"?
I'd also like to match and replace "Harry" and "Sally" from dataset C with "people_name".
02-10-2017 03:22 PM
Do you need to account for :
Wal-Mart vs walmart vs WALMART vs Wal Mart vs wal mart
Please post more sample code and data if you want code for help.
Otherwise, consider searching on here using the terms "Fuzzy matching"
02-10-2017 03:31 PM
Also what would the final result look like?
Do any of your company names happen to be peoples names? I can think of multiple stores where all or part of the name could show up in your person name data. How do intend to handle that?
02-10-2017 04:22 PM
Thank you all for replies and good questions that inspire me to think more.
First, I believe this is a fuzzy string match that ignore cases, blanks, etc.
Second, I think it's better to produce result like this:
Original text: "Harry met Sally at Walmart."
Editted text: "edit_name met edit_name at edit_name."
since there are cases when company names show up in person names.
The basic idea is to remove any people name, organization/company name, and ideally, location (street names or suffix/abbreation) from a sample text file. I built up a dictionary myself by compiling an excel file of company names and people names from census and other online sources, and then set up a sample text dataset with 10 records, 5 of which carry names and locations, 5 of which don't. I'd like to match the sample text dataset with my dictionaries and then replace them with "edit_name".
Any help on that?
02-10-2017 04:26 PM
Quick way then, if you have a database to look into.
Split the file into word components, using SCAN function to separate out the strings.
Then merge this against your databases and categorize the components and then reform the data into sentences after if desired.
Also, if you have some money look into the Text Analytics available in SAS Enterprise Miner.