Match text strings in seperate datasets

Reply
New Contributor
Posts: 4

Match text strings in seperate datasets

I have a dataset with a list of organization, company names (dataset A) such as walmart, AT & T, Children's Place, Target, a dataset with popular people names (dataset B), and a dataset consisting of text like "Harry met Sally at Walmart." (dataset C)

 

How can I match text string "Walmart" from dataset C with company name in dataset A and replace it with "comp_name"?

I'd also like to match and replace "Harry" and "Sally" from dataset C with "people_name".

 

Thanks!

Grand Advisor
Posts: 17,424

Re: Match text strings in seperate datasets

Do you need to account for :

Wal-Mart vs walmart vs WALMART vs Wal Mart vs wal mart

 

Please post more sample code and data if you want code for help.

 

Otherwise, consider searching on here using the terms "Fuzzy matching"

 

 

Grand Advisor
Posts: 10,238

Re: Match text strings in seperate datasets

Also what would the final result look like?

 

Do any of your company names happen to be peoples names? I can think of multiple stores where all or part of the name could show up in your person name data. How do intend to handle that?

New Contributor
Posts: 4

Re: Match text strings in seperate datasets

Thank you all for replies and good questions that inspire me to think more.

 

First, I believe this is a fuzzy string match that ignore cases, blanks, etc.

 

Second, I think it's better to produce result like this:

Original text: "Harry met Sally at Walmart."

Editted text: "edit_name met edit_name at edit_name."

since there are cases when company names show up in person names.

 

The basic idea is to remove any people name, organization/company name, and ideally, location (street names or suffix/abbreation) from a sample text file. I built up a dictionary myself by compiling an excel file of company names and people names from census and other online sources, and then set up a sample text dataset with 10 records, 5 of which carry names and locations, 5 of which don't. I'd like to match the sample text dataset with my dictionaries and then replace them with "edit_name".

 

Any help on that?

Thanks!

 

 

 

 

Grand Advisor
Posts: 17,424

Re: Match text strings in seperate datasets

Quick way then, if you have a database to look into. 

 

Split the file into word components, using SCAN function to separate out the strings. 

 Result:

 

SentenceID Word

1 Harry

1 met

1 Sally

1 at

1 Walmart

2 Jack

2 and 

2 Jill

2 Went

2 Up 

2 a 

2 hill

 

Then merge this against your databases and categorize the components and then reform the data into sentences after if desired.

 

Also, if you have some money look into the Text Analytics available in SAS Enterprise Miner.

Valued Guide
Posts: 2,174

Re: Match text strings in seperate datasets

Rather than use the concept MERGE, think of solutions to this problem as a JOIN...
That leads you to PROC SQL with fuzzy logic in the ON conditions of the join
Ask a Question
Discussion stats
  • 5 replies
  • 139 views
  • 0 likes
  • 4 in conversation