Hi,
I am working on the following problem.
The original dataset looks like this:
Group  Name
A      Cox & Wilson, PC, CPA
A      Cox & Wilson, PC, CPAs
A      Cox & Wilson, PC, CPAs
A      Jonathan D. Liner
A      Jonathan Liner
B      Memphis Light, Gas & Water
B      Memphis Light, Gas & Water
B      Memphis, Light, Gas and Water
C      Homer Electric Assn
C      Homer Electric Association
C      Homer Electric Association
What I want to do is to categorize similar names within each group. The dataset I want looks like this:
Group  Name                           NameGroup
A      Cox & Wilson, PC, CPA          1
A      Cox & Wilson, PC, CPAs         1
A      Cox & Wilson, PC, CPAs         1
A      Jonathan D. Liner              2
A      Jonathan Liner                 2
B      Memphis Light, Gas & Water     3
B      Memphis Light, Gas & Water     3
B      Memphis, Light, Gas and Water  3
C      Homer Electric Assn            4
C      Homer Electric Association     4
C      Homer Electric Association     4
My initial thought is to compute the string distance (Levenshtein) between each pair of names within a group and then apply a clustering method. However, I see two difficulties:
1. I have a lot of observations, so computing the distance for every pair, even within a single group, can be time-consuming.
2. The clustering method I am familiar with is K-means, but K-means requires specifying the number of clusters in advance, and I do not know how many distinct names each group contains. I am wondering whether there is a clustering method better suited to this problem.
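To make the idea concrete, here is a minimal sketch in Python of what I have in mind. It is not K-means: it links any two names in the same group whose normalized Levenshtein similarity reaches a threshold and treats each connected set of names as one NameGroup, so the number of clusters does not have to be known beforehand. It also compares only the distinct names in each group, which should reduce the number of pairs. The function names and the threshold of 0.7 are my own assumptions for illustration; the threshold would need tuning on real data.

```python
def levenshtein(a, b):
    """Edit distance computed with the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Levenshtein distance rescaled to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def assign_name_groups(rows, threshold=0.7):
    """rows: iterable of (group, name). Returns [(group, name, name_group)].

    The threshold is an assumed value, not a recommended one.
    """
    rows = list(rows)

    # Difficulty 1: compare only the distinct names inside each group.
    distinct = {}
    for group, name in rows:
        distinct.setdefault(group, {})[name] = None

    parent = {}  # union-find forest keyed by (group, name)

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for group, names in distinct.items():
        names = list(names)
        for name in names:
            parent[(group, name)] = (group, name)
        # Difficulty 2: link every pair at or above the threshold,
        # so no cluster count is needed up front.
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                if similarity(first, second) >= threshold:
                    parent[find((group, first))] = find((group, second))

    labels = {}  # cluster root -> running NameGroup number
    result = []
    for group, name in rows:
        root = find((group, name))
        label = labels.setdefault(root, len(labels) + 1)
        result.append((group, name, label))
    return result


rows = [
    ("A", "Cox & Wilson, PC, CPA"),
    ("A", "Cox & Wilson, PC, CPAs"),
    ("A", "Cox & Wilson, PC, CPAs"),
    ("A", "Jonathan D. Liner"),
    ("A", "Jonathan Liner"),
    ("B", "Memphis Light, Gas & Water"),
    ("B", "Memphis Light, Gas & Water"),
    ("B", "Memphis, Light, Gas and Water"),
    ("C", "Homer Electric Assn"),
    ("C", "Homer Electric Association"),
    ("C", "Homer Electric Association"),
]
labeled = assign_name_groups(rows)
```

On the sample rows above this assigns the same NameGroup numbers as in the table I want. One caveat I am aware of: linking by threshold can chain names together (A is close to B, B is close to C, so A and C end up in the same cluster even if they are not similar to each other), so it may merge too much on messier data.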
Is there a more convenient way to do this?
It would be great if someone could help out here.