Solved: Cluster similar strings into group

daradanye · Posted 05-03-2021 08:34 PM

Hi,

I am working on something like the following.

For a dataset, the original dataset looks like this:

Group	Name
A	Cox & Wilson, PC, CPA
A	Cox & Wilson, PC, CPAs
A	Cox & Wilson, PC, CPAs
A	Jonathan D. Liner
A	Jonathan Liner
B	Memphis Light, Gas & Water
B	Memphis Light, Gas & Water
B	Memphis, Light, Gas and Water
C	Homer Electric Assn
C	Homer Electric Association
C	Homer Electric Association

What I want to do is to categorize similar names within each group. The dataset I want looks like this:

Group	Name	NameGroup
A	Cox & Wilson, PC, CPA	1
A	Cox & Wilson, PC, CPAs	1
A	Cox & Wilson, PC, CPAs	1
A	Jonathan D. Liner	2
A	Jonathan Liner	2
B	Memphis Light, Gas & Water	3
B	Memphis Light, Gas & Water	3
B	Memphis, Light, Gas and Water	3
C	Homer Electric Assn	4
C	Homer Electric Association	4
C	Homer Electric Association	4

My initial thought is to compute string distance (Levenshtein) between each pair of two words and then use some cluster methods. Then I realized there might be some difficulties:

1. First, I have a lot of observations. Computing each pair even within the group can be time-consuming.

2. The cluster method I am familiar with is K-means. However, K-means requires prespecifying number of groups. I am wondering if there are any cluster methods that work better for this problem.

I am wondering if there are any more convenient ways to do that.

That would be great if someone can help out here.

andreas_lds · Posted 05-04-2021 02:49 AM

If you do have a lot of obs anything will be time-consuming.

The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.

Cleaning = use compress() + lowcase() + compbl()

Here is an idea:

data have;
   length Group $ 1 Name $ 100;
   input Group Name &;

   datalines;
A  Cox & Wilson, PC, CPA
A  Cox & Wilson, PC, CPAs
A  Cox & Wilson, PC, CPAs
A  Jonathan D. Liner
A  Jonathan Liner
B  Memphis Light, Gas & Water
B  Memphis Light, Gas & Water
B  Memphis, Light, Gas and Water
C  Homer Electric Assn
C  Homer Electric Association
C  Homer Electric Association
;

data cleaned;
   set have;
   
   length NameCompare $ 100;
   
   NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;


data want;
   set cleaned;
   by Group;
   
   length 
      NameGroup 8
      LastName $ 100
      Distance 8
   ;
   
   retain NameGroup 0;
      
   LastName = lag(NameCompare);
   
   if first.Group then do;
      NameGroup = NameGroup + 1;
      LastName = ' ';      
   end;
   else do;
      Distance = complev(NameCompare, LastName);
      
      /* finding the right threshold value will be interesting */
      if Distance > 10 then do;
         NameGroup = NameGroup + 1;
      end;
   end;
run;

View solution in original post

andreas_lds · Posted 05-04-2021 02:49 AM

If you do have a lot of obs anything will be time-consuming.

The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.

Cleaning = use compress() + lowcase() + compbl()

Here is an idea:

data have;
   length Group $ 1 Name $ 100;
   input Group Name &;

   datalines;
A  Cox & Wilson, PC, CPA
A  Cox & Wilson, PC, CPAs
A  Cox & Wilson, PC, CPAs
A  Jonathan D. Liner
A  Jonathan Liner
B  Memphis Light, Gas & Water
B  Memphis Light, Gas & Water
B  Memphis, Light, Gas and Water
C  Homer Electric Assn
C  Homer Electric Association
C  Homer Electric Association
;

data cleaned;
   set have;
   
   length NameCompare $ 100;
   
   NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;


data want;
   set cleaned;
   by Group;
   
   length 
      NameGroup 8
      LastName $ 100
      Distance 8
   ;
   
   retain NameGroup 0;
      
   LastName = lag(NameCompare);
   
   if first.Group then do;
      NameGroup = NameGroup + 1;
      LastName = ' ';      
   end;
   else do;
      Distance = complev(NameCompare, LastName);
      
      /* finding the right threshold value will be interesting */
      if Distance > 10 then do;
         NameGroup = NameGroup + 1;
      end;
   end;
run;