BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
daradanye
Obsidian | Level 7

Hi,

 

I am working on something like the following.

 

For a dataset, the original dataset looks like this:

Group Name
A Cox & Wilson, PC, CPA
A Cox & Wilson, PC, CPAs
A Cox & Wilson, PC, CPAs
A Jonathan D. Liner
A Jonathan Liner
B Memphis Light, Gas & Water
B Memphis Light, Gas & Water
B Memphis, Light, Gas and Water
C Homer Electric Assn
C Homer Electric Association
C Homer Electric Association

 

What I want to do is to categorize similar names within each group.  The dataset I want looks like this:

Group Name NameGroup
A Cox & Wilson, PC, CPA 1
A Cox & Wilson, PC, CPAs 1
A Cox & Wilson, PC, CPAs 1
A Jonathan D. Liner 2
A Jonathan Liner 2
B Memphis Light, Gas & Water 3
B Memphis Light, Gas & Water 3
B Memphis, Light, Gas and Water 3
C Homer Electric Assn 4
C Homer Electric Association 4
C Homer Electric Association 4

 

 

My initial thought is to compute string distance (Levenshtein) between each pair of two words and then use some cluster methods.  Then I realized there might be some difficulties:

1. First, I have a lot of observations.  Computing each pair even within the group can be time-consuming.

2. The cluster method I am familiar with is K-means.  However, K-means requires prespecifying number of groups.  I am wondering if there are any cluster methods that work better for this problem.

 

I am wondering if there are any more convenient ways to do that.

That would be great if someone can help out here.

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

If you do have a lot of obs anything will be time-consuming.

The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.

Cleaning = use compress() + lowcase() + compbl()

 

Here is an idea:

data have;
   length Group $ 1 Name $ 100;
   input Group Name &;

   datalines;
A  Cox & Wilson, PC, CPA
A  Cox & Wilson, PC, CPAs
A  Cox & Wilson, PC, CPAs
A  Jonathan D. Liner
A  Jonathan Liner
B  Memphis Light, Gas & Water
B  Memphis Light, Gas & Water
B  Memphis, Light, Gas and Water
C  Homer Electric Assn
C  Homer Electric Association
C  Homer Electric Association
;

data cleaned;
   set have;
   
   length NameCompare $ 100;
   
   NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;


data want;
   set cleaned;
   by Group;
   
   length 
      NameGroup 8
      LastName $ 100
      Distance 8
   ;
   
   retain NameGroup 0;
      
   LastName = lag(NameCompare);
   
   if first.Group then do;
      NameGroup = NameGroup + 1;
      LastName = ' ';      
   end;
   else do;
      Distance = complev(NameCompare, LastName);
      
      /* finding the right threshold value will be interesting */
      if Distance > 10 then do;
         NameGroup = NameGroup + 1;
      end;
   end;
run;

View solution in original post

2 REPLIES 2
andreas_lds
Jade | Level 19

If you do have a lot of obs anything will be time-consuming.

The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.

Cleaning = use compress() + lowcase() + compbl()

 

Here is an idea:

data have;
   length Group $ 1 Name $ 100;
   input Group Name &;

   datalines;
A  Cox & Wilson, PC, CPA
A  Cox & Wilson, PC, CPAs
A  Cox & Wilson, PC, CPAs
A  Jonathan D. Liner
A  Jonathan Liner
B  Memphis Light, Gas & Water
B  Memphis Light, Gas & Water
B  Memphis, Light, Gas and Water
C  Homer Electric Assn
C  Homer Electric Association
C  Homer Electric Association
;

data cleaned;
   set have;
   
   length NameCompare $ 100;
   
   NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;


data want;
   set cleaned;
   by Group;
   
   length 
      NameGroup 8
      LastName $ 100
      Distance 8
   ;
   
   retain NameGroup 0;
      
   LastName = lag(NameCompare);
   
   if first.Group then do;
      NameGroup = NameGroup + 1;
      LastName = ' ';      
   end;
   else do;
      Distance = complev(NameCompare, LastName);
      
      /* finding the right threshold value will be interesting */
      if Distance > 10 then do;
         NameGroup = NameGroup + 1;
      end;
   end;
run;
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 2092 views
  • 2 likes
  • 3 in conversation