BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
daradanye
Obsidian | Level 7

Hi,

 

I am working on something like the following.

 

For a dataset, the original dataset looks like this:

Group Name
A Cox & Wilson, PC, CPA
A Cox & Wilson, PC, CPAs
A Cox & Wilson, PC, CPAs
A Jonathan D. Liner
A Jonathan Liner
B Memphis Light, Gas & Water
B Memphis Light, Gas & Water
B Memphis, Light, Gas and Water
C Homer Electric Assn
C Homer Electric Association
C Homer Electric Association

 

What I want to do is to categorize similar names within each group.  The dataset I want looks like this:

Group Name NameGroup
A Cox & Wilson, PC, CPA 1
A Cox & Wilson, PC, CPAs 1
A Cox & Wilson, PC, CPAs 1
A Jonathan D. Liner 2
A Jonathan Liner 2
B Memphis Light, Gas & Water 3
B Memphis Light, Gas & Water 3
B Memphis, Light, Gas and Water 3
C Homer Electric Assn 4
C Homer Electric Association 4
C Homer Electric Association 4

 

 

My initial thought is to compute string distance (Levenshtein) between each pair of two words and then use some cluster methods.  Then I realized there might be some difficulties:

1. First, I have a lot of observations.  Computing each pair even within the group can be time-consuming.

2. The cluster method I am familiar with is K-means.  However, K-means requires prespecifying number of groups.  I am wondering if there are any cluster methods that work better for this problem.

 

I am wondering if there are any more convenient ways to do that.

That would be great if someone can help out here.

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

If you do have a lot of obs anything will be time-consuming.

The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.

Cleaning = use compress() + lowcase() + compbl()

 

Here is an idea:

data have;
   length Group $ 1 Name $ 100;
   input Group Name &;

   datalines;
A  Cox & Wilson, PC, CPA
A  Cox & Wilson, PC, CPAs
A  Cox & Wilson, PC, CPAs
A  Jonathan D. Liner
A  Jonathan Liner
B  Memphis Light, Gas & Water
B  Memphis Light, Gas & Water
B  Memphis, Light, Gas and Water
C  Homer Electric Assn
C  Homer Electric Association
C  Homer Electric Association
;

data cleaned;
   set have;
   
   length NameCompare $ 100;
   
   NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;


data want;
   set cleaned;
   by Group;
   
   length 
      NameGroup 8
      LastName $ 100
      Distance 8
   ;
   
   retain NameGroup 0;
      
   LastName = lag(NameCompare);
   
   if first.Group then do;
      NameGroup = NameGroup + 1;
      LastName = ' ';      
   end;
   else do;
      Distance = complev(NameCompare, LastName);
      
      /* finding the right threshold value will be interesting */
      if Distance > 10 then do;
         NameGroup = NameGroup + 1;
      end;
   end;
run;

View solution in original post

2 REPLIES 2
andreas_lds
Jade | Level 19

If you do have a lot of obs anything will be time-consuming.

The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.

Cleaning = use compress() + lowcase() + compbl()

 

Here is an idea:

data have;
   length Group $ 1 Name $ 100;
   input Group Name &;

   datalines;
A  Cox & Wilson, PC, CPA
A  Cox & Wilson, PC, CPAs
A  Cox & Wilson, PC, CPAs
A  Jonathan D. Liner
A  Jonathan Liner
B  Memphis Light, Gas & Water
B  Memphis Light, Gas & Water
B  Memphis, Light, Gas and Water
C  Homer Electric Assn
C  Homer Electric Association
C  Homer Electric Association
;

data cleaned;
   set have;
   
   length NameCompare $ 100;
   
   NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;


data want;
   set cleaned;
   by Group;
   
   length 
      NameGroup 8
      LastName $ 100
      Distance 8
   ;
   
   retain NameGroup 0;
      
   LastName = lag(NameCompare);
   
   if first.Group then do;
      NameGroup = NameGroup + 1;
      LastName = ' ';      
   end;
   else do;
      Distance = complev(NameCompare, LastName);
      
      /* finding the right threshold value will be interesting */
      if Distance > 10 then do;
         NameGroup = NameGroup + 1;
      end;
   end;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 1072 views
  • 2 likes
  • 3 in conversation