Hi,
I am working on something like the following.
For a dataset, the original dataset looks like this:
Group | Name |
A | Cox & Wilson, PC, CPA |
A | Cox & Wilson, PC, CPAs |
A | Cox & Wilson, PC, CPAs |
A | Jonathan D. Liner |
A | Jonathan Liner |
B | Memphis Light, Gas & Water |
B | Memphis Light, Gas & Water |
B | Memphis, Light, Gas and Water |
C | Homer Electric Assn |
C | Homer Electric Association |
C | Homer Electric Association |
What I want to do is to categorize similar names within each group. The dataset I want looks like this:
Group | Name | NameGroup |
A | Cox & Wilson, PC, CPA | 1 |
A | Cox & Wilson, PC, CPAs | 1 |
A | Cox & Wilson, PC, CPAs | 1 |
A | Jonathan D. Liner | 2 |
A | Jonathan Liner | 2 |
B | Memphis Light, Gas & Water | 3 |
B | Memphis Light, Gas & Water | 3 |
B | Memphis, Light, Gas and Water | 3 |
C | Homer Electric Assn | 4 |
C | Homer Electric Association | 4 |
C | Homer Electric Association | 4 |
My initial thought is to compute string distance (Levenshtein) between each pair of two words and then use some cluster methods. Then I realized there might be some difficulties:
1. First, I have a lot of observations. Computing each pair even within the group can be time-consuming.
2. The cluster method I am familiar with is K-means. However, K-means requires prespecifying number of groups. I am wondering if there are any cluster methods that work better for this problem.
I am wondering if there are any more convenient ways to do that.
That would be great if someone can help out here.
If you do have a lot of obs anything will be time-consuming.
The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.
Cleaning = use compress() + lowcase() + compbl()
Here is an idea:
data have;
length Group $ 1 Name $ 100;
input Group Name &;
datalines;
A Cox & Wilson, PC, CPA
A Cox & Wilson, PC, CPAs
A Cox & Wilson, PC, CPAs
A Jonathan D. Liner
A Jonathan Liner
B Memphis Light, Gas & Water
B Memphis Light, Gas & Water
B Memphis, Light, Gas and Water
C Homer Electric Assn
C Homer Electric Association
C Homer Electric Association
;
data cleaned;
set have;
length NameCompare $ 100;
NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;
data want;
set cleaned;
by Group;
length
NameGroup 8
LastName $ 100
Distance 8
;
retain NameGroup 0;
LastName = lag(NameCompare);
if first.Group then do;
NameGroup = NameGroup + 1;
LastName = ' ';
end;
else do;
Distance = complev(NameCompare, LastName);
/* finding the right threshold value will be interesting */
if Distance > 10 then do;
NameGroup = NameGroup + 1;
end;
end;
run;
If you do have a lot of obs anything will be time-consuming.
The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.
Cleaning = use compress() + lowcase() + compbl()
Here is an idea:
data have;
length Group $ 1 Name $ 100;
input Group Name &;
datalines;
A Cox & Wilson, PC, CPA
A Cox & Wilson, PC, CPAs
A Cox & Wilson, PC, CPAs
A Jonathan D. Liner
A Jonathan Liner
B Memphis Light, Gas & Water
B Memphis Light, Gas & Water
B Memphis, Light, Gas and Water
C Homer Electric Assn
C Homer Electric Association
C Homer Electric Association
;
data cleaned;
set have;
length NameCompare $ 100;
NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;
data want;
set cleaned;
by Group;
length
NameGroup 8
LastName $ 100
Distance 8
;
retain NameGroup 0;
LastName = lag(NameCompare);
if first.Group then do;
NameGroup = NameGroup + 1;
LastName = ' ';
end;
else do;
Distance = complev(NameCompare, LastName);
/* finding the right threshold value will be interesting */
if Distance > 10 then do;
NameGroup = NameGroup + 1;
end;
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.