Hi,
I am working on something like the following.
For a dataset, the original dataset looks like this:
Group | Name |
A | Cox & Wilson, PC, CPA |
A | Cox & Wilson, PC, CPAs |
A | Cox & Wilson, PC, CPAs |
A | Jonathan D. Liner |
A | Jonathan Liner |
B | Memphis Light, Gas & Water |
B | Memphis Light, Gas & Water |
B | Memphis, Light, Gas and Water |
C | Homer Electric Assn |
C | Homer Electric Association |
C | Homer Electric Association |
What I want to do is to categorize similar names within each group. The dataset I want looks like this:
Group | Name | NameGroup |
A | Cox & Wilson, PC, CPA | 1 |
A | Cox & Wilson, PC, CPAs | 1 |
A | Cox & Wilson, PC, CPAs | 1 |
A | Jonathan D. Liner | 2 |
A | Jonathan Liner | 2 |
B | Memphis Light, Gas & Water | 3 |
B | Memphis Light, Gas & Water | 3 |
B | Memphis, Light, Gas and Water | 3 |
C | Homer Electric Assn | 4 |
C | Homer Electric Association | 4 |
C | Homer Electric Association | 4 |
My initial thought is to compute string distance (Levenshtein) between each pair of two words and then use some cluster methods. Then I realized there might be some difficulties:
1. First, I have a lot of observations. Computing each pair even within the group can be time-consuming.
2. The cluster method I am familiar with is K-means. However, K-means requires prespecifying number of groups. I am wondering if there are any cluster methods that work better for this problem.
I am wondering if there are any more convenient ways to do that.
That would be great if someone can help out here.
If you do have a lot of obs anything will be time-consuming.
The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.
Cleaning = use compress() + lowcase() + compbl()
Here is an idea:
data have;
length Group $ 1 Name $ 100;
input Group Name &;
datalines;
A Cox & Wilson, PC, CPA
A Cox & Wilson, PC, CPAs
A Cox & Wilson, PC, CPAs
A Jonathan D. Liner
A Jonathan Liner
B Memphis Light, Gas & Water
B Memphis Light, Gas & Water
B Memphis, Light, Gas and Water
C Homer Electric Assn
C Homer Electric Association
C Homer Electric Association
;
data cleaned;
set have;
length NameCompare $ 100;
NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;
data want;
set cleaned;
by Group;
length
NameGroup 8
LastName $ 100
Distance 8
;
retain NameGroup 0;
LastName = lag(NameCompare);
if first.Group then do;
NameGroup = NameGroup + 1;
LastName = ' ';
end;
else do;
Distance = complev(NameCompare, LastName);
/* finding the right threshold value will be interesting */
if Distance > 10 then do;
NameGroup = NameGroup + 1;
end;
end;
run;
If you do have a lot of obs anything will be time-consuming.
The idea of comparing pairs is interesting, but i would start by cleaning the strings and use complev() on the name variable.
Cleaning = use compress() + lowcase() + compbl()
Here is an idea:
data have;
length Group $ 1 Name $ 100;
input Group Name &;
datalines;
A Cox & Wilson, PC, CPA
A Cox & Wilson, PC, CPAs
A Cox & Wilson, PC, CPAs
A Jonathan D. Liner
A Jonathan Liner
B Memphis Light, Gas & Water
B Memphis Light, Gas & Water
B Memphis, Light, Gas and Water
C Homer Electric Assn
C Homer Electric Association
C Homer Electric Association
;
data cleaned;
set have;
length NameCompare $ 100;
NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;
data want;
set cleaned;
by Group;
length
NameGroup 8
LastName $ 100
Distance 8
;
retain NameGroup 0;
LastName = lag(NameCompare);
if first.Group then do;
NameGroup = NameGroup + 1;
LastName = ' ';
end;
else do;
Distance = complev(NameCompare, LastName);
/* finding the right threshold value will be interesting */
if Distance > 10 then do;
NameGroup = NameGroup + 1;
end;
end;
run;
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.