Hi Experts,
I have a table name one with column name person_name having almost 1 million records.
sample records are as follows:
Person_name
Michael
Michel
kurt
kirt
Michaell
Benjamin
Mich
I want to check each row and group them together such that edit distance among group is not more than 2.
so My final dataset would be like
Person_name Group_name
Michael A
Michel A
kurt B
kirt B
Michaell A
Benjamin C
Mich D
gr E
So basically I am trying to group similar records in a column
Thanks!
In line 28 in your log, you run the loop if n=1 then do;. In my code it is if _N_=1 then do;. There may be other errors as well, but it is bound to be that way if you change the code like this.
Please run my code exactly as it is written with the sample data 🙂
I think you need to explicitly define what you mean by "edit distance among group is not more than 2".
There are several SAS functions that do spelling distance, Complev, Compged and Spedis for example. But I am not sure that the COMPLEV, which may come closest, means the same for 2 that you intend. Compged and Spedis would be much larger numeric values for the distance.
Plus your grouping may not be consistent with the results you say you want as with complev the name Mich is 2 from Michel which is 1 from Michael. So shouldn't Mich be in the same group as Michael??
You might want to see about reducing the names to distinct names to remove duplicate comparisons.
data have; input person_name $; datalines; Michael Michel kurt kirt Michaell Benjamin Mich ; proc sql; create table want as select a.person_name as namea, b.person_name as nameb , complev(a.person_name,b.person_name) as spelldist from have as a, have as b where a.person_name < b.person_name order by spelldist, a.person_name ; quit;
This is mostly to demonstrate the complev function. You could add "and spelldist le 2" to see just the close matches but with your full data would do 1million * 1million comparisons and will take a little while.
> it is consuming bit of time given the million records.
Maybe can you could reduce the size of the Cartesian product by checking the similarity of the lengths or of the first letters.
You want groups such that the distance between any pair of words in the group is no more than 2. Putting aside the computational burden, this criterion will not define a single collection of groups.
The "names"
AB
ABC
ABCD
ABCDE
will produce 2 groups using the 2 insertion/deletion rule, but those groups could be
Do you intend to define mutually exclusive groups? Or all possible groupings that satisfy your criterion?
In that case you'll need to group the groups.
Search for @PGStats 's article to find all paths in a directed graph network.
I just provided three examples of mutually exclusive groupings. Which example do you want? What is the criterion?
@Rohit_1990 , how about this?
I changed your groups to be represented by numbers, since it is easier to iterate that way.
data have;
input Person_name:$20.;
datalines;
Michael
Michel
kurt
kirt
Michaell
Benjamin
Mich
;
data want(keep=Person_name Group_name);
length Person_name $20 Comp_Name $20 Group_name 8;
if _N_ = 1 then do;
declare hash h();
h.defineKey('Comp_Name');
h.defineData('Comp_Name', 'Group_name');
h.defineDone();
declare hiter hi('h');
declare hash hh(multidata:'Y');
hh.defineKey('Group_name');
hh.defineData('Group_name', 'Comp_Name');
hh.defineDone();
_Group_Name=0;
end;
set have;
rc=h.find(key:Person_name);
if rc ne 0 then do;
rc=hi.first();
do while (rc=0);
if complev(Person_name, Comp_Name) le 2 then do;
rc=hh.find();
do while (r ne 0);
dist=complev(Person_name, Comp_Name);
hh.has_next(result:r);
if r=0 & dist le 2 then do;
h.add(key:Person_name, data:Person_name, data:Group_Name);
hh.add();
output;return;
end;
else if r ne 0 & dist le 2 then rc=hh.find_next();
else if dist > 2 then leave;
end;
end;
rc=hi.next();
end;
_Group_Name+1;
Group_Name=_Group_Name;
h.add(key:Person_name, data:Person_name, data:Group_Name);
hh.add(key:Group_Name, data:Group_Name, data:Person_name);
end;
output;
run;
This gives you
Does this error occur when you run the code on the sample data (at the top of my code) or your own data?
If it occurs with your own data, then make sure that your variable is named Person_Name
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.