Solved: Re: Fuzzy logic to locate duplicates in a table

Rohit_1990 · Posted 03-25-2019 01:20 PM

Hi Experts,

I have a table name one with column name person_name having almost 1 million records.

sample records are as follows:

Person_name

Michael

Michel

kurt

kirt

Michaell

Benjamin

Mich

I want to check each row and group them together such that edit distance among group is not more than 2.

so My final dataset would be like

Person_name Group_name

Michael A

Michel A

kurt B

kirt B

Michaell A

Benjamin C

Mich D

gr E

So basically I am trying to group similar records in a column

Thanks!

PeterClemmensen · Posted 03-29-2019 12:52 PM

In line 28 in your log, you run the loop if n=1 then do;. In my code it is if _N_=1 then do;. There may be other errors as well, but it is bound to be that way if you change the code like this.

Please run my code exactly as it is written with the sample data 🙂

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

ballardw · Posted 03-25-2019 02:09 PM

I think you need to explicitly define what you mean by "edit distance among group is not more than 2".

There are several SAS functions that do spelling distance, Complev, Compged and Spedis for example. But I am not sure that the COMPLEV, which may come closest, means the same for 2 that you intend. Compged and Spedis would be much larger numeric values for the distance.

Plus your grouping may not be consistent with the results you say you want as with complev the name Mich is 2 from Michel which is 1 from Michael. So shouldn't Mich be in the same group as Michael??

You might want to see about reducing the names to distinct names to remove duplicate comparisons.

data have;
input person_name $;
datalines;
Michael
Michel
kurt
kirt
Michaell
Benjamin
Mich
;

proc sql;
   create table want as
   select a.person_name as namea, b.person_name as nameb
         , complev(a.person_name,b.person_name) as spelldist
   from have as a, have as b
   where  a.person_name < b.person_name
   order by spelldist, a.person_name
   ;
quit;

This is mostly to demonstrate the complev function. You could add "and spelldist le 2" to see just the close matches but with your full data would do 1million * 1million comparisons and will take a little while.

Rohit_1990 · Posted 03-26-2019 11:26 AM

Hi ,

Thanks for your response,

By distance 2 I mean only two characters substitution/insertion/deletion is
allowed.

As it works with edit distance function in database.

your solution seems ok just that it is consuming bit of time given the
million records.

Thanks a lot again!!!!!!

ChrisNZ · Posted 03-26-2019 11:04 PM

> it is consuming bit of time given the million records.

Maybe can you could reduce the size of the Cartesian product by checking the similarity of the lengths or of the first letters.

High-Performance SAS Coding - Third Edition

mkeintz · Posted 03-27-2019 02:37 AM

You want groups such that the distance between any pair of words in the group is no more than 2. Putting aside the computational burden, this criterion will not define a single collection of groups.

The "names"

AB

ABC

ABCD

ABCDE

will produce 2 groups using the 2 insertion/deletion rule, but those groups could be

{AB,ABC,ABCD} and {ABCDE} or
{AB,ABC} and {ABCD,ABCDE} or
{AB} and {ABC,ABCD,ABCDE} or

Do you intend to define mutually exclusive groups? Or all possible groupings that satisfy your criterion?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Rohit_1990 · Posted 03-27-2019 03:15 AM

Hi,

I want mutually exclusive group.

Thanks!!!!!

ChrisNZ · Posted 03-27-2019 04:18 AM

In that case you'll need to group the groups.
Search for @PGStats 's article to find all paths in a directed graph network.

High-Performance SAS Coding - Third Edition

mkeintz · Posted 03-27-2019 03:00 PM

I just provided three examples of mutually exclusive groupings. Which example do you want? What is the criterion?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Rohit_1990 · Posted 03-27-2019 05:14 PM

Hi Mark,

It would be graet if you could provide solution for criterion 2 and 3.

Regards,
Rohit

PeterClemmensen · Posted 03-28-2019 01:42 PM

@Rohit_1990 , how about this?

I changed your groups to be represented by numbers, since it is easier to iterate that way.

data have;
input Person_name:$20.;
datalines;
Michael
Michel
kurt
kirt
Michaell
Benjamin
Mich
;

data want(keep=Person_name Group_name);
   length Person_name $20 Comp_Name $20 Group_name 8;

   if _N_ = 1 then do;
      declare hash h();
      h.defineKey('Comp_Name');
      h.defineData('Comp_Name', 'Group_name');
      h.defineDone();
      declare hiter hi('h');

      declare hash hh(multidata:'Y');
      hh.defineKey('Group_name');
      hh.defineData('Group_name', 'Comp_Name');
      hh.defineDone(); 

      _Group_Name=0;
   end;

   set have;

   rc=h.find(key:Person_name);

   if rc ne 0 then do;
      rc=hi.first();
      do while (rc=0);

         if complev(Person_name, Comp_Name) le 2 then do;
            rc=hh.find();

            do while (r ne 0);
               dist=complev(Person_name, Comp_Name);
               hh.has_next(result:r);

               if r=0 & dist le 2 then do;
                  h.add(key:Person_name, data:Person_name, data:Group_Name);
                  hh.add();
                  output;return;
               end;

               else if r ne 0 & dist le 2 then rc=hh.find_next();

               else if dist > 2 then leave;
            
            end;

         end;
         
         rc=hi.next();

      end;

      _Group_Name+1;
      Group_Name=_Group_Name;
      h.add(key:Person_name, data:Person_name, data:Group_Name);
      hh.add(key:Group_Name, data:Group_Name, data:Person_name);
   end;

   output;
run;

This gives you

The DATA to DATA Step Macro
Blog: SASnrd

Rohit_1990 · Posted 03-29-2019 04:05 AM

Hi,

Thanks for your solution but somehow code is not working.
rc=h.find(key:Person_name);
It is getting aborted are above line.

Can you please check.

Regards,

PeterClemmensen · Posted 03-29-2019 05:26 AM

Does this error occur when you run the code on the sample data (at the top of my code) or your own data?

If it occurs with your own data, then make sure that your variable is named Person_Name

The DATA to DATA Step Macro
Blog: SASnrd

Rohit_1990 · Posted 03-29-2019 06:29 AM

It occurs when I run code on sample data provided in example.

PeterClemmensen · Posted 03-29-2019 06:37 AM

Can you post the full log please?

The DATA to DATA Step Macro
Blog: SASnrd

Rohit_1990 · Posted 03-29-2019 12:22 PM

Hi,
Please find the log and code below

1 The SAS System
17:11 Friday, March 29, 2019

1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program (2)';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTNAME='';
7 %LET _SASPROGRAMFILE=;
8
9 ODS _ALL_ CLOSE;
10 OPTIONS DEV=ACTIVEX;
11 GOPTIONS XPIXELS=0 YPIXELS=0;
12 FILENAME EGSR TEMP;
13 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
14 STYLE=HtmlBlue
15
STYLESHEET=(URL="file:///D:/SAS/sashome/SASEnterpriseGuide/7.1/Styles/HtmlBlue.css")
16 NOGTITLE
17 NOGFOOTNOTE
18 GPATH=&sasworklocation
19 ENCODING=UTF8
20 options(rolap="on")
21 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
22
23 GOPTIONS ACCESSIBLE;
24 data want(keep=person_name group_name);
25 length person_name $20
26 comp_name $20 group_name 8;
27
28 if n=1 then do;
29 declare hash h();
30 h.definekey('comp_name');
31 h.definedata('comp_name','group_name');
32 h.defineDone();
33 declare hiter hi('h');
34
35 declare hash hh(multidata:'y');
36 hh.definekey('group_name');
37 h.definedata('group_name','comp_name');
38 hh.defineDone();
39
40 _group_name=0;
41 end;
42
43 set have;
44
45 rc=h.find(key:person_name);
46
47 if rc ne 0 then do;
48 rc=hi.first();
49 do while(rc=0);
50
51 if complev(person_name,comp_name)le 2 then do;
52 rc=hh.find();
53
54 do while(r ne 0);
55 dist=complev(person_name,comp_name);
56 hh.has_next(result:r);
57
2 The SAS System
17:11 Friday, March 29, 2019

58 if r=0 & dist le 2 then do;
59 h.add(key:person_name, data:person_name, data:group_name);
60 hh.add();
61 output;
62 return;
63 end;
64
65 else if r ne 0 & dist le 2 then
66 rc=hh.find_next();
67
68 else if dist>2 then leave;
69
70 end;
71
72 end;
73
74 rc=hi.next();
75
76 end;
77
78 _group_name+1;
79 group_name=_group_name;
80 h.add(key:person_name, data:person_name, data:group_name);
81 hh.add(key:group_name, data:group_name, data:person_name);
82 end;
83
84 output;
85 run;

NOTE: Variable comp_name is uninitialized.
NOTE: Variable n is uninitialized.
ERROR: Uninitialized object at line 45 column 4.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION
phase.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 1 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT may be incomplete. When this step was
stopped there were 0 observations and 2 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.02 seconds

86
87 GOPTIONS NOACCESSIBLE;
88 %LET _CLIENTTASKLABEL=;
89 %LET _CLIENTPROCESSFLOWNAME=;
90 %LET _CLIENTPROJECTPATH=;
91 %LET _CLIENTPROJECTNAME=;
92 %LET _SASPROGRAMFILE=;
93
94 ;*';*";*/;quit;run;
95 ODS _ALL_ CLOSE;
96
97
98 QUIT; RUN;
99

------------------------------------------------------
------------------------------------------------------

data have;
input person_name:$20.;
datalines;
michael
michel
kurt
kirt
michaell
benjamin
mich
;
run;

data want(keep=person_name group_name);
length person_name $20
comp_name $20 group_name 8;

if n=1 then do;
declare hash h();
h.definekey('comp_name');
h.definedata('comp_name','group_name');
h.defineDone();
declare hiter hi('h');

declare hash hh(multidata:'y');
hh.definekey('group_name');
h.definedata('group_name','comp_name');
hh.defineDone();

_group_name=0;
end;

set have;

rc=h.find(key:person_name);

if rc ne 0 then do;
rc=hi.first();
do while(rc=0);

if complev(person_name,comp_name)le 2 then do;
rc=hh.find();

do while(r ne 0);
dist=complev(person_name,comp_name);
hh.has_next(result:r);

if r=0 & dist le 2 then do;
h.add(key:person_name, data:person_name, data:group_name);
hh.add();
output;
return;
end;

else if r ne 0 & dist le 2 then
rc=hh.find_next();

else if dist>2 then leave;

end;

end;

rc=hi.next();

end;

_group_name+1;
group_name=_group_name;
h.add(key:person_name, data:person_name, data:group_name);
hh.add(key:group_name, data:group_name, data:person_name);
end;

output;
run;

Register Today!

SAS Training: Just a Click Away