BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Rohit_1990
Calcite | Level 5

Hi Experts,

 

I have a table name one with column name person_name having almost 1 million records.

sample records are as follows:

 

Person_name

Michael

Michel

kurt

kirt

Michaell

Benjamin

Mich

 

I want to check each row and group them together such that edit distance among group is not more than 2.

 

so My final dataset would be like

 

Person_name    Group_name

Michael                       A

Michel                         A  

kurt                             B

kirt                              B

Michaell                      A

Benjamin                    C    

Mich                           D   

gr                               E

 

 

So  basically I am trying to group similar records in a column 

 

Thanks! Smiley Happy

 

 

 

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

In line 28 in your log, you run the loop if n=1 then do;. In my code it is if _N_=1 then do;. There may be other errors as well, but it is bound to be that way if you change the code like this.

 

Please run my code exactly as it is written with the sample data 🙂

View solution in original post

20 REPLIES 20
ballardw
Super User

I think you need to explicitly define what you mean by "edit distance among group is not more than 2".

There are several SAS functions that do spelling distance, Complev, Compged and Spedis for example. But I  am not sure that the COMPLEV, which may come closest, means the same for 2 that you intend. Compged and Spedis would be much larger numeric values for the distance.

 

Plus your grouping may not be consistent with the results you say you want as with complev the name Mich is 2 from Michel which is 1 from Michael. So shouldn't Mich be in the same group as Michael??

 

You might want to see about reducing the names to distinct names to remove duplicate comparisons.

data have;
input person_name $;
datalines;
Michael
Michel
kurt
kirt
Michaell
Benjamin
Mich
;

proc sql;
   create table want as
   select a.person_name as namea, b.person_name as nameb
         , complev(a.person_name,b.person_name) as spelldist
   from have as a, have as b
   where  a.person_name < b.person_name
   order by spelldist, a.person_name
   ;
quit;

This is mostly to demonstrate the complev function. You could add "and spelldist le 2" to see just the close matches but with your full data would do 1million * 1million comparisons and will take a little while.

Rohit_1990
Calcite | Level 5
Hi ,

Thanks for your response,

By distance 2 I mean only two characters substitution/insertion/deletion is
allowed.

As it works with edit distance function in database.

your solution seems ok just that it is consuming bit of time given the
million records.

Thanks a lot again!!!!!!
ChrisNZ
Tourmaline | Level 20

> it is consuming bit of time given the million records.

 

Maybe can you could reduce the size of the Cartesian product by checking the similarity of the lengths or of the first letters.

mkeintz
PROC Star

You want groups such that the distance between any pair of words in the group is no more than 2.   Putting aside the computational burden, this criterion will not define a single collection of groups.

 

The "names"

AB

ABC

ABCD

ABCDE

 

will produce 2 groups using the 2 insertion/deletion rule, but those groups could be

  1. {AB,ABC,ABCD}   and {ABCDE}   or
  2. {AB,ABC}   and {ABCD,ABCDE}  or
  3. {AB}   and {ABC,ABCD,ABCDE}  or

Do you intend to define mutually exclusive groups?   Or all possible groupings that satisfy your criterion?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Rohit_1990
Calcite | Level 5
Hi,

I want mutually exclusive group.

Thanks!!!!!
ChrisNZ
Tourmaline | Level 20

In that case you'll need to group the groups.
Search for @PGStats 's article to find all paths in a directed graph network.

mkeintz
PROC Star

I just provided three examples of mutually exclusive groupings.  Which example do you want?  What is the criterion?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Rohit_1990
Calcite | Level 5
Hi Mark,

It would be graet if you could provide solution for criterion 2 and 3.


Regards,
Rohit
PeterClemmensen
Tourmaline | Level 20

@Rohit_1990 , how about this?

 

I changed your groups to be represented by numbers, since it is easier to iterate that way.

 

data have;
input Person_name:$20.;
datalines;
Michael
Michel
kurt
kirt
Michaell
Benjamin
Mich
;

data want(keep=Person_name Group_name);
   length Person_name $20 Comp_Name $20 Group_name 8;

   if _N_ = 1 then do;
      declare hash h();
      h.defineKey('Comp_Name');
      h.defineData('Comp_Name', 'Group_name');
      h.defineDone();
      declare hiter hi('h');

      declare hash hh(multidata:'Y');
      hh.defineKey('Group_name');
      hh.defineData('Group_name', 'Comp_Name');
      hh.defineDone(); 

      _Group_Name=0;
   end;

   set have;

   rc=h.find(key:Person_name);

   if rc ne 0 then do;
      rc=hi.first();
      do while (rc=0);

         if complev(Person_name, Comp_Name) le 2 then do;
            rc=hh.find();

            do while (r ne 0);
               dist=complev(Person_name, Comp_Name);
               hh.has_next(result:r);

               if r=0 & dist le 2 then do;
                  h.add(key:Person_name, data:Person_name, data:Group_Name);
                  hh.add();
                  output;return;
               end;

               else if r ne 0 & dist le 2 then rc=hh.find_next();

               else if dist > 2 then leave;
            
            end;

         end;
         
         rc=hi.next();

      end;

      _Group_Name+1;
      Group_Name=_Group_Name;
      h.add(key:Person_name, data:Person_name, data:Group_Name);
      hh.add(key:Group_Name, data:Group_Name, data:Person_name);
   end;

   output;
run;

 

This gives you

 

Capture.PNG 

Rohit_1990
Calcite | Level 5
Hi,

Thanks for your solution but somehow code is not working.
rc=h.find(key:Person_name);
It is getting aborted are above line.

Can you please check.

Regards,



PeterClemmensen
Tourmaline | Level 20

Does this error occur when you run the code on the sample data (at the top of my code) or your own data?

 

If it occurs with your own data, then make sure that your variable is named Person_Name

Rohit_1990
Calcite | Level 5
It occurs when I run code on sample data provided in example.
PeterClemmensen
Tourmaline | Level 20
Can you post the full log please?
Rohit_1990
Calcite | Level 5
Hi,
Please find the log and code below






1 The SAS System
17:11 Friday, March 29, 2019

1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program (2)';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTNAME='';
7 %LET _SASPROGRAMFILE=;
8
9 ODS _ALL_ CLOSE;
10 OPTIONS DEV=ACTIVEX;
11 GOPTIONS XPIXELS=0 YPIXELS=0;
12 FILENAME EGSR TEMP;
13 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
14 STYLE=HtmlBlue
15
STYLESHEET=(URL="file:///D:/SAS/sashome/SASEnterpriseGuide/7.1/Styles/HtmlBlue.css")
16 NOGTITLE
17 NOGFOOTNOTE
18 GPATH=&sasworklocation
19 ENCODING=UTF8
20 options(rolap="on")
21 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
22
23 GOPTIONS ACCESSIBLE;
24 data want(keep=person_name group_name);
25 length person_name $20
26 comp_name $20 group_name 8;
27
28 if n=1 then do;
29 declare hash h();
30 h.definekey('comp_name');
31 h.definedata('comp_name','group_name');
32 h.defineDone();
33 declare hiter hi('h');
34
35 declare hash hh(multidata:'y');
36 hh.definekey('group_name');
37 h.definedata('group_name','comp_name');
38 hh.defineDone();
39
40 _group_name=0;
41 end;
42
43 set have;
44
45 rc=h.find(key:person_name);
46
47 if rc ne 0 then do;
48 rc=hi.first();
49 do while(rc=0);
50
51 if complev(person_name,comp_name)le 2 then do;
52 rc=hh.find();
53
54 do while(r ne 0);
55 dist=complev(person_name,comp_name);
56 hh.has_next(result:r);
57
2 The SAS System
17:11 Friday, March 29, 2019

58 if r=0 & dist le 2 then do;
59 h.add(key:person_name, data:person_name, data:group_name);
60 hh.add();
61 output;
62 return;
63 end;
64
65 else if r ne 0 & dist le 2 then
66 rc=hh.find_next();
67
68 else if dist>2 then leave;
69
70 end;
71
72 end;
73
74 rc=hi.next();
75
76 end;
77
78 _group_name+1;
79 group_name=_group_name;
80 h.add(key:person_name, data:person_name, data:group_name);
81 hh.add(key:group_name, data:group_name, data:person_name);
82 end;
83
84 output;
85 run;

NOTE: Variable comp_name is uninitialized.
NOTE: Variable n is uninitialized.
ERROR: Uninitialized object at line 45 column 4.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION
phase.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 1 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT may be incomplete. When this step was
stopped there were 0 observations and 2 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.02 seconds


86
87 GOPTIONS NOACCESSIBLE;
88 %LET _CLIENTTASKLABEL=;
89 %LET _CLIENTPROCESSFLOWNAME=;
90 %LET _CLIENTPROJECTPATH=;
91 %LET _CLIENTPROJECTNAME=;
92 %LET _SASPROGRAMFILE=;
93
94 ;*';*";*/;quit;run;
95 ODS _ALL_ CLOSE;
96
97
98 QUIT; RUN;
99


------------------------------------------------------
------------------------------------------------------

data have;
input person_name:$20.;
datalines;
michael
michel
kurt
kirt
michaell
benjamin
mich
;
run;

data want(keep=person_name group_name);
length person_name $20
comp_name $20 group_name 8;

if n=1 then do;
declare hash h();
h.definekey('comp_name');
h.definedata('comp_name','group_name');
h.defineDone();
declare hiter hi('h');

declare hash hh(multidata:'y');
hh.definekey('group_name');
h.definedata('group_name','comp_name');
hh.defineDone();

_group_name=0;
end;

set have;

rc=h.find(key:person_name);

if rc ne 0 then do;
rc=hi.first();
do while(rc=0);

if complev(person_name,comp_name)le 2 then do;
rc=hh.find();

do while(r ne 0);
dist=complev(person_name,comp_name);
hh.has_next(result:r);

if r=0 & dist le 2 then do;
h.add(key:person_name, data:person_name, data:group_name);
hh.add();
output;
return;
end;

else if r ne 0 & dist le 2 then
rc=hh.find_next();

else if dist>2 then leave;

end;

end;

rc=hi.next();

end;

_group_name+1;
group_name=_group_name;
h.add(key:person_name, data:person_name, data:group_name);
hh.add(key:group_name, data:group_name, data:person_name);
end;

output;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 20 replies
  • 4484 views
  • 1 like
  • 5 in conversation