BookmarkSubscribeRSS Feed
chemicalab
Fluorite | Level 6

Hi all,

Kinda new in the SAS world so i would gladly take your advice. My issue is as following: I dont possess text miner and i am trying to handle a case with text via SAS Base.

So what i want to figure out is to recognize if in a group of rows (5,7,3) the names match, meaning

if i have for example Alex Smith  in row 1 and Smith Alex in row 2 that the program will figure that it is the same name.

In addition there can be a rows where the name is   Smith Alexander  or Smith Alex which is the same name and i would like SAS to recognize that.

That means that if two rows have at least 2 words in common (the total for each row would be lets say 3 words) i would like to find a command so that SAS can consider them the same and therefore place them in the same group.

I hope it makes sense and hope in addition that any advice can be found here.

Thnx in advance

6 REPLIES 6
FriedEgg
SAS Employee

The following definitly doesn't work but it may give you or others some ideas.  I am not good with hash objects but I assume my example could be duplicated using one instead of my proc sql to explode the inputed records.  Name matching algorithms are extremly complex and this is just a very dirty, low-quality match assurance test...  In the output of the linkage set you can clearly see some matches are definitly not desireable, but I consider the results good for a few minutes work... Smiley Happy

data have;

input id (namep1-namep2) (:$20.);

retain yes 1;

cards;

1 Alex Smith

2 Alexander Smith

3 Smith Alex

4 Smith Alexander

5 Matt Roberts

6 Matthew Robertson

7 Robert Matthews

8 Robert Mattherson

9 Matt Thompson

10 Tom Matterhorn

11 Alexis Smith

12 John Smith

13 Smith Alexis

;

run;

proc sql;

create table want as

select a.id, compress(a.namep1,,'kw') as namep1, compress(a.namep2,,'kw') as namep2, compress(b.namep1,,'kw') as namep1x, compress(b.namep2,,'kw') as namep2x, b.id as idx

   from have a

   join have b on a.id lt b.id;

quit;

data linkage;

set want;

/*----------------------------------------------*

  |  i = ignore case                             |

  |  l = remove leading blanks                   |

  |  : = truncate to shortest length and compare |

  *----------------------------------------------*/

  /* get complev of a:a */

  complev1=complev(namep1,namep1x,'il');

  complev2=complev(namep2,namep2x,'il');

  complev1t=complev(namep1,namep1x,'il:');

  complev2t=complev(namep2,namep2x,'il:');

  avgcomplev_aa=mean(of complev1,complev2,complev1t,complev2t);

  if avgcomlev_aa<3 then

   do;

    complev=avgcomplev_aa;

          aa=1;

   end;

  /* get complev of a:b */

  complev1x=complev(namep1,namep2x,'il');

  complev2x=complev(namep2,namep1x,'il');

  _complev1xt=complev(namep1,namep2x,'il:');

  _complev2xt=complev(namep2,namep1x,'il:');

  avgcomplev_ab=mean(of complev1x,complev2x,complev1xt,complev2xt);

  if avgcomplev_ab<3 then

   do;

          complev=avgcomplev_ab;

          _namep2x=namep2x;

          namep2x=namep1x;

          namep1x=_namep2x;

          aa=0;

   end;

  if complev<3;

keep id idx namep1 namep2 namep1x namep2x complev aa;

run;

proc print data=linkage; run;

                       Obs    id    namep1       namep2       namep1x      namep2x       idx    complev    aa

                         1     1    Alex         Smith        Alexander    Smith           2      2.5       1

                         2     1    Alex         Smith        Alex         Smith           3      0.0       0

                         3     1    Alex         Smith        Alexander    Smith           4      2.5       0

                         4     1    Alex         Smith        Alexis       Smith          11      1.0       1

                         5     1    Alex         Smith        John         Smith          12      2.0       1

                         6     1    Alex         Smith        Alexis       Smith          13      1.0       0

                         7     2    Alexander    Smith        Alex         Smith           3      2.5       0

                         8     2    Alexander    Smith        Alexander    Smith           4      0.0       0

                         9     2    Alexander    Smith        Alexis       Smith          11      2.5       1

                        10     2    Alexander    Smith        Alexis       Smith          13      2.5       0

                        11     3    Smith        Alex         Smith        Alexander       4      2.5       1

                        12     3    Smith        Alex         Smith        Alexis         11      1.0       0

                        13     3    Smith        Alex         Smith        John           12      2.0       0

                        14     3    Smith        Alex         Smith        Alexis         13      1.0       1

                        15     4    Smith        Alexander    Smith        Alexis         11      2.5       0

                        16     4    Smith        Alexander    Smith        Alexis         13      2.5       1

                        17     5    Matt         Roberts      Matthew      Robertson       6      2.5       1

                        18     5    Matt         Roberts      Matthews     Robert          7      2.5       0

                        19     6    Matthew      Robertson    Matthews     Robert          7      2.0       0

                        20     7    Robert       Matthews     Robert       Mattherson      8      1.5       1

                        21    11    Alexis       Smith        Alexis       Smith          13      0.0       0

Ksharp
Super User

Obviously, the following code contains little valuable information for your question. It is just a clue.

Your question is really a big challenge actually, especially for complicated situation.

Before, I remember Patrick give an Hash example for such question at a long time ago.

But not quit sure it is suited for your demand.

data have;
input id name & $20.;
cards;
1 Alex Smith
2 Alexander Smith
3 Smith Alex
4 Smith Alexander
5 Matt Roberts
6 Matthew Robertson
7 Robert Matthews
8 Robert Mattherson
9 Matt Thompson
10 Tom Matterhorn
11 Alexis Smith
12 John Smith
;
run;


proc sql;
create table want as
 select a.id as a_id,a.name as a_name,b.id as b_id,b.name as b_name
  from have as a,have as b
   where a.name =* b.name;
quit;







Ksharp

FriedEgg
SAS Employee

Your question is definitly highly complex.  I have worked at companies who spend hundred of thousands of dollars a year developing and maintaining name cleansing and matching algorithms.  There is nicknaming, abreviations, spelling mistakes, name reversal, gender qualifing, multiple variations of soundexing (just to name a few approaches to consider), not a simple task to accomplish and with only name you will never know that you are not going to merge different people with the same name anyway...

DLing
Obsidian | Level 7

And all of this is just a pre-cursor to true text mining where latent semantics analysis and content categorization via singular value decomposition happens...

Ksharp
Super User

SAS has a Product - DataFlux which looks like can do it.

Hear it from someone. Smiley Happy

Ksharp

Stuart_sas_pubs
SAS Employee


Yes, this sounds a lot like something you could use the entity resolution functions in DataFlux Data Management Studio for.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1516 views
  • 0 likes
  • 5 in conversation