Re: alternative of text miner in sas base (nut cracker please help)

chemicalab · Posted 09-08-2011 09:34 AM

Hi all,

Kinda new in the SAS world so i would gladly take your advice. My issue is as following: I dont possess text miner and i am trying to handle a case with text via SAS Base.

So what i want to figure out is to recognize if in a group of rows (5,7,3) the names match, meaning

if i have for example Alex Smith in row 1 and Smith Alex in row 2 that the program will figure that it is the same name.

In addition there can be a rows where the name is Smith Alexander or Smith Alex which is the same name and i would like SAS to recognize that.

That means that if two rows have at least 2 words in common (the total for each row would be lets say 3 words) i would like to find a command so that SAS can consider them the same and therefore place them in the same group.

I hope it makes sense and hope in addition that any advice can be found here.

Thnx in advance

FriedEgg · Posted 09-08-2011 03:34 PM

The following definitly doesn't work but it may give you or others some ideas. I am not good with hash objects but I assume my example could be duplicated using one instead of my proc sql to explode the inputed records. Name matching algorithms are extremly complex and this is just a very dirty, low-quality match assurance test... In the output of the linkage set you can clearly see some matches are definitly not desireable, but I consider the results good for a few minutes work...

data have;

input id (namep1-namep2) (:$20.);

retain yes 1;

cards;

1 Alex Smith

2 Alexander Smith

3 Smith Alex

4 Smith Alexander

5 Matt Roberts

6 Matthew Robertson

7 Robert Matthews

8 Robert Mattherson

9 Matt Thompson

10 Tom Matterhorn

11 Alexis Smith

12 John Smith

13 Smith Alexis

;

run;

proc sql;

create table want as

select a.id, compress(a.namep1,,'kw') as namep1, compress(a.namep2,,'kw') as namep2, compress(b.namep1,,'kw') as namep1x, compress(b.namep2,,'kw') as namep2x, b.id as idx

from have a

join have b on a.id lt b.id;

quit;

data linkage;

set want;

/*----------------------------------------------*

| i = ignore case |

| l = remove leading blanks |

| : = truncate to shortest length and compare |

*----------------------------------------------*/

/* get complev of a:a */

complev1=complev(namep1,namep1x,'il');

complev2=complev(namep2,namep2x,'il');

complev1t=complev(namep1,namep1x,'il:');

complev2t=complev(namep2,namep2x,'il:');

avgcomplev_aa=mean(of complev1,complev2,complev1t,complev2t);

if avgcomlev_aa<3 then

do;

complev=avgcomplev_aa;

aa=1;

end;

/* get complev of a:b */

complev1x=complev(namep1,namep2x,'il');

complev2x=complev(namep2,namep1x,'il');

_complev1xt=complev(namep1,namep2x,'il:');

_complev2xt=complev(namep2,namep1x,'il:');

avgcomplev_ab=mean(of complev1x,complev2x,complev1xt,complev2xt);

if avgcomplev_ab<3 then

do;

complev=avgcomplev_ab;

_namep2x=namep2x;

namep2x=namep1x;

namep1x=_namep2x;

aa=0;

end;

if complev<3;

keep id idx namep1 namep2 namep1x namep2x complev aa;

run;

proc print data=linkage; run;

Obs id namep1 namep2 namep1x namep2x idx complev aa

1 1 Alex Smith Alexander Smith 2 2.5 1

2 1 Alex Smith Alex Smith 3 0.0 0

3 1 Alex Smith Alexander Smith 4 2.5 0

4 1 Alex Smith Alexis Smith 11 1.0 1

5 1 Alex Smith John Smith 12 2.0 1

6 1 Alex Smith Alexis Smith 13 1.0 0

7 2 Alexander Smith Alex Smith 3 2.5 0

8 2 Alexander Smith Alexander Smith 4 0.0 0

9 2 Alexander Smith Alexis Smith 11 2.5 1

10 2 Alexander Smith Alexis Smith 13 2.5 0

11 3 Smith Alex Smith Alexander 4 2.5 1

12 3 Smith Alex Smith Alexis 11 1.0 0

13 3 Smith Alex Smith John 12 2.0 0

14 3 Smith Alex Smith Alexis 13 1.0 1

15 4 Smith Alexander Smith Alexis 11 2.5 0

16 4 Smith Alexander Smith Alexis 13 2.5 1

17 5 Matt Roberts Matthew Robertson 6 2.5 1

18 5 Matt Roberts Matthews Robert 7 2.5 0

19 6 Matthew Robertson Matthews Robert 7 2.0 0

20 7 Robert Matthews Robert Mattherson 8 1.5 1

21 11 Alexis Smith Alexis Smith 13 0.0 0

Ksharp · Posted 09-08-2011 11:41 PM

Obviously, the following code contains little valuable information for your question. It is just a clue.

Your question is really a big challenge actually, especially for complicated situation.

Before, I remember Patrick give an Hash example for such question at a long time ago.

But not quit sure it is suited for your demand.

data have;
input id name & $20.;
cards;
1 Alex Smith
2 Alexander Smith
3 Smith Alex
4 Smith Alexander
5 Matt Roberts
6 Matthew Robertson
7 Robert Matthews
8 Robert Mattherson
9 Matt Thompson
10 Tom Matterhorn
11 Alexis Smith
12 John Smith
;
run;


proc sql;
create table want as
 select a.id as a_id,a.name as a_name,b.id as b_id,b.name as b_name
  from have as a,have as b
   where a.name =* b.name;
quit;

Ksharp

FriedEgg · Posted 09-09-2011 02:26 PM

Your question is definitly highly complex. I have worked at companies who spend hundred of thousands of dollars a year developing and maintaining name cleansing and matching algorithms. There is nicknaming, abreviations, spelling mistakes, name reversal, gender qualifing, multiple variations of soundexing (just to name a few approaches to consider), not a simple task to accomplish and with only name you will never know that you are not going to merge different people with the same name anyway...

DLing · Posted 09-09-2011 04:28 PM

And all of this is just a pre-cursor to true text mining where latent semantics analysis and content categorization via singular value decomposition happens...

Ksharp · Posted 09-14-2011 02:11 AM

SAS has a Product - DataFlux which looks like can do it.

Hear it from someone.

Ksharp

Stuart_sas_pubs · Posted 09-23-2011 02:09 PM

Yes, this sounds a lot like something you could use the entity resolution functions in DataFlux Data Management Studio for.

alternative of text miner in sas base (nut cracker please help)

alternative of text miner in sas base (nut cracker please help)

alternative of text miner in sas base (nut cracker please help)

alternative of text miner in sas base (nut cracker please help)