Programming the statistical procedures from SAS

alternative of text miner in sas base (nut cracker please help)

Reply
Frequent Contributor
Posts: 126

alternative of text miner in sas base (nut cracker please help)

Hi all,

Kinda new in the SAS world so i would gladly take your advice. My issue is as following: I dont possess text miner and i am trying to handle a case with text via SAS Base.

So what i want to figure out is to recognize if in a group of rows (5,7,3) the names match, meaning

if i have for example Alex Smith  in row 1 and Smith Alex in row 2 that the program will figure that it is the same name.

In addition there can be a rows where the name is   Smith Alexander  or Smith Alex which is the same name and i would like SAS to recognize that.

That means that if two rows have at least 2 words in common (the total for each row would be lets say 3 words) i would like to find a command so that SAS can consider them the same and therefore place them in the same group.

I hope it makes sense and hope in addition that any advice can be found here.

Thnx in advance

Trusted Advisor
Posts: 1,301

alternative of text miner in sas base (nut cracker please help)

The following definitly doesn't work but it may give you or others some ideas.  I am not good with hash objects but I assume my example could be duplicated using one instead of my proc sql to explode the inputed records.  Name matching algorithms are extremly complex and this is just a very dirty, low-quality match assurance test...  In the output of the linkage set you can clearly see some matches are definitly not desireable, but I consider the results good for a few minutes work... Smiley Happy

data have;

input id (namep1-namep2) (:$20.);

retain yes 1;

cards;

1 Alex Smith

2 Alexander Smith

3 Smith Alex

4 Smith Alexander

5 Matt Roberts

6 Matthew Robertson

7 Robert Matthews

8 Robert Mattherson

9 Matt Thompson

10 Tom Matterhorn

11 Alexis Smith

12 John Smith

13 Smith Alexis

;

run;

proc sql;

create table want as

select a.id, compress(a.namep1,,'kw') as namep1, compress(a.namep2,,'kw') as namep2, compress(b.namep1,,'kw') as namep1x, compress(b.namep2,,'kw') as namep2x, b.id as idx

   from have a

   join have b on a.id lt b.id;

quit;

data linkage;

set want;

/*----------------------------------------------*

  |  i = ignore case                             |

  |  l = remove leading blanks                   |

  |  : = truncate to shortest length and compare |

  *----------------------------------------------*/

  /* get complev of a:a */

  complev1=complev(namep1,namep1x,'il');

  complev2=complev(namep2,namep2x,'il');

  complev1t=complev(namep1,namep1x,'il:');

  complev2t=complev(namep2,namep2x,'il:');

  avgcomplev_aa=mean(of complev1,complev2,complev1t,complev2t);

  if avgcomlev_aa<3 then

   do;

    complev=avgcomplev_aa;

          aa=1;

   end;

  /* get complev of a:b */

  complev1x=complev(namep1,namep2x,'il');

  complev2x=complev(namep2,namep1x,'il');

  _complev1xt=complev(namep1,namep2x,'il:');

  _complev2xt=complev(namep2,namep1x,'il:');

  avgcomplev_ab=mean(of complev1x,complev2x,complev1xt,complev2xt);

  if avgcomplev_ab<3 then

   do;

          complev=avgcomplev_ab;

          _namep2x=namep2x;

          namep2x=namep1x;

          namep1x=_namep2x;

          aa=0;

   end;

  if complev<3;

keep id idx namep1 namep2 namep1x namep2x complev aa;

run;

proc print data=linkage; run;

                       Obs    id    namep1       namep2       namep1x      namep2x       idx    complev    aa

                         1     1    Alex         Smith        Alexander    Smith           2      2.5       1

                         2     1    Alex         Smith        Alex         Smith           3      0.0       0

                         3     1    Alex         Smith        Alexander    Smith           4      2.5       0

                         4     1    Alex         Smith        Alexis       Smith          11      1.0       1

                         5     1    Alex         Smith        John         Smith          12      2.0       1

                         6     1    Alex         Smith        Alexis       Smith          13      1.0       0

                         7     2    Alexander    Smith        Alex         Smith           3      2.5       0

                         8     2    Alexander    Smith        Alexander    Smith           4      0.0       0

                         9     2    Alexander    Smith        Alexis       Smith          11      2.5       1

                        10     2    Alexander    Smith        Alexis       Smith          13      2.5       0

                        11     3    Smith        Alex         Smith        Alexander       4      2.5       1

                        12     3    Smith        Alex         Smith        Alexis         11      1.0       0

                        13     3    Smith        Alex         Smith        John           12      2.0       0

                        14     3    Smith        Alex         Smith        Alexis         13      1.0       1

                        15     4    Smith        Alexander    Smith        Alexis         11      2.5       0

                        16     4    Smith        Alexander    Smith        Alexis         13      2.5       1

                        17     5    Matt         Roberts      Matthew      Robertson       6      2.5       1

                        18     5    Matt         Roberts      Matthews     Robert          7      2.5       0

                        19     6    Matthew      Robertson    Matthews     Robert          7      2.0       0

                        20     7    Robert       Matthews     Robert       Mattherson      8      1.5       1

                        21    11    Alexis       Smith        Alexis       Smith          13      0.0       0

Super User
Posts: 9,769

alternative of text miner in sas base (nut cracker please help)

Obviously, the following code contains little valuable information for your question. It is just a clue.

Your question is really a big challenge actually, especially for complicated situation.

Before, I remember Patrick give an Hash example for such question at a long time ago.

But not quit sure it is suited for your demand.

data have;
input id name & $20.;
cards;
1 Alex Smith
2 Alexander Smith
3 Smith Alex
4 Smith Alexander
5 Matt Roberts
6 Matthew Robertson
7 Robert Matthews
8 Robert Mattherson
9 Matt Thompson
10 Tom Matterhorn
11 Alexis Smith
12 John Smith
;
run;


proc sql;
create table want as
 select a.id as a_id,a.name as a_name,b.id as b_id,b.name as b_name
  from have as a,have as b
   where a.name =* b.name;
quit;







Ksharp

Trusted Advisor
Posts: 1,301

alternative of text miner in sas base (nut cracker please help)

Your question is definitly highly complex.  I have worked at companies who spend hundred of thousands of dollars a year developing and maintaining name cleansing and matching algorithms.  There is nicknaming, abreviations, spelling mistakes, name reversal, gender qualifing, multiple variations of soundexing (just to name a few approaches to consider), not a simple task to accomplish and with only name you will never know that you are not going to merge different people with the same name anyway...

Frequent Contributor
Posts: 104

Re: alternative of text miner in sas base (nut cracker please help)

And all of this is just a pre-cursor to true text mining where latent semantics analysis and content categorization via singular value decomposition happens...

Super User
Posts: 9,769

Re: alternative of text miner in sas base (nut cracker please help)

SAS has a Product - DataFlux which looks like can do it.

Hear it from someone. Smiley Happy

Ksharp

SAS Employee
Posts: 13

Re: alternative of text miner in sas base (nut cracker please help)


Yes, this sounds a lot like something you could use the entity resolution functions in DataFlux Data Management Studio for.

Ask a Question
Discussion stats
  • 6 replies
  • 261 views
  • 0 likes
  • 5 in conversation