Hi all,
Kinda new in the SAS world so i would gladly take your advice. My issue is as following: I dont possess text miner and i am trying to handle a case with text via SAS Base.
So what i want to figure out is to recognize if in a group of rows (5,7,3) the names match, meaning
if i have for example Alex Smith in row 1 and Smith Alex in row 2 that the program will figure that it is the same name.
In addition there can be a rows where the name is Smith Alexander or Smith Alex which is the same name and i would like SAS to recognize that.
That means that if two rows have at least 2 words in common (the total for each row would be lets say 3 words) i would like to find a command so that SAS can consider them the same and therefore place them in the same group.
I hope it makes sense and hope in addition that any advice can be found here.
Thnx in advance
The following definitly doesn't work but it may give you or others some ideas. I am not good with hash objects but I assume my example could be duplicated using one instead of my proc sql to explode the inputed records. Name matching algorithms are extremly complex and this is just a very dirty, low-quality match assurance test... In the output of the linkage set you can clearly see some matches are definitly not desireable, but I consider the results good for a few minutes work...
data have;
input id (namep1-namep2) (:$20.);
retain yes 1;
cards;
1 Alex Smith
2 Alexander Smith
3 Smith Alex
4 Smith Alexander
5 Matt Roberts
6 Matthew Robertson
7 Robert Matthews
8 Robert Mattherson
9 Matt Thompson
10 Tom Matterhorn
11 Alexis Smith
12 John Smith
13 Smith Alexis
;
run;
proc sql;
create table want as
select a.id, compress(a.namep1,,'kw') as namep1, compress(a.namep2,,'kw') as namep2, compress(b.namep1,,'kw') as namep1x, compress(b.namep2,,'kw') as namep2x, b.id as idx
from have a
join have b on a.id lt b.id;
quit;
data linkage;
set want;
/*----------------------------------------------*
| i = ignore case |
| l = remove leading blanks |
| : = truncate to shortest length and compare |
*----------------------------------------------*/
/* get complev of a:a */
complev1=complev(namep1,namep1x,'il');
complev2=complev(namep2,namep2x,'il');
complev1t=complev(namep1,namep1x,'il:');
complev2t=complev(namep2,namep2x,'il:');
avgcomplev_aa=mean(of complev1,complev2,complev1t,complev2t);
if avgcomlev_aa<3 then
do;
complev=avgcomplev_aa;
aa=1;
end;
/* get complev of a:b */
complev1x=complev(namep1,namep2x,'il');
complev2x=complev(namep2,namep1x,'il');
_complev1xt=complev(namep1,namep2x,'il:');
_complev2xt=complev(namep2,namep1x,'il:');
avgcomplev_ab=mean(of complev1x,complev2x,complev1xt,complev2xt);
if avgcomplev_ab<3 then
do;
complev=avgcomplev_ab;
_namep2x=namep2x;
namep2x=namep1x;
namep1x=_namep2x;
aa=0;
end;
if complev<3;
keep id idx namep1 namep2 namep1x namep2x complev aa;
run;
proc print data=linkage; run;
Obs id namep1 namep2 namep1x namep2x idx complev aa
1 1 Alex Smith Alexander Smith 2 2.5 1
2 1 Alex Smith Alex Smith 3 0.0 0
3 1 Alex Smith Alexander Smith 4 2.5 0
4 1 Alex Smith Alexis Smith 11 1.0 1
5 1 Alex Smith John Smith 12 2.0 1
6 1 Alex Smith Alexis Smith 13 1.0 0
7 2 Alexander Smith Alex Smith 3 2.5 0
8 2 Alexander Smith Alexander Smith 4 0.0 0
9 2 Alexander Smith Alexis Smith 11 2.5 1
10 2 Alexander Smith Alexis Smith 13 2.5 0
11 3 Smith Alex Smith Alexander 4 2.5 1
12 3 Smith Alex Smith Alexis 11 1.0 0
13 3 Smith Alex Smith John 12 2.0 0
14 3 Smith Alex Smith Alexis 13 1.0 1
15 4 Smith Alexander Smith Alexis 11 2.5 0
16 4 Smith Alexander Smith Alexis 13 2.5 1
17 5 Matt Roberts Matthew Robertson 6 2.5 1
18 5 Matt Roberts Matthews Robert 7 2.5 0
19 6 Matthew Robertson Matthews Robert 7 2.0 0
20 7 Robert Matthews Robert Mattherson 8 1.5 1
21 11 Alexis Smith Alexis Smith 13 0.0 0
Obviously, the following code contains little valuable information for your question. It is just a clue.
Your question is really a big challenge actually, especially for complicated situation.
Before, I remember Patrick give an Hash example for such question at a long time ago.
But not quit sure it is suited for your demand.
data have; input id name & $20.; cards; 1 Alex Smith 2 Alexander Smith 3 Smith Alex 4 Smith Alexander 5 Matt Roberts 6 Matthew Robertson 7 Robert Matthews 8 Robert Mattherson 9 Matt Thompson 10 Tom Matterhorn 11 Alexis Smith 12 John Smith ; run; proc sql; create table want as select a.id as a_id,a.name as a_name,b.id as b_id,b.name as b_name from have as a,have as b where a.name =* b.name; quit;
Ksharp
Your question is definitly highly complex. I have worked at companies who spend hundred of thousands of dollars a year developing and maintaining name cleansing and matching algorithms. There is nicknaming, abreviations, spelling mistakes, name reversal, gender qualifing, multiple variations of soundexing (just to name a few approaches to consider), not a simple task to accomplish and with only name you will never know that you are not going to merge different people with the same name anyway...
And all of this is just a pre-cursor to true text mining where latent semantics analysis and content categorization via singular value decomposition happens...
SAS has a Product - DataFlux which looks like can do it.
Hear it from someone.
Ksharp
Yes, this sounds a lot like something you could use the entity resolution functions in DataFlux Data Management Studio for.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.