Re: How do I match names

kumarK · Posted 08-21-2016 03:33 AM

Hello Guys,

I have 2 datssets i wnat to mtach the new dataset with the old one and gives %of match like shown below

Company master file like this :

Company_Master

Adamas Pharmaceuticals, Inc.

Acacia Communications, Inc.

McHugh Inc

I will recive new compay names frequently like this

Company_New

Adamas Pharmaceuticals, Inc.

Adimas Pharmaceuticals, Inc.

McHugh Inc

Mc Hugh innc

Mccley INC

Acacia Communications

I want output like below: match with the master file and gives the % match with the master file names..any ideas i am looking at some compare functions but that not giving satisfactory results. any thoughts?

Company_New	% Matched
Adamas Pharmaceuticals LLC	90
Adimas Pharmaceuticals	80
Adamas Pharmaceuticals, Inc.	100
McHugh Inc	100
Mc Hugh innc	90
Mccley INC	0
Acacia Communications	0

Thanks.

Ksharp · Posted 08-21-2016 04:34 AM

How do you computer that %Matched .
Check the function about spell distance and edit distance Like : spedis(), complev() ..



data master;
input master $80.;
cards;
Adamas Pharmaceuticals, Inc.
Acacia Communications, Inc.
McHugh Inc
;
run;

data new;
input new $80.;
cards;
Adamas Pharmaceuticals, Inc.
Adimas Pharmaceuticals, Inc.
McHugh Inc
Mc Hugh innc
Mccley INC
Acacia Communications
;
run;
proc sql;
select new,count(*)/(select count(*) from master) as per format=percent8.2
 from new as a,master as b
  where a.new =* b.master
   group by a.new;
quit;

kumarK · Posted 08-21-2016 05:01 AM

Thanks for the reply.

%matched column thats not caluculated that i just want to show this type of output am expecting...

based on the output its just dividing the count/total count,

i am expecting is ther any method that will give probability of match?

Ksharp · Posted 08-22-2016 04:08 AM

Sorry. I don't know what you are talking about.
How do you define that ' probability of match '?
How to calculate ' probability of match' ?
Gives us an example.

Ksharp · Posted 08-21-2016 04:42 AM

If you want all the obs of Company_New :



data master;
input master $80.;
cards;
Adamas Pharmaceuticals, Inc.
Acacia Communications, Inc.
McHugh Inc
;
run;

data new;
input new $80.;
cards;
Adamas Pharmaceuticals, Inc.
Adimas Pharmaceuticals, Inc.
McHugh Inc
Mc Hugh innc
Mccley INC
Acacia Communications
;
run;
proc sql;
select new,count(master)/(select count(master) from master) as per format=percent8.2
 from new as a left join master as b
  on a.new =* b.master
   group by a.new;
quit;

PGStats · Posted 08-21-2016 03:57 PM

Example of using edit distance to do some matching with function complev:

data master;
input master &:$upcase80.;
cards;
Adamas Pharmaceuticals, Inc.
Acacia Communications, Inc.
McHugh Inc
;

data new;
input new &:$upcase80.;
cards;
Adamas Pharmaceuticals, LLC
Adamas Pharmaceuticals, Inc.
Adimas Pharmaceuticals
McHugh Inc
Mc Hugh innc
Mccley INC
Acacia Communications
;

proc sql;
create table matches as
select 
    new, master, 
    complev(new, master, 8) as complev,
    calculated complev / length(master) as distance
from 
    new inner join 
    master on substr(new,1,2) = substr(master,1,2)
where calculated complev < 8;
select * from matches
group by new
having distance = min(distance);
quit;

Note how I divided edit distance by string length to give a relative measure of spelling error. Could be useful to filter out some matches between short names.

PG

How do I match names