Hello Guys,
I have 2 datssets i wnat to mtach the new dataset with the old one and gives %of match like shown below
Company master file like this :
Company_Master |
Adamas Pharmaceuticals, Inc. |
Acacia Communications, Inc. |
McHugh Inc |
I will recive new compay names frequently like this
Company_New |
Adamas Pharmaceuticals, Inc. |
Adimas Pharmaceuticals, Inc. |
McHugh Inc |
Mc Hugh innc |
Mccley INC |
Acacia Communications |
I want output like below: match with the master file and gives the % match with the master file names..any ideas i am looking at some compare functions but that not giving satisfactory results. any thoughts?
Company_New | % Matched |
Adamas Pharmaceuticals LLC | 90 |
Adimas Pharmaceuticals | 80 |
Adamas Pharmaceuticals, Inc. | 100 |
McHugh Inc | 100 |
Mc Hugh innc | 90 |
Mccley INC | 0 |
Acacia Communications | 0 |
Thanks.
How do you computer that %Matched . Check the function about spell distance and edit distance Like : spedis(), complev() .. data master; input master $80.; cards; Adamas Pharmaceuticals, Inc. Acacia Communications, Inc. McHugh Inc ; run; data new; input new $80.; cards; Adamas Pharmaceuticals, Inc. Adimas Pharmaceuticals, Inc. McHugh Inc Mc Hugh innc Mccley INC Acacia Communications ; run; proc sql; select new,count(*)/(select count(*) from master) as per format=percent8.2 from new as a,master as b where a.new =* b.master group by a.new; quit;
Thanks for the reply.
%matched column thats not caluculated that i just want to show this type of output am expecting...
based on the output its just dividing the count/total count,
i am expecting is ther any method that will give probability of match?
Sorry. I don't know what you are talking about. How do you define that ' probability of match '? How to calculate ' probability of match' ? Gives us an example.
If you want all the obs of Company_New : data master; input master $80.; cards; Adamas Pharmaceuticals, Inc. Acacia Communications, Inc. McHugh Inc ; run; data new; input new $80.; cards; Adamas Pharmaceuticals, Inc. Adimas Pharmaceuticals, Inc. McHugh Inc Mc Hugh innc Mccley INC Acacia Communications ; run; proc sql; select new,count(master)/(select count(master) from master) as per format=percent8.2 from new as a left join master as b on a.new =* b.master group by a.new; quit;
Example of using edit distance to do some matching with function complev:
data master;
input master &:$upcase80.;
cards;
Adamas Pharmaceuticals, Inc.
Acacia Communications, Inc.
McHugh Inc
;
data new;
input new &:$upcase80.;
cards;
Adamas Pharmaceuticals, LLC
Adamas Pharmaceuticals, Inc.
Adimas Pharmaceuticals
McHugh Inc
Mc Hugh innc
Mccley INC
Acacia Communications
;
proc sql;
create table matches as
select
new, master,
complev(new, master, 8) as complev,
calculated complev / length(master) as distance
from
new inner join
master on substr(new,1,2) = substr(master,1,2)
where calculated complev < 8;
select * from matches
group by new
having distance = min(distance);
quit;
Note how I divided edit distance by string length to give a relative measure of spelling error. Could be useful to filter out some matches between short names.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.