Hello Guys,
I have 2 datssets i wnat to mtach the new dataset with the old one and gives %of match like shown below
Company master file like this :
| Company_Master |
| Adamas Pharmaceuticals, Inc. |
| Acacia Communications, Inc. |
| McHugh Inc |
I will recive new compay names frequently like this
| Company_New |
| Adamas Pharmaceuticals, Inc. |
| Adimas Pharmaceuticals, Inc. |
| McHugh Inc |
| Mc Hugh innc |
| Mccley INC |
| Acacia Communications |
I want output like below: match with the master file and gives the % match with the master file names..any ideas i am looking at some compare functions but that not giving satisfactory results. any thoughts?
| Company_New | % Matched |
| Adamas Pharmaceuticals LLC | 90 |
| Adimas Pharmaceuticals | 80 |
| Adamas Pharmaceuticals, Inc. | 100 |
| McHugh Inc | 100 |
| Mc Hugh innc | 90 |
| Mccley INC | 0 |
| Acacia Communications | 0 |
Thanks.
How do you computer that %Matched . Check the function about spell distance and edit distance Like : spedis(), complev() .. data master; input master $80.; cards; Adamas Pharmaceuticals, Inc. Acacia Communications, Inc. McHugh Inc ; run; data new; input new $80.; cards; Adamas Pharmaceuticals, Inc. Adimas Pharmaceuticals, Inc. McHugh Inc Mc Hugh innc Mccley INC Acacia Communications ; run; proc sql; select new,count(*)/(select count(*) from master) as per format=percent8.2 from new as a,master as b where a.new =* b.master group by a.new; quit;
Thanks for the reply.
%matched column thats not caluculated that i just want to show this type of output am expecting...
based on the output its just dividing the count/total count,
i am expecting is ther any method that will give probability of match?
Sorry. I don't know what you are talking about. How do you define that ' probability of match '? How to calculate ' probability of match' ? Gives us an example.
If you want all the obs of Company_New : data master; input master $80.; cards; Adamas Pharmaceuticals, Inc. Acacia Communications, Inc. McHugh Inc ; run; data new; input new $80.; cards; Adamas Pharmaceuticals, Inc. Adimas Pharmaceuticals, Inc. McHugh Inc Mc Hugh innc Mccley INC Acacia Communications ; run; proc sql; select new,count(master)/(select count(master) from master) as per format=percent8.2 from new as a left join master as b on a.new =* b.master group by a.new; quit;
Example of using edit distance to do some matching with function complev:
data master;
input master &:$upcase80.;
cards;
Adamas Pharmaceuticals, Inc.
Acacia Communications, Inc.
McHugh Inc
;
data new;
input new &:$upcase80.;
cards;
Adamas Pharmaceuticals, LLC
Adamas Pharmaceuticals, Inc.
Adimas Pharmaceuticals
McHugh Inc
Mc Hugh innc
Mccley INC
Acacia Communications
;
proc sql;
create table matches as
select
new, master,
complev(new, master, 8) as complev,
calculated complev / length(master) as distance
from
new inner join
master on substr(new,1,2) = substr(master,1,2)
where calculated complev < 8;
select * from matches
group by new
having distance = min(distance);
quit;
Note how I divided edit distance by string length to give a relative measure of spelling error. Could be useful to filter out some matches between short names.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.