Hello Guys,
I have 2 datssets i wnat to mtach the new dataset with the old one and gives %of match like shown below
Company master file like this :
Company_Master |
Adamas Pharmaceuticals, Inc. |
Acacia Communications, Inc. |
McHugh Inc |
I will recive new compay names frequently like this
Company_New |
Adamas Pharmaceuticals, Inc. |
Adimas Pharmaceuticals, Inc. |
McHugh Inc |
Mc Hugh innc |
Mccley INC |
Acacia Communications |
I want output like below: match with the master file and gives the % match with the master file names..any ideas i am looking at some compare functions but that not giving satisfactory results. any thoughts?
Company_New | % Matched |
Adamas Pharmaceuticals LLC | 90 |
Adimas Pharmaceuticals | 80 |
Adamas Pharmaceuticals, Inc. | 100 |
McHugh Inc | 100 |
Mc Hugh innc | 90 |
Mccley INC | 0 |
Acacia Communications | 0 |
Thanks.
How do you computer that %Matched . Check the function about spell distance and edit distance Like : spedis(), complev() .. data master; input master $80.; cards; Adamas Pharmaceuticals, Inc. Acacia Communications, Inc. McHugh Inc ; run; data new; input new $80.; cards; Adamas Pharmaceuticals, Inc. Adimas Pharmaceuticals, Inc. McHugh Inc Mc Hugh innc Mccley INC Acacia Communications ; run; proc sql; select new,count(*)/(select count(*) from master) as per format=percent8.2 from new as a,master as b where a.new =* b.master group by a.new; quit;
Thanks for the reply.
%matched column thats not caluculated that i just want to show this type of output am expecting...
based on the output its just dividing the count/total count,
i am expecting is ther any method that will give probability of match?
Sorry. I don't know what you are talking about. How do you define that ' probability of match '? How to calculate ' probability of match' ? Gives us an example.
If you want all the obs of Company_New : data master; input master $80.; cards; Adamas Pharmaceuticals, Inc. Acacia Communications, Inc. McHugh Inc ; run; data new; input new $80.; cards; Adamas Pharmaceuticals, Inc. Adimas Pharmaceuticals, Inc. McHugh Inc Mc Hugh innc Mccley INC Acacia Communications ; run; proc sql; select new,count(master)/(select count(master) from master) as per format=percent8.2 from new as a left join master as b on a.new =* b.master group by a.new; quit;
Example of using edit distance to do some matching with function complev:
data master;
input master &:$upcase80.;
cards;
Adamas Pharmaceuticals, Inc.
Acacia Communications, Inc.
McHugh Inc
;
data new;
input new &:$upcase80.;
cards;
Adamas Pharmaceuticals, LLC
Adamas Pharmaceuticals, Inc.
Adimas Pharmaceuticals
McHugh Inc
Mc Hugh innc
Mccley INC
Acacia Communications
;
proc sql;
create table matches as
select
new, master,
complev(new, master, 8) as complev,
calculated complev / length(master) as distance
from
new inner join
master on substr(new,1,2) = substr(master,1,2)
where calculated complev < 8;
select * from matches
group by new
having distance = min(distance);
quit;
Note how I divided edit distance by string length to give a relative measure of spelling error. Could be useful to filter out some matches between short names.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.