Hi All,
Looking for best options to match values in two tables
I have one dataset that has the below data that has company names
data COMPANY_DATA;
input ENUM $ COMPANY_NAME $ ;
datalines;
D1234 COMPAX LTD
D1256 TEST NOTE LTD
D1345 LISTKK ENTERPRISES
D2234 ZIVOKA
D4534 LIBORD NUKA PVT
D7887 ZIMZUM COLLEGE
;
run;
I have another two datasets that has possible first names and second names
data firstnames;
fname $;
peter
sam
zivo
zimzumtu
run;
data secondnames;
sname $;
peterson
kane
bargh
anderson
run;
The first names and second names dataset is fairly.
What I want to highlight for each row in the COMPANY_DATA dataset is if there is rough match to a primary and/or second name in the company name partially or completely. For example there could be a primary name say mark and the company name has Marke in it..it should be a possible match but if the company name has Market Research then that is not a possible match. So basically as close to the primary name listed in the firstnames dataset.
What is the best option to achieve the above ? Thank you in advance.
Due to lack of time, this is not a full solution and there are probably better ways, but I amended the input data to reflect the company names you showed in your output and added an extra row to secondnames to make sure the matching process worked, as the original data in secondnames had no matches.
Extra processing will be required to remove the extra rows that are generated by the proc sql, but this should give you some ideas on how your data might be processed:
data COMPANY_DATA;
input ENUM $5. COMPANY_NAME $25.;
datalines;
D1234 COMPAX LTD
D1256 TEST NOTE LTD
D1345 LISTKK ENTERPRISES
D2234 ZIVOKA
D4534 NUKA ALEX PVT
D7887 ZIMZUM COLLEGE
;
data firstnames;
input fname $;
datalines;
peter
sam
zivoka
zimzum
alex
;
data secondnames;
input sname $;
datalines;
peterson
kane
bargh
anderson
note
;
proc sql noprint;
create table
combined
as
select distinct
c.*
,ifc(find(company_name,fname,'it'),'Y','N') as fname_flag
,find(company_name,fname,'it') as fname_pos
,ifc(find(company_name,sname,'it'),'Y','N') as sname_flag
,find(company_name,sname,'it') as sname_pos
from
company_data c
,firstnames
,secondnames
;
quit;
Thanks & kind regards,
Amir.
let me know if you need more clarity on my query. or if I need to explain it better
Hi @Anuz,
Thanks for supplying the data in data steps. For the benefit of all I've tried to make the code more readable and some of the steps needed tweaking (see further below).
Please supply another data step with datalines showing the resulting data you want to see in the output, based on the input data you have given.
data COMPANY_DATA;
input ENUM $ COMPANY_NAME $ ;
datalines;
D1234 COMPAX LTD
D1256 TEST NOTE LTD
D1345 LISTKK ENTERPRISES
D2234 ZIVOKA
D4534 LIBORD NUKA PVT
D7887 ZIMZUM COLLEGE
;
data firstnames;
input fname $;
datalines;
peter
sam
zivo
zimzumtu
;
data secondnames;
input sname $;
datalines;
peterson
kane
bargh
anderson
;
Thanks & kind regards,
Amir.
Thank you @Amir
Something like the below as output would be good as it will be useful for the end user to understand the analysis. However I am open to any other suggestion of output.
this is on the assumption that in the firstnames dataset - zivoka , alex and zimzum are listed as accepted first names.
Hope this helps.
data COMPANY_DATA;
input ENUM $5. COMPANY_NAME $25. FNAME_FLAG $1. FNAME_POS 8.;
datalines;
D1234 COMPAX LTD N 0
D1256 TEST NOTE LTD N 0
D1345 LISTKK ENTERPRISES N 0
D2234 ZIVOKA Y 1
D4534 NUKA ALEX PVT Y 5
D7887 ZIMZUM COLLEGE Y 1
;
run;
You mention that:
this is on the assumption that in the firstnames dataset - zivoka , alex and zimzum are listed as accepted first names.
Thanks & kind regards,
Amir.
data firstnames;
input fname $;
datalines;
peter
sam
zivoka
zimzum
alex
;
Hi @Amir . Please take the above for the firstnames dataset.
You are right, accepted first names means the names in the firstnames dataset.
The values in the firstnames dataset if found anywhere in the company dataset then that needs to be highlighted/identified by the code.
So for example if the company name is
a. PoundAlex Ltd or
b. AlexPound Ltd or
c. Alex Ltd or
d. Boxing Alex Ltd
in all the both the cases Alex needs to be identified.
Hope I have managed to clarify all the questions you raised. Thank you
Due to lack of time, this is not a full solution and there are probably better ways, but I amended the input data to reflect the company names you showed in your output and added an extra row to secondnames to make sure the matching process worked, as the original data in secondnames had no matches.
Extra processing will be required to remove the extra rows that are generated by the proc sql, but this should give you some ideas on how your data might be processed:
data COMPANY_DATA;
input ENUM $5. COMPANY_NAME $25.;
datalines;
D1234 COMPAX LTD
D1256 TEST NOTE LTD
D1345 LISTKK ENTERPRISES
D2234 ZIVOKA
D4534 NUKA ALEX PVT
D7887 ZIMZUM COLLEGE
;
data firstnames;
input fname $;
datalines;
peter
sam
zivoka
zimzum
alex
;
data secondnames;
input sname $;
datalines;
peterson
kane
bargh
anderson
note
;
proc sql noprint;
create table
combined
as
select distinct
c.*
,ifc(find(company_name,fname,'it'),'Y','N') as fname_flag
,find(company_name,fname,'it') as fname_pos
,ifc(find(company_name,sname,'it'),'Y','N') as sname_flag
,find(company_name,sname,'it') as sname_pos
from
company_data c
,firstnames
,secondnames
;
quit;
Thanks & kind regards,
Amir.
Thank you @Amir .
Appreciate your time and patience. This works for me. I was able to use your method and adapt it to get an output that satisfies my requirement. Thank you again. Good day.
If you have the required SAS module licensed (and properly configured...) then have a look into the DQ functions like DQMATCH().
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/dqclref/p09nffezbjyj4on11oblz77aq1x6.htm
These DQ functions allow for example to tokenize a string (=split it up into its parts like company name and address components) or to create match codes which you then can use to detect similar names that are just spelled differently.
These functions use a knowledge base (set of data and rules) and though for example a lot of company names and spelling variations are already pre-defined to then result in the same match code. Match codes are like a cluster ID - and whatever is in the same cluster has a high probability to be the same even if spelled differently.
Thank you @Patrick . I was not aware of this function. I am definitely going to read, understand and test this for my learning and use it. Appreciate it.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.