Solved: Match

Anuz · Posted 07-06-2021 11:24 AM

Hi All,

Looking for best options to match values in two tables

I have one dataset that has the below data that has company names

data COMPANY_DATA;

input ENUM $ COMPANY_NAME $ ;

datalines;

D1234 COMPAX LTD

D1256 TEST NOTE LTD

D1345 LISTKK ENTERPRISES

D2234 ZIVOKA

D4534 LIBORD NUKA PVT

D7887 ZIMZUM COLLEGE

;

run;

I have another two datasets that has possible first names and second names

data firstnames;

fname $;

peter

sam

zivo

zimzumtu

run;

data secondnames;

sname $;

peterson

kane

bargh

anderson

run;

The first names and second names dataset is fairly.

What I want to highlight for each row in the COMPANY_DATA dataset is if there is rough match to a primary and/or second name in the company name partially or completely. For example there could be a primary name say mark and the company name has Marke in it..it should be a possible match but if the company name has Market Research then that is not a possible match. So basically as close to the primary name listed in the firstnames dataset.

What is the best option to achieve the above ? Thank you in advance.

Amir · Posted 07-07-2021 06:25 PM

Due to lack of time, this is not a full solution and there are probably better ways, but I amended the input data to reflect the company names you showed in your output and added an extra row to secondnames to make sure the matching process worked, as the original data in secondnames had no matches.

Extra processing will be required to remove the extra rows that are generated by the proc sql, but this should give you some ideas on how your data might be processed:

data COMPANY_DATA;
	input ENUM $5. COMPANY_NAME $25.;

	datalines;
D1234 COMPAX LTD
D1256 TEST NOTE LTD
D1345 LISTKK ENTERPRISES
D2234 ZIVOKA
D4534 NUKA ALEX PVT
D7887 ZIMZUM COLLEGE
;


data firstnames;
	input fname $;

	datalines;
peter
sam
zivoka
zimzum
alex
;


data secondnames;
	input sname $;

	datalines;
peterson
kane
bargh
anderson
note
;


proc sql noprint;
	create table
		combined
	as
	select distinct
		 c.*
		,ifc(find(company_name,fname,'it'),'Y','N') as fname_flag
		,find(company_name,fname,'it')              as fname_pos
		,ifc(find(company_name,sname,'it'),'Y','N') as sname_flag
		,find(company_name,sname,'it')              as sname_pos
		
	from
		 company_data c
		,firstnames
		,secondnames
	;
quit;

Thanks & kind regards,

Amir.

View solution in original post

Anuz · Posted 07-06-2021 01:11 PM

let me know if you need more clarity on my query. or if I need to explain it better

Amir · Posted 07-06-2021 02:31 PM

Hi @Anuz,

Thanks for supplying the data in data steps. For the benefit of all I've tried to make the code more readable and some of the steps needed tweaking (see further below).

Please supply another data step with datalines showing the resulting data you want to see in the output, based on the input data you have given.

data COMPANY_DATA;
	input ENUM $ COMPANY_NAME $ ;

	datalines;
D1234 COMPAX LTD
D1256 TEST NOTE LTD
D1345 LISTKK ENTERPRISES
D2234 ZIVOKA
D4534 LIBORD NUKA PVT
D7887 ZIMZUM COLLEGE
;


data firstnames;
	input fname $;

	datalines;
peter
sam
zivo
zimzumtu
;


data secondnames;
	input sname $;

	datalines;
peterson
kane
bargh
anderson
;

Thanks & kind regards,

Amir.

Anuz · Posted 07-06-2021 02:58 PM

Thank you @Amir

Something like the below as output would be good as it will be useful for the end user to understand the analysis. However I am open to any other suggestion of output.

this is on the assumption that in the firstnames dataset - zivoka , alex and zimzum are listed as accepted first names.

Hope this helps.

data COMPANY_DATA;
	input ENUM $5. COMPANY_NAME $25. FNAME_FLAG $1. FNAME_POS 8.;

	datalines;
D1234 COMPAX LTD              N   0 
D1256 TEST NOTE LTD           N   0
D1345 LISTKK ENTERPRISES      N   0 
D2234 ZIVOKA                  Y   1
D4534 NUKA ALEX PVT           Y   5  
D7887 ZIMZUM COLLEGE          Y   1
;

run;

Amir · Posted 07-06-2021 05:26 PM

You mention that:

this is on the assumption that in the firstnames dataset - zivoka , alex and zimzum are listed as accepted first names.

Does "accepted first names" mean the names on the firstnames data set? If not, can you please supply another data set with the accepted names or precisely define the rules of what is accepted.
"alex" does not appear in the input data you supplied for the firstnames data set, should it be on the input data set firstnames?
The closest I can see to "zivoka" on the firstnames data set is "zivo", does this mean the fname can be shorter than the name it matches with?
The closest I can see to "zimzum" on the firstnames data set is "zimzumtu", does this mean the fname can be longer than the name it matches with?
What is the maximum length the names can be different by?
Can the difference only be at the end of the name or can it be at the beginning or up to a specific number of characters anywhere in the fname?
If the input data in the firstnames data set needs to be corrected then please edit your original question and make the correction.

Thanks & kind regards,

Amir.

Anuz · Posted 07-06-2021 05:40 PM

data firstnames;
	input fname $;

	datalines;
peter
sam
zivoka
zimzum
alex
;


Hi @Amir . Please take the above for the firstnames dataset. 

You are right, accepted first names means the names in the firstnames dataset. 

The values in the firstnames dataset if found anywhere in the company dataset then that needs to be highlighted/identified by the code.
So for example if the company name is 
a. PoundAlex Ltd or 
b. AlexPound Ltd or 
c. Alex Ltd or 
d. Boxing Alex Ltd 
 in all the both the cases Alex needs to be identified. 

Hope I have managed to clarify all the questions you raised. Thank you

Amir · Posted 07-07-2021 06:25 PM

Due to lack of time, this is not a full solution and there are probably better ways, but I amended the input data to reflect the company names you showed in your output and added an extra row to secondnames to make sure the matching process worked, as the original data in secondnames had no matches.

Extra processing will be required to remove the extra rows that are generated by the proc sql, but this should give you some ideas on how your data might be processed:

data COMPANY_DATA;
	input ENUM $5. COMPANY_NAME $25.;

	datalines;
D1234 COMPAX LTD
D1256 TEST NOTE LTD
D1345 LISTKK ENTERPRISES
D2234 ZIVOKA
D4534 NUKA ALEX PVT
D7887 ZIMZUM COLLEGE
;


data firstnames;
	input fname $;

	datalines;
peter
sam
zivoka
zimzum
alex
;


data secondnames;
	input sname $;

	datalines;
peterson
kane
bargh
anderson
note
;


proc sql noprint;
	create table
		combined
	as
	select distinct
		 c.*
		,ifc(find(company_name,fname,'it'),'Y','N') as fname_flag
		,find(company_name,fname,'it')              as fname_pos
		,ifc(find(company_name,sname,'it'),'Y','N') as sname_flag
		,find(company_name,sname,'it')              as sname_pos
		
	from
		 company_data c
		,firstnames
		,secondnames
	;
quit;

Thanks & kind regards,

Amir.

Anuz · Posted 07-08-2021 08:02 AM

Thank you @Amir .

Appreciate your time and patience. This works for me. I was able to use your method and adapt it to get an output that satisfies my requirement. Thank you again. Good day.

Patrick · Posted 07-08-2021 03:33 AM

If you have the required SAS module licensed (and properly configured...) then have a look into the DQ functions like DQMATCH().

https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/dqclref/p09nffezbjyj4on11oblz77aq1x6.htm

These DQ functions allow for example to tokenize a string (=split it up into its parts like company name and address components) or to create match codes which you then can use to detect similar names that are just spelled differently.

These functions use a knowledge base (set of data and rules) and though for example a lot of company names and spelling variations are already pre-defined to then result in the same match code. Match codes are like a cluster ID - and whatever is in the same cluster has a high probability to be the same even if spelled differently.

Anuz · Posted 07-08-2021 08:04 AM

Thank you @Patrick . I was not aware of this function. I am definitely going to read, understand and test this for my learning and use it. Appreciate it.

Match

Re: Match

Re: Match

Re: Match

Re: Match

Re: Match

Re: Match

Re: Match

Re: Match

Re: Match

Re: Match

SAS Innovate 2025: Call for Content

Classroom Training Available!