SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Improving fuzzy match quality

Reply
Contributor
Posts: 49

Improving fuzzy match quality

Hi everyone, I'm matching two files by company names with fuzzy matching method. The first data is from Compustat and the second is from the customer, which will be matched against the first one. Both data come with gvkey, I used the following code for fuzzy matching:

DATA MATCH;																/*Fuzzy matching based on 50 difference*/
LENGTH POSSIBLE_MATCH $3;
	SET COMP;
		TMP1 = SOUNDEX(conm);
		POSSIBLE_MATCH = 'No';
		DO i = 1 TO nobs;
	SET GVKEY(RENAME = (conm = conm2)) point = i nobs = nobs;
		TMP2 = SOUNDEX(conm2);
		DIF = COMPGED(TMP1, TMP2,'iL');
	IF DIF <= 50 THEN DO;
		POSSIBLE_MATCH = 'Yes';
		OUTPUT;
		END;
	END;
	IF POSSIBLE_MATCH = 'No' THEN DO;
		conm2 = conm;
		OUTPUT;
	END;
DROP i TMP1 TMP2;
RUN; 

where comp is first data, gvkey is the second one. 'conm' stands for the company name. 

However, when I check the result table the outcome is kind of "out of control". Here I screenshot the outcome of "yes" vs "no" as the illustration.

yes.pngno.png

As you may notice, the result from "yes" does not make sense to some extent whereas the "no" result has some acceptable outcome. 

Therefore, my questions are as following:

1. Is there any way to improve the outcome?

2. How to ignore the "LTD, INC", etc, inside the company name, does this method help the improving?

I appreciate any comment or suggestion. Thank you.

PROC Star
Posts: 7,471

Re: Improving fuzzy match quality

If you post your have, got and want datasets0, along with the code you've run, all in the form of datasteps using the {i} icon, you will improve your chances of anyone responding.

 

Art, CEO, AnalystFinder.com

Contributor
Posts: 49

Re: Improving fuzzy match quality

Hi art297, thank you for your reminding. I will provide the sample in the form that is similar to the sample in another fuzzy match post.

The code I import the data is:

 

PROC IMPORT OUT= WORK.COMP /*Import data from Compustat, the same code for gvkey data*/ 
DATAFILE= "E:\one_drive\........\Compustat.csv" 
DBMS=CSV REPLACE; 
GETNAMES=YES;
DATAROW=2; 
RUN;

The code for fuzzy matching is:

DATA MATCH;																/*Fuzzy matching based on 50 difference*/
LENGTH POSSIBLE_MATCH $3;
	SET COMP;
		TMP1 = SOUNDEX(conm);
		POSSIBLE_MATCH = 'No';
		DO i = 1 TO nobs;
	SET GVKEY(RENAME = (conm = conm2)) point = i nobs = nobs;
		TMP2 = SOUNDEX(conm2);
		DIF = COMPGED(TMP1, TMP2,'iL');
	IF DIF <= 50 THEN DO;
		POSSIBLE_MATCH = 'Yes';
		OUTPUT;
		END;
	END;
	IF POSSIBLE_MATCH = 'No' THEN DO;
		conm2 = conm;
		OUTPUT;
	END;
DROP i TMP1 TMP2;
RUN;

The second data I named as 'gvkey'  while the first one is 'comp'. 

I attached two sample data as well. The two sample I attached is simplified for matching purpose. 

Super User
Posts: 11,343

Re: Improving fuzzy match quality

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

Contributor
Posts: 49

Re: Improving fuzzy match quality

Thank you Ballardw, I will follow the instruction and use it then.

Respected Advisor
Posts: 4,173

Re: Improving fuzzy match quality

[ Edited ]

@Xusheng

...or attach .csv like in your Proc Import code and not .xlsx and no one will have security concerns downloading such attachments.

Contributor
Posts: 49

Re: Improving fuzzy match quality

Thank you for the information, Patrick, didn't know that before.

##- Please type your reply above this line. Simple formatting, no
attachments. -##
Ask a Question
Discussion stats
  • 6 replies
  • 177 views
  • 0 likes
  • 4 in conversation