BookmarkSubscribeRSS Feed
Xusheng
Obsidian | Level 7

Hi everyone, I'm matching two files by company names with fuzzy matching method. The first data is from Compustat and the second is from the customer, which will be matched against the first one. Both data come with gvkey, I used the following code for fuzzy matching:

DATA MATCH;																/*Fuzzy matching based on 50 difference*/
LENGTH POSSIBLE_MATCH $3;
	SET COMP;
		TMP1 = SOUNDEX(conm);
		POSSIBLE_MATCH = 'No';
		DO i = 1 TO nobs;
	SET GVKEY(RENAME = (conm = conm2)) point = i nobs = nobs;
		TMP2 = SOUNDEX(conm2);
		DIF = COMPGED(TMP1, TMP2,'iL');
	IF DIF <= 50 THEN DO;
		POSSIBLE_MATCH = 'Yes';
		OUTPUT;
		END;
	END;
	IF POSSIBLE_MATCH = 'No' THEN DO;
		conm2 = conm;
		OUTPUT;
	END;
DROP i TMP1 TMP2;
RUN; 

where comp is first data, gvkey is the second one. 'conm' stands for the company name. 

However, when I check the result table the outcome is kind of "out of control". Here I screenshot the outcome of "yes" vs "no" as the illustration.

yes.pngno.png

As you may notice, the result from "yes" does not make sense to some extent whereas the "no" result has some acceptable outcome. 

Therefore, my questions are as following:

1. Is there any way to improve the outcome?

2. How to ignore the "LTD, INC", etc, inside the company name, does this method help the improving?

I appreciate any comment or suggestion. Thank you.

6 REPLIES 6
art297
Opal | Level 21

If you post your have, got and want datasets0, along with the code you've run, all in the form of datasteps using the {i} icon, you will improve your chances of anyone responding.

 

Art, CEO, AnalystFinder.com

Xusheng
Obsidian | Level 7

Hi art297, thank you for your reminding. I will provide the sample in the form that is similar to the sample in another fuzzy match post.

The code I import the data is:

 

PROC IMPORT OUT= WORK.COMP /*Import data from Compustat, the same code for gvkey data*/ 
DATAFILE= "E:\one_drive\........\Compustat.csv" 
DBMS=CSV REPLACE; 
GETNAMES=YES;
DATAROW=2; 
RUN;

The code for fuzzy matching is:

DATA MATCH;																/*Fuzzy matching based on 50 difference*/
LENGTH POSSIBLE_MATCH $3;
	SET COMP;
		TMP1 = SOUNDEX(conm);
		POSSIBLE_MATCH = 'No';
		DO i = 1 TO nobs;
	SET GVKEY(RENAME = (conm = conm2)) point = i nobs = nobs;
		TMP2 = SOUNDEX(conm2);
		DIF = COMPGED(TMP1, TMP2,'iL');
	IF DIF <= 50 THEN DO;
		POSSIBLE_MATCH = 'Yes';
		OUTPUT;
		END;
	END;
	IF POSSIBLE_MATCH = 'No' THEN DO;
		conm2 = conm;
		OUTPUT;
	END;
DROP i TMP1 TMP2;
RUN;

The second data I named as 'gvkey'  while the first one is 'comp'. 

I attached two sample data as well. The two sample I attached is simplified for matching purpose. 

ballardw
Super User

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

Xusheng
Obsidian | Level 7

Thank you Ballardw, I will follow the instruction and use it then.

Patrick
Opal | Level 21

@Xusheng

...or attach .csv like in your Proc Import code and not .xlsx and no one will have security concerns downloading such attachments.

Xusheng
Obsidian | Level 7
Thank you for the information, Patrick, didn't know that before.

##- Please type your reply above this line. Simple formatting, no
attachments. -##

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1455 views
  • 0 likes
  • 4 in conversation