DATA Step, Macro, Functions and more

Fuzzy match with soundex and compged

Reply
Contributor
Posts: 40

Fuzzy match with soundex and compged

[ Edited ]

Hi,

I have two datasets:

 ds1

Advisor
JP_Morgan
JP_Morgan_Chase___Co
JP_Morgan_Secs__Asia
JP_Morgan_Securities
Lehman_Brothers
Lehman_Brothers_Kuhn
Laidlaw___Co__UK__Lt
Laidlaw_Adams___Peck
Laidlaw_Ansbacher
Laidlaw_Equities_Inc
Macquarie_Capital__U
Macquarie_Capital_Ad
Macquarie_Capital_Gr
Macquarie_Capital_Pa

 

ds2

Advisor
JP_Morgan
Laidlaw
Macquarie_Capital

Lehman_Brothers

 

 

 

When I run the code below using soundex and compged ( I set a threshold level of 100 for the compged) I get the outcome given below the code: 

 

data advisors;
set ds1;
tmp1=soundex(advisor);
do i=1 to nobs;
set ds2(rename=(advisor=advisor2)) point=i nobs=nobs;
tmp2=soundex(advisor2);
dif=compged(tmp1,tmp2);
if dif<=100 then do;
possible_match='Yes';
drop i tmp1 tmp2;
output;
end;
end;
run;

 

Advisoradvisor2difpossible_match
JP_MorganJP_Morgan0Yes
Lehman_BrothersLehman_Brothers0Yes
Lehman_Brothers_KuhnLehman_Brothers70Yes
Macquarie_Capital__UMacquarie_Capital0Yes
Macquarie_Capital_AdMacquarie_Capital50Yes
Macquarie_Capital_GrMacquarie_Capital100Yes
Macquarie_Capital_PaMacquarie_Capital50

Yes

 

However, say I want an output table which gives me all the 14 observations in ds1, gives 'yes' for possible match and the respective match from ds2 in the second column if the compegd is below the threshold and gives 'no' for possible match and a blank for the second column if the threshold is surpassed. 

 

Basically I want to know how each observation in ds1 fairs with the matching and not to lose the observations because there wasn't a close match. 

 

Thank you!

Respected Advisor
Posts: 4,925

Re: Fuzzy match with soundex and compged

Add a bit more code:

 

data advisors;
length possible_match $3;
set ds1;
tmp1=soundex(advisor);
possible_match='No';
do i=1 to nobs;
	set ds2(rename=(advisor=advisor2)) point=i nobs=nobs;
	tmp2=soundex(advisor2);
	dif=compged(tmp1,tmp2);
	if dif<=100 then do;
		possible_match='Yes';
		drop i tmp1 tmp2;
		output;
		end;
	end;
if possible_match='No' then do;
	call missing(advisor2);
	output;
	end;
run;
PG
Contributor
Posts: 40

Re: Fuzzy match with soundex and compged

Thanks a lot!

 

I got a few questions:

 1) Suppose instead of returning blank values ( when the compegd score is greater than the threshold value) I want the original name in DS1 to appear instead of the blank value; how should the code be amended?

 

2) I noticed that when I change the threshold score from 100 to say 500, the compegd scores change completely....why is that?

 

3) what is the meaning of this part of the code

length possible_match $3;    

 

I'm new to coding so I truly appreciate your help! Thank you.

 

Respected Advisor
Posts: 4,925

Re: Fuzzy match with soundex and compged

1) Replace 

call missing(advisor2);

with

advisor2 = advisor; 

 

2) I don't know. You could submit this as a new question to the Forum.

 

3) Unless given explicitly, SAS determines the length of a character variable at the first assignment encountered during compilation. The length statement is to ensure that the variable possible_match isn't given a length of only two when the variable is first assigned the value "No".

PG
Contributor
Posts: 40

Re: Fuzzy match with soundex and compged

Thank you.

 

Regarding your answer to questions (1); I thought that would work too, but when I do that the output I get generates blanks for the first column of advisors and the table gets messed up. I have attached a screenshot of the output. 


sas.png
Respected Advisor
Posts: 4,925

Re: Fuzzy match with soundex and compged

Please post the new code.

PG
Contributor
Posts: 40

Re: Fuzzy match with soundex and compged

data advisors;
length possible_match $3;
set ds1;
tmp1=soundex(advisor);
possible_match='No';
do i=1 to nobs;
	set ds2(rename=(advisor=advisor2)) point=i nobs=nobs;
	tmp2=soundex(advisor2);
	dif=compged(tmp1,tmp2);
	if dif<=100 then do;
		possible_match='Yes';
		drop i tmp1 tmp2;
		output;
		end;
	end;
if possible_match='No' then do;
	call missing(advisor);
	output;
	end;
run;

 

I also tried this as I wasn't quite sure:

 

data advisors;
length possible_match $3;
set ds1;
tmp1=soundex(advisor);
possible_match='No';
do i=1 to nobs;
	set ds2(rename=(advisor=advisor2)) point=i nobs=nobs;
	tmp2=soundex(advisor2);
	dif=compged(tmp1,tmp2);
	if dif<=100 then do;
		possible_match='Yes';
		drop i tmp1 tmp2;
		output;
		end;
	end;
if possible_match='No' then do;
	call missing(advisor2=advisor);
	output;
	end;
run;

Both didn't return what I wanted. :-( 

Respected Advisor
Posts: 4,925

Re: Fuzzy match with soundex and compged

My suggestion should have led to:

 

data advisors;
length possible_match $3;
set ds1;
tmp1 = soundex(advisor);
possible_match = 'No';
do i = 1 to nobs;
	set ds2(rename = (advisor = advisor2)) point = i nobs = nobs;
	tmp2 = soundex(advisor2);
	dif = compged(tmp1, tmp2);
	if dif <= 100 then do;
		possible_match = 'Yes';
		output;
		end;
	end;
if possible_match = 'No' then do;
	advisor2 = advisor;
	output;
	end;
drop i tmp1 tmp2;
run;
PG
Ask a Question
Discussion stats
  • 7 replies
  • 379 views
  • 0 likes
  • 2 in conversation