BookmarkSubscribeRSS Feed
aaou
Obsidian | Level 7

Hi,

I have two datasets:

 ds1

Advisor
JP_Morgan
JP_Morgan_Chase___Co
JP_Morgan_Secs__Asia
JP_Morgan_Securities
Lehman_Brothers
Lehman_Brothers_Kuhn
Laidlaw___Co__UK__Lt
Laidlaw_Adams___Peck
Laidlaw_Ansbacher
Laidlaw_Equities_Inc
Macquarie_Capital__U
Macquarie_Capital_Ad
Macquarie_Capital_Gr
Macquarie_Capital_Pa

 

ds2

Advisor
JP_Morgan
Laidlaw
Macquarie_Capital

Lehman_Brothers

 

 

When I run the code below using soundex and compged ( I set a threshold level of 100 for the compged) I get the outcome given below the code: 

data advisors;
length possible_match $3;
set ds1;
tmp1=soundex(advisor);
possible_match='No';
do i=1 to nobs;
	set ds2(rename=(advisor=advisor2)) point=i nobs=nobs;
	tmp2=soundex(advisor2);
	dif=compged(tmp1,tmp2);
	if dif<=100 then do;
		possible_match='Yes';
		drop i tmp1 tmp2;
		output;
		end;
	end;
if possible_match='No' then do;
	call missing(advisor2);
	output;
	end;
run;

Output:

possible_matchAdvisoradvisor2dif
YesJP_MorganJP_Morgan0
NoJP_Morgan_Chase___Co 590
NoJP_Morgan_Secs__Asia 590
NoJP_Morgan_Securities 690
YesLehman_BrothersLehman_Brothers0
YesLehman_Brothers_KuhnLehman_Brothers70
NoLaidlaw___Co__UK__Lt 610
NoLaidlaw_Adams___Peck 600
NoLaidlaw_Ansbacher 520
NoLaidlaw_Equities_Inc 600
YesMacquarie_Capital__UMacquarie_Capital0
YesMacquarie_Capital_AdMacquarie_Capital50
YesMacquarie_Capital_GrMacquarie_Capital100
YesMacquarie_Capital_PaMacquarie_Capital50

 

But when I change the threshold score to a different value, say 500 then the compged scores (given by the diff column) changes entirely! Why is this? How can I prevent this from happening?

 

Also in the third column, rather than getting blanks ( if the compegd score is greater than the threshold) if I want to get the same name in the second column how should the code be amended?

 

Any help would be greatly appreciated! 🙂

 

Thanks!

 

 

3 REPLIES 3
Oligolas
Barite | Level 11

Hi aaou,

 

not the compged score is changing, only the records you output

and increasing the threshold as well increases the number of records you output.

 

In your proceeding, if you have a match in ds2, you output the dif, that's fine.

But if you have no match you always output the dif with 'Lehman_Brothers' of ds2, not sure if this dif is of interest.

 

Creating a cartesian product of ds1 and ds2 could be a good method to control the output.

 

Cheers

 

________________________

- Cheers -

aaou
Obsidian | Level 7

Hi Oligolas,

Thank you very much for showing me what's going on...

 

Will you be able to show me how I can change my code ? ( And no, the dif you have mentione is not needed).

 

Thank you!!

 

Best regards,

aaou

Oligolas
Barite | Level 11

Hm ok.. I split the tasks in several steps so you can control the output.

I added an order variable to keep your order along the tasks.

data ds1;
length Advisor $50;
input Advisor $;
datalines;
JP_Morgan
JP_Morgan_Chase___Co
JP_Morgan_Secs__Asia
JP_Morgan_Securities
Lehman_Brothers
Lehman_Brothers_Kuhn
Laidlaw___Co__UK__Lt
Laidlaw_Adams___Peck
Laidlaw_Ansbacher
Laidlaw_Equities_Inc
Macquarie_Capital__U
Macquarie_Capital_Ad
Macquarie_Capital_Gr
Macquarie_Capital_Pa
;
run;
data ds1;
   set ds1;
   order+1;
run;
 

data ds2;
length Advisor $50;
input Advisor $;
datalines;
JP_Morgan
Laidlaw
Macquarie_Capital
Lehman_Brothers
;
run;
data ds2;
   set ds2;
   order+1;
run;

PROC SQL;
   CREATE TABLE ds3 AS
      SELECT a.*,
              b.advisor AS advisor2
      FROM ds1 a,
           ds2 b
      ORDER BY a.order,b.order
   ;
QUIT;

data ds4;
   set ds3;
   length possible_match $3 advisor3 $50;
   retain FlagOutput .;
   by order;
   if first.order then FlagOutput=.;

   tmp1=soundex(advisor);
   tmp2=soundex(advisor2);
   dif=compged(tmp1,tmp2);
   possible_match='No';
   advisor3=advisor2;

   if dif<=100 then possible_match='Yes';
   else call missing(advisor3);

   if possible_match='Yes' then FlagOutput=1;
   else if FlagOutput eq 1 then FlagOutput=0;*case if one previous row matched;
   if last.order and missing(FlagOutput) then FlagOutput=1;

   drop tmp1 tmp2;
run;

data ds5;
   set ds4(drop=advisor2);
   rename advisor3=advisor2;
   if FlagOutput ne 1 or missing(advisor3) then dif=.;
   if FlagOutput eq 1 then output;
   drop FlagOutput order;
run;
________________________

- Cheers -

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1100 views
  • 0 likes
  • 2 in conversation