Hi,
I have two datasets:
ds1
Advisor |
JP_Morgan |
JP_Morgan_Chase___Co |
JP_Morgan_Secs__Asia |
JP_Morgan_Securities |
Lehman_Brothers |
Lehman_Brothers_Kuhn |
Laidlaw___Co__UK__Lt |
Laidlaw_Adams___Peck |
Laidlaw_Ansbacher |
Laidlaw_Equities_Inc |
Macquarie_Capital__U |
Macquarie_Capital_Ad |
Macquarie_Capital_Gr |
Macquarie_Capital_Pa |
ds2
Advisor |
JP_Morgan |
Laidlaw |
Macquarie_Capital |
Lehman_Brothers |
When I run the code below using soundex and compged ( I set a threshold level of 100 for the compged) I get the outcome given below the code:
data advisors;
length possible_match $3;
set ds1;
tmp1=soundex(advisor);
possible_match='No';
do i=1 to nobs;
set ds2(rename=(advisor=advisor2)) point=i nobs=nobs;
tmp2=soundex(advisor2);
dif=compged(tmp1,tmp2);
if dif<=100 then do;
possible_match='Yes';
drop i tmp1 tmp2;
output;
end;
end;
if possible_match='No' then do;
call missing(advisor2);
output;
end;
run;
Output:
possible_match | Advisor | advisor2 | dif |
Yes | JP_Morgan | JP_Morgan | 0 |
No | JP_Morgan_Chase___Co | 590 | |
No | JP_Morgan_Secs__Asia | 590 | |
No | JP_Morgan_Securities | 690 | |
Yes | Lehman_Brothers | Lehman_Brothers | 0 |
Yes | Lehman_Brothers_Kuhn | Lehman_Brothers | 70 |
No | Laidlaw___Co__UK__Lt | 610 | |
No | Laidlaw_Adams___Peck | 600 | |
No | Laidlaw_Ansbacher | 520 | |
No | Laidlaw_Equities_Inc | 600 | |
Yes | Macquarie_Capital__U | Macquarie_Capital | 0 |
Yes | Macquarie_Capital_Ad | Macquarie_Capital | 50 |
Yes | Macquarie_Capital_Gr | Macquarie_Capital | 100 |
Yes | Macquarie_Capital_Pa | Macquarie_Capital | 50 |
But when I change the threshold score to a different value, say 500 then the compged scores (given by the diff column) changes entirely! Why is this? How can I prevent this from happening?
Also in the third column, rather than getting blanks ( if the compegd score is greater than the threshold) if I want to get the same name in the second column how should the code be amended?
Any help would be greatly appreciated! 🙂
Thanks!
Hi aaou,
not the compged score is changing, only the records you output
and increasing the threshold as well increases the number of records you output.
In your proceeding, if you have a match in ds2, you output the dif, that's fine.
But if you have no match you always output the dif with 'Lehman_Brothers' of ds2, not sure if this dif is of interest.
Creating a cartesian product of ds1 and ds2 could be a good method to control the output.
Cheers
- Cheers -
Hi Oligolas,
Thank you very much for showing me what's going on...
Will you be able to show me how I can change my code ? ( And no, the dif you have mentione is not needed).
Thank you!!
Best regards,
aaou
Hm ok.. I split the tasks in several steps so you can control the output.
I added an order variable to keep your order along the tasks.
data ds1;
length Advisor $50;
input Advisor $;
datalines;
JP_Morgan
JP_Morgan_Chase___Co
JP_Morgan_Secs__Asia
JP_Morgan_Securities
Lehman_Brothers
Lehman_Brothers_Kuhn
Laidlaw___Co__UK__Lt
Laidlaw_Adams___Peck
Laidlaw_Ansbacher
Laidlaw_Equities_Inc
Macquarie_Capital__U
Macquarie_Capital_Ad
Macquarie_Capital_Gr
Macquarie_Capital_Pa
;
run;
data ds1;
set ds1;
order+1;
run;
data ds2;
length Advisor $50;
input Advisor $;
datalines;
JP_Morgan
Laidlaw
Macquarie_Capital
Lehman_Brothers
;
run;
data ds2;
set ds2;
order+1;
run;
PROC SQL;
CREATE TABLE ds3 AS
SELECT a.*,
b.advisor AS advisor2
FROM ds1 a,
ds2 b
ORDER BY a.order,b.order
;
QUIT;
data ds4;
set ds3;
length possible_match $3 advisor3 $50;
retain FlagOutput .;
by order;
if first.order then FlagOutput=.;
tmp1=soundex(advisor);
tmp2=soundex(advisor2);
dif=compged(tmp1,tmp2);
possible_match='No';
advisor3=advisor2;
if dif<=100 then possible_match='Yes';
else call missing(advisor3);
if possible_match='Yes' then FlagOutput=1;
else if FlagOutput eq 1 then FlagOutput=0;*case if one previous row matched;
if last.order and missing(FlagOutput) then FlagOutput=1;
drop tmp1 tmp2;
run;
data ds5;
set ds4(drop=advisor2);
rename advisor3=advisor2;
if FlagOutput ne 1 or missing(advisor3) then dif=.;
if FlagOutput eq 1 then output;
drop FlagOutput order;
run;
- Cheers -
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.