Hi,
I have two datasets:
ds1
Advisor |
JP_Morgan |
JP_Morgan_Chase___Co |
JP_Morgan_Secs__Asia |
JP_Morgan_Securities |
Lehman_Brothers |
Lehman_Brothers_Kuhn |
Laidlaw___Co__UK__Lt |
Laidlaw_Adams___Peck |
Laidlaw_Ansbacher |
Laidlaw_Equities_Inc |
Macquarie_Capital__U |
Macquarie_Capital_Ad |
Macquarie_Capital_Gr |
Macquarie_Capital_Pa |
ds2
Advisor |
JP_Morgan |
Laidlaw |
Macquarie_Capital |
Lehman_Brothers |
When I run the code below using soundex and compged ( I set a threshold level of 100 for the compged) I get the outcome given below the code:
data advisors;
length possible_match $3;
set ds1;
tmp1=soundex(advisor);
possible_match='No';
do i=1 to nobs;
set ds2(rename=(advisor=advisor2)) point=i nobs=nobs;
tmp2=soundex(advisor2);
dif=compged(tmp1,tmp2);
if dif<=100 then do;
possible_match='Yes';
drop i tmp1 tmp2;
output;
end;
end;
if possible_match='No' then do;
call missing(advisor2);
output;
end;
run;
Output:
possible_match | Advisor | advisor2 | dif |
Yes | JP_Morgan | JP_Morgan | 0 |
No | JP_Morgan_Chase___Co | 590 | |
No | JP_Morgan_Secs__Asia | 590 | |
No | JP_Morgan_Securities | 690 | |
Yes | Lehman_Brothers | Lehman_Brothers | 0 |
Yes | Lehman_Brothers_Kuhn | Lehman_Brothers | 70 |
No | Laidlaw___Co__UK__Lt | 610 | |
No | Laidlaw_Adams___Peck | 600 | |
No | Laidlaw_Ansbacher | 520 | |
No | Laidlaw_Equities_Inc | 600 | |
Yes | Macquarie_Capital__U | Macquarie_Capital | 0 |
Yes | Macquarie_Capital_Ad | Macquarie_Capital | 50 |
Yes | Macquarie_Capital_Gr | Macquarie_Capital | 100 |
Yes | Macquarie_Capital_Pa | Macquarie_Capital | 50 |
But when I change the threshold score to a different value, say 500 then the compged scores (given by the diff column) changes entirely! Why is this? How can I prevent this from happening?
Also in the third column, rather than getting blanks ( if the compegd score is greater than the threshold) if I want to get the same name in the second column how should the code be amended?
Any help would be greatly appreciated! 🙂
Thanks!
Hi aaou,
not the compged score is changing, only the records you output
and increasing the threshold as well increases the number of records you output.
In your proceeding, if you have a match in ds2, you output the dif, that's fine.
But if you have no match you always output the dif with 'Lehman_Brothers' of ds2, not sure if this dif is of interest.
Creating a cartesian product of ds1 and ds2 could be a good method to control the output.
Cheers
- Cheers -
Hi Oligolas,
Thank you very much for showing me what's going on...
Will you be able to show me how I can change my code ? ( And no, the dif you have mentione is not needed).
Thank you!!
Best regards,
aaou
Hm ok.. I split the tasks in several steps so you can control the output.
I added an order variable to keep your order along the tasks.
data ds1;
length Advisor $50;
input Advisor $;
datalines;
JP_Morgan
JP_Morgan_Chase___Co
JP_Morgan_Secs__Asia
JP_Morgan_Securities
Lehman_Brothers
Lehman_Brothers_Kuhn
Laidlaw___Co__UK__Lt
Laidlaw_Adams___Peck
Laidlaw_Ansbacher
Laidlaw_Equities_Inc
Macquarie_Capital__U
Macquarie_Capital_Ad
Macquarie_Capital_Gr
Macquarie_Capital_Pa
;
run;
data ds1;
set ds1;
order+1;
run;
data ds2;
length Advisor $50;
input Advisor $;
datalines;
JP_Morgan
Laidlaw
Macquarie_Capital
Lehman_Brothers
;
run;
data ds2;
set ds2;
order+1;
run;
PROC SQL;
CREATE TABLE ds3 AS
SELECT a.*,
b.advisor AS advisor2
FROM ds1 a,
ds2 b
ORDER BY a.order,b.order
;
QUIT;
data ds4;
set ds3;
length possible_match $3 advisor3 $50;
retain FlagOutput .;
by order;
if first.order then FlagOutput=.;
tmp1=soundex(advisor);
tmp2=soundex(advisor2);
dif=compged(tmp1,tmp2);
possible_match='No';
advisor3=advisor2;
if dif<=100 then possible_match='Yes';
else call missing(advisor3);
if possible_match='Yes' then FlagOutput=1;
else if FlagOutput eq 1 then FlagOutput=0;*case if one previous row matched;
if last.order and missing(FlagOutput) then FlagOutput=1;
drop tmp1 tmp2;
run;
data ds5;
set ds4(drop=advisor2);
rename advisor3=advisor2;
if FlagOutput ne 1 or missing(advisor3) then dif=.;
if FlagOutput eq 1 then output;
drop FlagOutput order;
run;
- Cheers -
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.