Hi,
I am using the sounds like operator to match two data sources. I noticed in my output that the soundex values returned for the matches do not necessary exactly equal one another. For example, the two soundex values 'B6622352235' and 'B66223' were considered a match. One thing I have noticed is that it seems that at least the first 4 digits of the matching soundex values are always the same. I cannot find any documentation anywhere which discusses how SAS is handling this. Does anyone know if SAS only attempts to match the first four characters when using sounds like operator or is something else going on here? I included an example code, but I do not necessarily think it helps much.Thank you!
I am using SAS 9.4 TS1M6.
-Mitch
PROC SQL;
CREATE TABLE Possible_Matches AS
SELECT *
FROM t1 AS n, t2 AS b
WHERE (n.name1 =* b.name2);
QUIT;
I don't think sounds like and soundex() are meant to handle numbers.
Even this fails:
where 'two' =* '2';
I think you are right.
Little test:
data test;
length name1 name2 $12;
input name1 name2;
s1 = soundex(name1);
s2 = soundex(name2);
datalines;
bcd bcdgt
bcdg bcdgt
bcdg bcdgtv
bcdg bcdrtv
bcdg bcdgrv
;
proc sql;
select
*,
name1 =* name2 as SoundsLike,
substr(soundex(name1),1,4) = substr(soundex(name2),1,4) as Soundex_4
from test;
quit;
name1 name2 s1 s2 SoundsLike Soundex_4 ----------------------------------------------------------------------------- bcd bcdgt B23 B2323 0 0 bcdg bcdgt B232 B2323 1 1 bcdg bcdgtv B232 B23231 1 1 bcdg bcdrtv B232 B23631 0 0 bcdg bcdgrv B232 B23261 1 1
Hi Reeza,
You are right that the algorithm truncates to 4 typically, but SAS does not follow this standard. It does not add trailing zeroes and allows the soundex lengths to be longer than 4. The statement you provided, along with my output, is what leads me to believe that it is likely matching the first 4 characters, but I can just not find anywhere in the documentation on using the sounds like operator that specifically states that it calculates past 4 characters in the soundex, but only matches on the first 4.
This statement below is from a SUGI paper on the topic:
"SOUNDEX function and Sounds-like operator in SAS
SAS system has included SOUNDEX function and sounds-like operator (=*) since version 6.07 in both DATA step and SQL
procedure (PROC SQL). But the SOUNDEX code generated by SAS is different from the standard American SOUNDEX code. The
differences are as follows.
1. SAS SOUNDEX function generates all the possible codes for the string. For example, Washington has a code of W25235
instead of W252. Even if there are spaces in the string, SOUNDEX function will generate the SOUNDEX code for the entire
string. So Hong Kong is coded as H52252. Notice here, 5 is for n, 2 is for g, the next 2 is coded because of the space between
the g and K, in this case, the Side-by-Side rule does not apply.
2. No additional zeros will be added. Jackson is coded as J25 instead of J250. There are no supplemental zeros being added to
make it a four-character code.
Sounds-like operator (=*) can be considered as the equal sign (=) between two SOUNDEX codes. For example, City1 =* City2 is the
same as SOUNDEX (City1) = SOUNDEX (City2). It has to be pointed out that when sounds-like operator is used in DATA step, it
can only be used in the WHERE statement."
https://support.sas.com/resources/papers/proceedings/proceedings/sugi29/072-29.pdf
Hi PGStats,
Thanks for your reply. Your test very similarly replicates what I am doing, so thanks for the extra evidence. I have to imagine that is what is happening here, but it would be nice if that was stated somewhere in the documentation. Thanks again for helping test.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.