BookmarkSubscribeRSS Feed
mitchell_keener
Obsidian | Level 7

Hi, 

 

I am using the sounds like operator to match two data sources. I noticed in my output that the soundex values returned for the matches do not necessary exactly equal one another. For example, the two soundex values 'B6622352235' and 'B66223' were considered a match. One thing I have noticed is that it seems that at least the first 4 digits of the matching soundex values are always the same. I cannot find any documentation anywhere which discusses how SAS is handling this. Does anyone know if SAS only attempts to match the first four characters when using sounds like operator or is something else going on here?  I included an example code, but I do not necessarily think it helps much.Thank you!

 

I am using SAS 9.4 TS1M6. 

 

-Mitch

PROC SQL;
CREATE TABLE Possible_Matches AS
SELECT *
FROM t1 AS n, t2 AS b
WHERE (n.name1 =* b.name2);
QUIT;

 

7 REPLIES 7
ChrisNZ
Tourmaline | Level 20

I don't think sounds like and soundex() are meant to handle numbers.

Even this fails:

where 'two' =* '2';

 

 

 

Reeza
Super User
If you're doing company matching, I'll usually pull my numbered companies out and then do the sounds like comparison or COMPGED which is usually more accurate.
PGStats
Opal | Level 21

I think you are right.

 

Little test:

 

data test;
length name1 name2 $12;
input name1 name2;
s1 = soundex(name1);
s2 = soundex(name2);
datalines;
bcd bcdgt
bcdg bcdgt
bcdg bcdgtv
bcdg bcdrtv
bcdg bcdgrv
;

proc sql;
select 
    *, 
    name1 =* name2 as SoundsLike,
    substr(soundex(name1),1,4) = substr(soundex(name2),1,4) as Soundex_4
from test;
quit;
   name1         name2         s1            s2            SoundsLike  Soundex_4
   -----------------------------------------------------------------------------
   bcd           bcdgt         B23           B2323                  0          0
   bcdg          bcdgt         B232          B2323                  1          1
   bcdg          bcdgtv        B232          B23231                 1          1
   bcdg          bcdrtv        B232          B23631                 0          0
   bcdg          bcdgrv        B232          B23261                 1          1
PG
Reeza
Super User
From the documentation of the algorithm:

The algorithm that is described in Knuth adds trailing zeros and truncates the result to the length of 4. You can perform these operations with other SAS functions.
mitchell_keener
Obsidian | Level 7

Hi Reeza, 

 

You are right that the algorithm truncates to 4 typically, but SAS does not follow this standard. It does not add trailing zeroes and allows the soundex lengths to be longer than 4. The statement you provided, along with my output, is what leads me to believe that it is likely matching the first 4 characters, but I can just not find anywhere in the documentation on using the sounds like operator that specifically states that it calculates past 4 characters in the soundex, but only matches on the first 4. 

 

This statement below is from a SUGI paper on the topic: 

 

"SOUNDEX function and Sounds-like operator in SAS
SAS system has included SOUNDEX function and sounds-like operator (=*) since version 6.07 in both DATA step and SQL
procedure (PROC SQL). But the SOUNDEX code generated by SAS is different from the standard American SOUNDEX code. The
differences are as follows.
1. SAS SOUNDEX function generates all the possible codes for the string. For example, Washington has a code of W25235
instead of W252. Even if there are spaces in the string, SOUNDEX function will generate the SOUNDEX code for the entire
string. So Hong Kong is coded as H52252. Notice here, 5 is for n, 2 is for g, the next 2 is coded because of the space between
the g and K, in this case, the Side-by-Side rule does not apply.
2. No additional zeros will be added. Jackson is coded as J25 instead of J250. There are no supplemental zeros being added to
make it a four-character code.
Sounds-like operator (=*) can be considered as the equal sign (=) between two SOUNDEX codes. For example, City1 =* City2 is the
same as SOUNDEX (City1) = SOUNDEX (City2). It has to be pointed out that when sounds-like operator is used in DATA step, it
can only be used in the WHERE statement."

 

https://support.sas.com/resources/papers/proceedings/proceedings/sugi29/072-29.pdf

mitchell_keener
Obsidian | Level 7

Hi PGStats, 

 

Thanks for your reply. Your test very similarly replicates what I am doing, so thanks for the extra evidence. I have to imagine that is what is happening here, but it would be nice if that was stated somewhere in the documentation. Thanks again for helping test. 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 3688 views
  • 1 like
  • 4 in conversation