Solved: Find possible matches between two lists: SOUNDEX or something else?

Doug_in_STL · Posted 08-01-2023 01:06 PM

SAS 9.4(TS1M8)

I've got lists of clinical laboratory tests from two different sources in datasets ListA and ListB. I need to evaluate each entry in ListA and find probable/possible/likely matches from ListB.

Here's a very simple example, one obs in each dataset.

dsn=ListA

varname=TestNameListA

COMPLETE CBC W/AUTO DIFF WBC

dsn=ListB

varname=TestNameListB

CBC with Differential

Visually and intuitively, I know these are both the same. How can SAS tell me these are the same or have a high probability of being the same?

I've tried SOUNDEX, but I'm sure what to do with the results.

word=soundex('COMPLETE CBC W/AUTO DIFF WBC');
put 'SoundEx1=' word;
word=soundex('CBC with Differential');
put 'SoundEx2=' word;

results:

SoundEx1=C5143212333112
SoundEx2=C123316534

I can see the substring "1233" is in common between the two, but does this tell me anything?

Is there a better function to use?

Thanks!

ballardw · Posted 08-01-2023 03:38 PM

Any approach attempting to match strings of considerably different length is going to have some difficulty just because of the encoding.

Soundex for your problem may have another weekness. From the documention:

Note that the SOUNDEX algorithm is English-biased and is less useful for languages other than English.

While the language may be English abbreviations and likely not-common words appearing in clinical data may be treated by the algorithm as "not exactly English"

I might look to COMPGED and COMPLEV which basically return a numeric value related to spelling changes that is larger the more changes are necessary to get to the other string. There options to allow ignoring case in comparisons.

Smaller values are more likely to "match". So you can sort by the score returned and get a feel for how well any of this is doing.

Maybe:

proc sql;
   create table score as
   select a.TestNameListA, b.TestNameListB
         ,compged( a.TestNameListA, b.TestNameListB) as score
   from (select distinct TestNameListA from ListA)
       ,(select distinct TestNameListB from ListB)
   order by score
   ;
quit;

Look in the options for Compged to see which may be applicable. Case insensitive may be likely with lots of acronyms.

You might be better off looking for Key values first though. You may have ideas what to look for. Other variables may also give you more information to use on even making the comparisons. I have a project where I have see if children from two sets are "matches". I have last name (usually), first name, date of birth and gender. So first I look for complete matches on all 4 variables. Then match lastname, date of birth and gender and compare the spelling of First name. That turns up differences like "Marie" and "Maria" or "Mark" and "Marc" as likely matches (low compged scores) but "Xavier" and "Robert" not so much.

Then after identifying those as matches go on to look at Firstname Lastname combinations (James Smyth vs Jim Smith) for example with matching dates of birth and gender.

Then look at matching dates of birth alone and name.

Then matching gender only.

But this only involves a couple 100 records. This is labor intensive.

View solution in original post

ballardw · Posted 08-01-2023 03:38 PM

Any approach attempting to match strings of considerably different length is going to have some difficulty just because of the encoding.

Soundex for your problem may have another weekness. From the documention:

Note that the SOUNDEX algorithm is English-biased and is less useful for languages other than English.

While the language may be English abbreviations and likely not-common words appearing in clinical data may be treated by the algorithm as "not exactly English"

I might look to COMPGED and COMPLEV which basically return a numeric value related to spelling changes that is larger the more changes are necessary to get to the other string. There options to allow ignoring case in comparisons.

Smaller values are more likely to "match". So you can sort by the score returned and get a feel for how well any of this is doing.

Maybe:

proc sql;
   create table score as
   select a.TestNameListA, b.TestNameListB
         ,compged( a.TestNameListA, b.TestNameListB) as score
   from (select distinct TestNameListA from ListA)
       ,(select distinct TestNameListB from ListB)
   order by score
   ;
quit;

Look in the options for Compged to see which may be applicable. Case insensitive may be likely with lots of acronyms.

You might be better off looking for Key values first though. You may have ideas what to look for. Other variables may also give you more information to use on even making the comparisons. I have a project where I have see if children from two sets are "matches". I have last name (usually), first name, date of birth and gender. So first I look for complete matches on all 4 variables. Then match lastname, date of birth and gender and compare the spelling of First name. That turns up differences like "Marie" and "Maria" or "Mark" and "Marc" as likely matches (low compged scores) but "Xavier" and "Robert" not so much.

Then after identifying those as matches go on to look at Firstname Lastname combinations (James Smyth vs Jim Smith) for example with matching dates of birth and gender.

Then look at matching dates of birth alone and name.

Then matching gender only.

But this only involves a couple 100 records. This is labor intensive.

Doug_in_STL · Posted 08-02-2023 08:05 AM

Thanks, you've given me some good ideas.

Find possible matches between two lists: SOUNDEX or something else?

Re: Find possible matches between two lists: SOUNDEX or something else?

Re: Find possible matches between two lists: SOUNDEX or something else?

Re: Find possible matches between two lists: SOUNDEX or something else?

Find possible matches between two lists: SOUNDEX or something else?

Re: Find possible matches between two lists: SOUNDEX or something else?

Re: Find possible matches between two lists: SOUNDEX or something else?

Re: Find possible matches between two lists: SOUNDEX or something else?

Registration is open

SAS Training: Just a Click Away