BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Doug_in_STL
Fluorite | Level 6

SAS 9.4(TS1M8)

 

I've got lists of clinical laboratory tests from two different sources in datasets ListA and ListB. I need to evaluate each entry in ListA and find probable/possible/likely matches from ListB.

 

Here's a very simple example, one obs in each dataset.

dsn=ListA

varname=TestNameListA

COMPLETE CBC W/AUTO DIFF WBC

 

dsn=ListB

varname=TestNameListB

CBC with Differential

 

Visually and intuitively, I know these are both the same. How can SAS tell me these are the same or have a high probability of being the same?

 

I've tried SOUNDEX, but I'm sure what to do with the results.

word=soundex('COMPLETE CBC W/AUTO DIFF WBC');
put 'SoundEx1=' word;
word=soundex('CBC with Differential');
put 'SoundEx2=' word;

 

results:

SoundEx1=C5143212333112
SoundEx2=C123316534

 

I can see the substring "1233" is in common between the two, but does this tell me anything?

 

Is there a better function to use?

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Any approach attempting to match strings of considerably different length is going to have some difficulty just because of the encoding.

Soundex for your problem may have another weekness. From the documention:

Note that the SOUNDEX algorithm is English-biased and is less useful for languages other than English.

While the language may be English abbreviations and likely not-common words appearing in clinical data may be treated by the algorithm as "not exactly English"

 

I might look to COMPGED and COMPLEV which basically return a numeric value related to spelling changes that is larger the more changes are necessary to get to the other string. There options to allow ignoring case in comparisons.

Smaller values are more likely to "match". So you can sort by the score returned and get a feel for how well any of this is doing.

Maybe:

proc sql;
   create table score as
   select a.TestNameListA, b.TestNameListB
         ,compged( a.TestNameListA, b.TestNameListB) as score
   from (select distinct TestNameListA from ListA)
       ,(select distinct TestNameListB from ListB)
   order by score
   ;
quit;

Look in the options for Compged to see which may be applicable. Case insensitive may be likely with lots of acronyms.

 

You might be better off looking for Key values first though. You may have ideas what to look for.  Other variables may also give you more information to use on even making the comparisons. I have a project where I have see if children from two sets are "matches". I have last name (usually), first name, date of birth and gender. So first I look for complete matches on all 4 variables. Then match lastname, date of birth and gender and compare the spelling of First name.  That turns up differences like "Marie" and "Maria" or "Mark" and "Marc" as likely  matches (low compged scores) but "Xavier" and "Robert" not so much.

Then after identifying those as matches go on to look at Firstname Lastname combinations (James Smyth vs Jim Smith) for example with matching dates of birth and gender.

Then look at matching dates of birth alone and name.

Then matching gender only.

 

But this only involves a couple 100 records. This is labor intensive.

 

View solution in original post

2 REPLIES 2
ballardw
Super User

Any approach attempting to match strings of considerably different length is going to have some difficulty just because of the encoding.

Soundex for your problem may have another weekness. From the documention:

Note that the SOUNDEX algorithm is English-biased and is less useful for languages other than English.

While the language may be English abbreviations and likely not-common words appearing in clinical data may be treated by the algorithm as "not exactly English"

 

I might look to COMPGED and COMPLEV which basically return a numeric value related to spelling changes that is larger the more changes are necessary to get to the other string. There options to allow ignoring case in comparisons.

Smaller values are more likely to "match". So you can sort by the score returned and get a feel for how well any of this is doing.

Maybe:

proc sql;
   create table score as
   select a.TestNameListA, b.TestNameListB
         ,compged( a.TestNameListA, b.TestNameListB) as score
   from (select distinct TestNameListA from ListA)
       ,(select distinct TestNameListB from ListB)
   order by score
   ;
quit;

Look in the options for Compged to see which may be applicable. Case insensitive may be likely with lots of acronyms.

 

You might be better off looking for Key values first though. You may have ideas what to look for.  Other variables may also give you more information to use on even making the comparisons. I have a project where I have see if children from two sets are "matches". I have last name (usually), first name, date of birth and gender. So first I look for complete matches on all 4 variables. Then match lastname, date of birth and gender and compare the spelling of First name.  That turns up differences like "Marie" and "Maria" or "Mark" and "Marc" as likely  matches (low compged scores) but "Xavier" and "Robert" not so much.

Then after identifying those as matches go on to look at Firstname Lastname combinations (James Smyth vs Jim Smith) for example with matching dates of birth and gender.

Then look at matching dates of birth alone and name.

Then matching gender only.

 

But this only involves a couple 100 records. This is labor intensive.

 

Doug_in_STL
Fluorite | Level 6
Thanks, you've given me some good ideas.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 9010 views
  • 1 like
  • 2 in conversation