SAS 9.4(TS1M8)
I've got lists of clinical laboratory tests from two different sources in datasets ListA and ListB. I need to evaluate each entry in ListA and find probable/possible/likely matches from ListB.
Here's a very simple example, one obs in each dataset.
dsn=ListA
varname=TestNameListA
COMPLETE CBC W/AUTO DIFF WBC
dsn=ListB
varname=TestNameListB
CBC with Differential
Visually and intuitively, I know these are both the same. How can SAS tell me these are the same or have a high probability of being the same?
I've tried SOUNDEX, but I'm sure what to do with the results.
word=soundex('COMPLETE CBC W/AUTO DIFF WBC');
put 'SoundEx1=' word;
word=soundex('CBC with Differential');
put 'SoundEx2=' word;
results:
SoundEx1=C5143212333112
SoundEx2=C123316534
I can see the substring "1233" is in common between the two, but does this tell me anything?
Is there a better function to use?
Thanks!
Any approach attempting to match strings of considerably different length is going to have some difficulty just because of the encoding.
Soundex for your problem may have another weekness. From the documention:
Note that the SOUNDEX algorithm is English-biased and is less useful for languages other than English.
While the language may be English abbreviations and likely not-common words appearing in clinical data may be treated by the algorithm as "not exactly English"
I might look to COMPGED and COMPLEV which basically return a numeric value related to spelling changes that is larger the more changes are necessary to get to the other string. There options to allow ignoring case in comparisons.
Smaller values are more likely to "match". So you can sort by the score returned and get a feel for how well any of this is doing.
Maybe:
proc sql; create table score as select a.TestNameListA, b.TestNameListB ,compged( a.TestNameListA, b.TestNameListB) as score from (select distinct TestNameListA from ListA) ,(select distinct TestNameListB from ListB) order by score ; quit;
Look in the options for Compged to see which may be applicable. Case insensitive may be likely with lots of acronyms.
You might be better off looking for Key values first though. You may have ideas what to look for. Other variables may also give you more information to use on even making the comparisons. I have a project where I have see if children from two sets are "matches". I have last name (usually), first name, date of birth and gender. So first I look for complete matches on all 4 variables. Then match lastname, date of birth and gender and compare the spelling of First name. That turns up differences like "Marie" and "Maria" or "Mark" and "Marc" as likely matches (low compged scores) but "Xavier" and "Robert" not so much.
Then after identifying those as matches go on to look at Firstname Lastname combinations (James Smyth vs Jim Smith) for example with matching dates of birth and gender.
Then look at matching dates of birth alone and name.
Then matching gender only.
But this only involves a couple 100 records. This is labor intensive.
Any approach attempting to match strings of considerably different length is going to have some difficulty just because of the encoding.
Soundex for your problem may have another weekness. From the documention:
Note that the SOUNDEX algorithm is English-biased and is less useful for languages other than English.
While the language may be English abbreviations and likely not-common words appearing in clinical data may be treated by the algorithm as "not exactly English"
I might look to COMPGED and COMPLEV which basically return a numeric value related to spelling changes that is larger the more changes are necessary to get to the other string. There options to allow ignoring case in comparisons.
Smaller values are more likely to "match". So you can sort by the score returned and get a feel for how well any of this is doing.
Maybe:
proc sql; create table score as select a.TestNameListA, b.TestNameListB ,compged( a.TestNameListA, b.TestNameListB) as score from (select distinct TestNameListA from ListA) ,(select distinct TestNameListB from ListB) order by score ; quit;
Look in the options for Compged to see which may be applicable. Case insensitive may be likely with lots of acronyms.
You might be better off looking for Key values first though. You may have ideas what to look for. Other variables may also give you more information to use on even making the comparisons. I have a project where I have see if children from two sets are "matches". I have last name (usually), first name, date of birth and gender. So first I look for complete matches on all 4 variables. Then match lastname, date of birth and gender and compare the spelling of First name. That turns up differences like "Marie" and "Maria" or "Mark" and "Marc" as likely matches (low compged scores) but "Xavier" and "Robert" not so much.
Then after identifying those as matches go on to look at Firstname Lastname combinations (James Smyth vs Jim Smith) for example with matching dates of birth and gender.
Then look at matching dates of birth alone and name.
Then matching gender only.
But this only involves a couple 100 records. This is labor intensive.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.