Re: Compare two variables and find any partial match

chelm24 · Posted 06-04-2023 07:55 PM

Hello,

I require assistance with comparing two variables in SAS and determining if there are any partial record matches.

data have;
infile datalines dlm='|' truncover;
length VAR1 $100 VAR2 $100;
input VAR1 $ VAR2 $;
datalines;
1.DR. MORRISON|
1.MORRISON| ABCFG MORRISON
1.DR. MORRISON| MORRISON
1.DR. MORRISON| DR. WRIGHT
1. LA HOSPITAL| SAN DIEGO
;

Expected result (data set want):

VAR1	VAR2	PARTIAL MATCH?
1.DR. MORRISON		NO
1.MORRISON	ABCFG MORRISON	YES
1.DR. MORRISON	MORRISON	YES
1.DR. MORRISON	DR. WRIGHT	NO
1. LA HOSPITAL	SAN DIEGO	NO

ballardw · Posted 06-04-2023 09:19 PM

Define your criteria for "partial match". Four letters the same in sequence? Five? Six? Some other rule?

There are several SAS functions that score the spelling "distance" between two strings, a measure of their similarity: COMPGED (generalized edit distance), COMPLEV (Levenshtein edit distance) and SPEDIS (spelling distance). I would try all three, and read the documentation, to see which best fits your data and need. The lower the score returned, the more similar the two values are.

data have;
infile datalines dlm='|' truncover;
length VAR1 $100 VAR2 $100;
input VAR1 $ VAR2 $;
Compgedscore = compged(var1, var2);
Complevscore = complev(var1, var2);
Spedisscore  = spedis(var1, var2);
datalines;
1.DR. MORRISON|
1.MORRISON| ABCFG MORRISON
1.DR. MORRISON| MORRISON
1.DR. MORRISON| DR. WRIGHT
1. LA HOSPITAL| SAN DIEGO
;

chelm24 · Posted 06-04-2023 09:51 PM

@ballardw, I need to determine which words match between the 2 variables, not a score. A partial match means at least 4 letters the same in sequence.

VAR1	VAR2	PARTIAL MATCH?	MATCH
1.DR. MORRISON		NO
1.MORRISON	ABCFG MORRISON	YES	MORRISON
1.DR. MORRISON	MORRISON	YES	MORRISON
1.DR. MORRISON	DR. WRIGHT	NO
1. LA HOSPITAL	SAN DIEGO	NO

Patrick · Posted 06-05-2023 03:45 AM

@chelm24 wrote:

@ballardw, I need to determine which words match between the 2 variables, not a score. A partial match means at least 4 letters the same in sequence.

That leaves two options: an exact match of a whole WORD, or an exact match of any string of 4 characters within a WORD.

Both should be doable, BUT with the 4-character option any two words will match as long as they share a sequence of 4 identical characters.
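For the 4-character option, something along these lines might work. It is only a rough sketch, not a tested solution: it slides a 4-character window across each word of VAR1 and looks for that window anywhere in VAR2, ignoring case. The delimiter list and the names want_4char, chunk, p and PARTIAL_MATCH are my own assumptions, not anything from the original post.

```sas
/* Sample data from the question; TRUNCOVER guards against short lines */
data have;
  infile datalines dlm='|' truncover;
  length VAR1 $100 VAR2 $100;
  input VAR1 $ VAR2 $;
datalines;
1.DR. MORRISON|
1.MORRISON| ABCFG MORRISON
1.DR. MORRISON| MORRISON
1.DR. MORRISON| DR. WRIGHT
1. LA HOSPITAL| SAN DIEGO
;

data want_4char;  /* hypothetical output name */
  set have;
  length word $100 chunk $4 PARTIAL_MATCH $3;
  PARTIAL_MATCH = 'NO';
  /* split VAR1 into words; the delimiter list is an assumption */
  do i = 1 to countw(var1, ' ,.()-') while (PARTIAL_MATCH = 'NO');
    word = scan(var1, i, ' ,.()-');
    /* slide a 4-character window across the word; short words give no windows */
    do p = 1 to length(word) - 3 while (PARTIAL_MATCH = 'NO');
      chunk = substr(word, p, 4);
      /* a window holds no delimiters, so any hit in VAR2 lies inside one word */
      if find(var2, chunk, 'i') > 0 then PARTIAL_MATCH = 'YES';
    end;
  end;
  if PARTIAL_MATCH = 'NO' then chunk = ' ';  /* no shared sequence found */
  drop i p word;
run;
```

When a match is found, chunk holds the first shared 4-character sequence. Keep in mind the caveat above: unrelated words such as HOSPITAL and CAPITAL would also be flagged, because they share ITAL.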

Tom · Posted 06-05-2023 08:33 AM

Just test each word.

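data want ;
  set have;
  /* walk the words of VAR1 and stop at the first one found in VAR2 */
  do i=1 to countw(var1,' ,.()-') until(found);
    word=scan(var1,i,' ,.()-');
    /* only words of 4+ characters count; 'i' ignores case, 't' trims trailing blanks */
    if length(word)>3 then found = 0<findw(var2,word,' ,.()-','it');
  end;
  /* no match: clear the leftover loop values */
  if not found then do;
     word=' ';
     i=0;
  end;
run;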

Obs         VAR1         VAR2              i    found      word

 1     1.DR. MORRISON                      0      0
 2     1.MORRISON        ABCFG MORRISON    2      1      MORRISON
 3     1.DR. MORRISON    MORRISON          3      1      MORRISON
 4     1.DR. MORRISON    DR. WRIGHT        0      0
 5     1. LA HOSPITAL    SAN DIEGO         0      0
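
To get from that output to the YES/NO layout in the original request, one more step could relabel the pieces. This is a minimal sketch, assuming the want data set from the step above exists. The names want_report, PARTIAL_MATCH and MATCH are placeholders of mine.

```sas
data want_report;  /* hypothetical name */
  set want;        /* assumes the WANT data set built by the word-by-word step */
  length PARTIAL_MATCH $3 MATCH $100;
  /* FOUND is 1 for a match; 0 or missing counts as no match */
  if found then PARTIAL_MATCH = 'YES';
  else PARTIAL_MATCH = 'NO';
  MATCH = word;    /* blank when nothing matched */
  drop i found word;
run;
```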

Ksharp · Posted 06-05-2023 07:35 AM

data have;
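infile datalines dlm='|' truncover;
length VAR1 $100 VAR2 $100;
input VAR1 $ VAR2 $;

/* keep only the letters of each value, then test whether either one contains the other, ignoring case */
if find(compress(var1,,'ka'),compress(var2,,'ka'),'i') or
   find(compress(var2,,'ka'),compress(var1,,'ka'),'i') then MATCH='Yes' ;
  else MATCH='No ' ;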

datalines;
1.DR. MORRISON|
1.MORRISON| ABCFG MORRISON
1.DR. MORRISON| MORRISON
1.DR. MORRISON| DR. WRIGHT
1. LA HOSPITAL| SAN DIEGO
;
