Re: Matching/Removing the exact duplicate word using FIND(W) function

keen_sas · Posted 01-07-2020 11:59 PM

I have strings in which some words closely resemble other words, differing by only a few characters. When I try to identify duplicates with the FIND/FINDW function, such words are treated as the same word and one of them is removed. For example, JAPAN and JAPANESE are distinct words in the same string, but they are treated as the same word and one of them is deleted. The same happens with FATEST and FATESTCD. How can I test for an exact word match, so that only true duplicates are removed, using FIND/FINDW or PRXMATCH?


data have;
infile datalines dlm=',';   /* comma delimiter, so each line is read as one value */
input string :$200.;
datalines;
apple orange kiwi apple grapes strawberry peach kiwi peach
China USA UK Australia Japanese USA UK Australian Japan Chinase
FOOTBALL BasketBall basketball Hockey football
FACAT FATESTCD FATEST FAOBJ STDT STDTC VISIT VISITNUM
;

data want(keep=string newstring);
   set have;
   newstring=scan(string, 1, ' ');
   do i=2 to countw(string,' ');
      word=scan(string, i, ' ');
      found=find(newstring, word, 'it');   
/*	  fnd=findw(newstring, word, 'it');*/
      if found=0 then newstring=catx(' ', newstring, word);
   end;
run;

andreas_lds · Posted 01-08-2020 01:07 AM

Maybe it is too early for my brain, but what do you expect as the result?

keen_sas · Posted 01-08-2020 02:34 AM

Current output:

newstring
apple orange kiwi grapes strawberry peach
China USA UK Australia Japanese Australian Chinase ----> Japan is deleted here, though it is a unique word
FOOTBALL BasketBall Hockey
FACAT FATESTCD FAOBJ STDT STDTC VISIT VISITNUM ----> FATEST is deleted here, though it is a unique word

Expected output:

apple orange kiwi grapes strawberry peach
China USA UK Australia Japanese Australian Japan Chinase
FOOTBALL BasketBall Hockey
FACAT FATESTCD FATEST FAOBJ STDT STDTC VISIT VISITNUM
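
The gap between the current and the expected output seems to come down to substring search versus whole-word search. A minimal sketch of that difference, using one literal string taken from the sample data (the variable names `as_substring` and `as_word` are made up for illustration):

```sas
data _null_;
   /* FIND searches for a substring, so 'Japan' is located inside 'Japanese' */
   as_substring = find('China USA UK Australia Japanese', 'Japan', 'it');

   /* FINDW only matches when the word is bounded by delimiters (or the   */
   /* ends of the string), so 'Japan' is not matched inside 'Japanese'    */
   as_word = findw('China USA UK Australia Japanese', 'Japan', ' ', 'sit');

   put as_substring= as_word=;
run;
```

In the log, `as_substring` should be greater than zero and `as_word` should be 0, which is why the FIND-based loop drops Japan.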

andreas_lds · Posted 01-08-2020 02:56 AM

The third argument of FINDW is the list of delimiter characters; you can't skip that parameter if you want to pass modifiers. So try:

found = findw(newstring, word, ' ', 'sit');

The S modifier has to be added because T affects the third argument, too: the blank delimiter is trimmed away, and S adds space characters back to the delimiter list.
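
A minimal sketch of how that call might slot into the loop from the question, assuming the `have` data set posted above. The explicit `length` statement is an addition for clarity and not part of the original program:

```sas
data want(keep=string newstring);
   set have;
   length newstring $200 word $200;
   newstring = scan(string, 1, ' ');
   do i = 2 to countw(string, ' ');
      word = scan(string, i, ' ');
      /* 3rd argument: delimiter list; 4th argument: modifiers          */
      /*   s = add space characters to the delimiter list               */
      /*   i = ignore case                                              */
      /*   t = trim trailing blanks from the arguments                  */
      found = findw(newstring, word, ' ', 'sit');
      /* append the word only if it is not already present as a word    */
      if found = 0 then newstring = catx(' ', newstring, word);
   end;
run;
```

With the sample data this is intended to keep both Japanese and Japan, and both FATESTCD and FATEST, while still dropping case-insensitive repeats such as football.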

unison · Posted 01-08-2020 01:08 AM

This is my favorite document on PRXMATCH and the other Perl regular expression functions in SAS:

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

I use this all of the time.

Take a look!

-unison
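
Since the question also mentions PRXMATCH, here is a rough sketch of a regular-expression variant of the same loop. It assumes the `have` data set from the question and that every word consists only of letters, digits, or underscores, so that `\b` (word boundary) behaves as expected and no metacharacters need escaping. The output data set name `want_prx` is made up for illustration:

```sas
data want_prx(keep=string newstring);
   set have;
   length newstring $200 word $200;
   newstring = scan(string, 1, ' ');
   do i = 2 to countw(string, ' ');
      word = scan(string, i, ' ');
      /* Build a pattern such as /\bJapan\b/i :                          */
      /*   \b = word boundary, trailing i = case-insensitive match.      */
      /* CATS strips the trailing blanks from word before concatenating. */
      found = prxmatch(cats('/\b', word, '\b/i'), newstring);
      if found = 0 then newstring = catx(' ', newstring, word);
   end;
run;
```

Because the pattern changes on every iteration it is recompiled on each call, so for large data the FINDW approach is probably the lighter choice.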

KachiM · Posted 01-08-2020 04:36 AM

Hi @keen_sas ,

I do not see any issue in your program. You may add lengths for newstring and word.

Here is the code I tried, which seems to give what you want:

data want(keep=string newstring);
   set have;
   length word $30 newstring $200;
   newstring=scan(string, 1, ' ');
   do i=2 to countw(string,' ');
      word=scan(string, i, ' ');
      found=find(newstring, word, 'it');   
      if found=0 then newstring=catx(' ', newstring, word);
   end;
run;

Ksharp · Posted 01-08-2020 06:27 AM

I don't see anything wrong in the output if you use FINDW().


data want(keep=string newstring);
   set have;
   newstring=scan(string, 1, ' ');
   do i=2 to countw(string,' ');
      word=scan(string, i, ' ');
      /* found=find(newstring, word, 'it'); */
      /* pass the delimiter explicitly so that 'sit' is read as modifiers */
      found=findw(newstring, word, ' ', 'sit');
      if found=0 then newstring=catx(' ', newstring, word);
   end;
run;
