keen_sas
Quartz | Level 8

I have a string that contains duplicate words, and also words that differ from other words by only a few characters. When I try to identify the duplicates with the FIND/FINDW functions, the near-matches are treated as the same word and removed. For example, JAPAN and JAPANESE are distinct words in the same string, but FIND/FINDW treats them as the same word and deletes one of them. The same happens with FATEST and FATESTCD. How can I match whole words exactly, using FIND/FINDW or PRXMATCH, so that only true duplicates are removed?


data have;
/* INFILE must come before INPUT so that dlm=',' applies when the line is read */
infile datalines dlm=',';
input string :$200.;
datalines;
apple orange kiwi apple grapes strawberry peach kiwi peach
China USA UK Australia Japanese USA UK Australian Japan Chinase
FOOTBALL BasketBall basketball Hockey football
FACAT FATESTCD FATEST FAOBJ STDT STDTC VISIT VISITNUM
;

data want(keep=string newstring);
   set have;
   newstring=scan(string, 1, ' ');
   do i=2 to countw(string,' ');
      word=scan(string, i, ' ');
      found=find(newstring, word, 'it');   
/*	  fnd=findw(newstring, word, 'it');*/
      if found=0 then newstring=catx(' ', newstring, word);
   end;
run;

  

andreas_lds
Jade | Level 19

Maybe it is too early for my brain, but what do you expect as the result?

keen_sas
Quartz | Level 8

Current Output

newstring
apple orange kiwi grapes strawberry peach
China USA UK Australia Japanese Australian Chinase ----> Japan is deleted here, though it is a unique word
FOOTBALL BasketBall Hockey
FACAT FATESTCD FAOBJ STDT STDTC VISIT VISITNUM ----> FATEST is deleted here, though it is a unique word

Expected output:

apple orange kiwi grapes strawberry peach
China USA UK Australia Japanese Australian Japan Chinase
FOOTBALL BasketBall Hockey
FACAT FATESTCD FATEST FAOBJ STDT STDTC VISIT VISITNUM 
andreas_lds
Jade | Level 19

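The third argument of FINDW is the list of delimiter characters. You can't skip that parameter if you want to use the modifiers parameter, so in findw(newstring, word, 'it') the letters i and t are taken as delimiters, not as modifiers. So try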

found = findw(newstring, word, ' ', 'sit');

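The S modifier has to be added because T trims trailing blanks from the third argument too, which would leave the delimiter list empty. S adds the space characters back to the list of delimiters.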
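
For completeness, here is a minimal sketch of the original step with that FINDW call dropped in. It assumes the HAVE data set from the question; the LENGTH statement and the comments are additions for illustration, not part of the original code.

```sas
/* Sketch: remove duplicate words, matching whole words only and ignoring case */
data want(keep=string newstring);
   set have;
   length word $200 newstring $200;
   newstring = scan(string, 1, ' ');
   do i = 2 to countw(string, ' ');
      word = scan(string, i, ' ');
      /* 3rd argument = delimiter list, 4th argument = modifiers:      */
      /*   s = add space characters to the delimiters                  */
      /*   i = ignore case                                             */
      /*   t = trim trailing blanks from the arguments                 */
      found = findw(newstring, word, ' ', 'sit');
      if found = 0 then newstring = catx(' ', newstring, word);
   end;
run;
```

Because FINDW only reports a hit when the word is bounded by delimiters (or by the start or end of the string), Japan is no longer matched inside Japanese and FATEST is no longer matched inside FATESTCD, while FOOTBALL and football still count as the same word.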

unison
Lapis Lazuli | Level 10

This is my favorite document on PRXMATCH and the other Perl regular expression functions in SAS:

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

 

I use this all of the time. 

Take a look!

-unison
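
Since the question also asks about PRXMATCH, here is a rough sketch of the same loop using a word-boundary pattern. It assumes the HAVE data set from the question and that every word consists only of letters, digits, or underscores; a word containing regular expression metacharacters would need escaping first. The LENGTH statement is an addition for illustration.

```sas
/* Sketch: whole-word, case-insensitive duplicate check with PRXMATCH */
data want(keep=string newstring);
   set have;
   length word $200 newstring $200;
   newstring = scan(string, 1, ' ');
   do i = 2 to countw(string, ' ');
      word = scan(string, i, ' ');
      /* \b matches a word boundary, the trailing i makes the match    */
      /* case-insensitive; CATS strips the trailing blanks from WORD   */
      found = prxmatch(cats('/\b', word, '\b/i'), newstring);
      if found = 0 then newstring = catx(' ', newstring, word);
   end;
run;
```

The pattern is rebuilt for every word, so it is compiled on each call. That is fine for short strings like these, but FINDW is likely the simpler choice when the words are plain text.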
KachiM
Rhodochrosite | Level 12

Hi @keen_sas ,

 

I do not see any issue in your program. You may add lengths for newstring and word.

 

Here is the code I tried, which seems to give what you want:

 

data want(keep=string newstring);
   set have;
   length word $30 newstring $200;
   newstring=scan(string, 1, ' ');
   do i=2 to countw(string,' ');
      word=scan(string, i, ' ');
      found=find(newstring, word, 'it');   
      if found=0 then newstring=catx(' ', newstring, word);
   end;
run;
Ksharp
Super User

I don't see anything wrong in the output if you use FINDW().

 


data want(keep=string newstring);
   set have;
   newstring=scan(string, 1, ' ');
   do i=2 to countw(string,' ');
      word=scan(string, i, ' ');
      /* found=find(newstring, word, 'it');  */
      /* the third argument of FINDW is the delimiter list; modifiers go in the fourth */
      found=findw(newstring, word, ' ', 'sit');
      if found=0 then newstring=catx(' ', newstring, word);
   end;
run;
