Alexxxxxxx
Pyrite | Level 9

Hello,

 

How can I find neighbouring repeated words in a string and extract the repeated word?

 

For example, for the table 'have'

APPLE LTD LTD 
USA Australia Japan USA
FOOTBALL LTD FOOTBALL LP

I would like to get

APPLE LTD LTD | LTD

This is because 'LTD LTD' contains the same word twice in a row. I would then like a new variable that holds the repeated word, which in this example is 'LTD'.

 

The rows

USA Australia Japan USA
FOOTBALL LTD FOOTBALL LP

should not be extracted: although they contain repeated words, the repeats are not adjacent.

Could you please give me some suggestions about this? Thanks in advance.

 

 

data have;
input string :$200.;
infile datalines dlm=',';
string=upcase(string);
datalines;
APPLE LTD LTD 
USA Australia Japan USA
FOOTBALL LTD FOOTBALL LP
;
run;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @Alexxxxxxx 

 

Some fun stuff

 


data have;
input string :$200.;
infile datalines dlm=',';
string=upcase(string);
datalines;
APPLE LTD LTD 
USA Australia Japan USA
FOOTBALL LTD FOOTBALL LP
APPLE LTD LTD INC INC
;
run;

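data want;
/* Create the hash object and its iterator once, on the first iteration */
if _n_ = 1 then do;
   dcl hash H () ;
   h.definekey  ("repeat") ;
   h.definedata ("repeat") ;
   h.definedone () ;
   dcl hiter hi('h');
 end;
set have;
/* Compare each word with the one before it; the hash keeps each repeated word once */
do _n_=2 to countw(string,' ');
 if scan(string,_n_,' ')=scan(string,_n_-1,' ') then do;
  repeat= scan(string,_n_,' ');
  h.replace();
 end;
end;
/* Write one row per distinct repeated word, then empty the hash for the next row */
do while(hi.next()=0);
 output;
end;
h.clear();
run;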


8 REPLIES
Reeza
Super User
How much data are you working with?
I usually recommend splitting the data into individual words first, which makes the analysis easier.

*Create sample data;
data random_sentences;
infile cards truncover;
informat sentence $256.;
input sentence $256.;
cards;
This is a random sentence
This is another random sentence
Happy Birthday
My job sucks.
This is a good idea, not.
This is an awesome idea!
How are you today?
Does this make sense?
Have a great day!
;
run;

*Partition into words;
data f1;
set random_sentences;
id=_n_;
nwords=countw(sentence);
nchar=length(compress(sentence));

do word_order=1 to nwords;
word=scan(sentence, word_order);
output;
end;
run;

https://github.com/statgeek/SAS-Tutorials/blob/master/text_analysis.sas
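
Once the text is in one-word-per-row form, adjacent repeats can be found by comparing each word with the previous word of the same sentence. The step below is only a sketch under some assumptions: it reads the f1 data set built above (already in id order), and the names adjacent_repeats and prev_word are made up for illustration. The comparison is made case-insensitive with UPCASE, which may or may not be what you want.

```sas
/* Sketch: assumes f1 from the step above, with id, sentence and word */
data adjacent_repeats;
  set f1;
  by id;
  /* Call LAG on every row so its queue stays in step with the data */
  prev_word = lag(word);
  /* Skip the first word of each sentence: its "previous" word belongs to another id */
  if not first.id and upcase(word) = upcase(prev_word) then output;
  keep id sentence word;
run;
```

Each output row holds a sentence together with one word that immediately repeats its predecessor, so a sentence with two separate repeated pairs produces two rows.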
novinosrin
Tourmaline | Level 20

Hello @Alexxxxxxx, are you expecting just one set of repeating words in the string, or possibly more? If more, what should the result look like?

Alexxxxxxx
Pyrite | Level 9

Hello,

 

for the table 

name
APPLE LTD LTD 
USA Australia Japan USA
FOOTBALL LTD FOOTBALL LP

 

I expect to get 

name | repeat
APPLE LTD LTD | LTD
novinosrin
Tourmaline | Level 20

I understood that. That is as simple as

 


data have;
input string :$200.;
infile datalines dlm=',';
string=upcase(string);
datalines;
APPLE LTD LTD 
USA Australia Japan USA
FOOTBALL LTD FOOTBALL LP
;
run;

data want;
set have;
do _n_=2 to countw(string,' ');
 if scan(string,_n_,' ')=scan(string,_n_-1,' ') then want=scan(string,_n_,' ');
end;
run;

My question, though, is what should happen if there is more than one set of repeating words.

 

For example,

APPLE LTD LTD INC INC

Alexxxxxxx
Pyrite | Level 9

Hello, 

 

Thanks for the reminder.

 

If that happens, I expect to get both of them.

 

Just like:

name | repeat
APPLE LTD LTD INC INC | LTD
APPLE LTD LTD INC INC | INC

 

Could you please give me some suggestions about this?

thanks a lot.


hashman
Ammonite | Level 13

@Alexxxxxxx:

Try this:

data have ;                                                               
  input @1 str & $upcase30. ;                                             
  cards ;                                                                 
APPLE LTD LTD                                                             
USA Australia Japan USA                                                   
FOOTBALL LTD FOOTBALL LP                                                  
APPLE LTD LTD INC INC                                                     
;                                                                         
run ;                                                                     
                                                                          
data want (drop = _:) ;                                                   
  set have ;                                                              
  length _s $ 32767 repeat $ 30 ;                                         
  do _x = 1 to countw (str) ;                                             
    repeat = scan (str, _x) ;                                             
    if repeat ne scan (str, _x + 1) or findw (_s, repeat) then continue ; 
    output ;                                                              
    _s = catx (" ", _s, repeat) ;                                         
  end ;                                                                   
run ;                                                                     

Kind regards

Paul D. 

PeterClemmensen
Tourmaline | Level 20
data have;
input string :$200.;
infile datalines dlm=',';
string=upcase(string);
datalines;
APPLE LTD LTD 
USA Australia Japan USA
FOOTBALL LTD FOOTBALL LP
APPLE LTD LTD INC INC
;
run;

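data want(drop=st s l);
    /* Compile the pattern once: a word followed by the same word again */
    if _N_ = 1 then _iorc_=prxparse('/\b(\w+)\b\s*(\1)\b/');
    set have;
    st=string;
    /* Output one row per match, then continue searching after that match */
    do while (prxmatch(_iorc_, st));
        repeat=prxposn(_iorc_, 2, st);
        output;
        /* s and l are the start and length of the second capture group */
        call prxposn(_iorc_, 2, s, l);
        st=substr(st, s+l+1);
    end;
run;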

Result:

 

string                  repeat
APPLE LTD LTD	        LTD
APPLE LTD LTD INC INC	LTD
APPLE LTD LTD INC INC	INC
