Hi ,
I have repeated sub strings/duplicate words in a string . I would like to remove the repeated words from the string and
keep only the unique values in the string.Words are mix of cases. Any shortcut approach to do this. i am trying to do this using arrays and do loop .
Have | Want |
apple orange kiwi apple grapes strawberry peach kiwi peach | apple orange kiwi grapes strawberry peach |
USA UK Australia Japan USA UK | USA UK Australia Japan |
FOOTBALL BasketBall basketball Hockey football | Football basketball hockey |
data have;
input string :$200.;
infile datalines dlm=',';
datalines;
apple orange kiwi apple grapes strawberry peach kiwi peach
USA UK Australia Japan USA UK
FOOTBALL BasketBall basketball Hockey football
;
data want(keep=string newstring);
set have;
newstring=scan(string, 1, ' ');
do i=2 to countw(string,' ');
word=scan(string, i, ' ');
found=find(newstring, word, 'it');
if found=0 then newstring=catx(' ', newstring, word);
end;
run;
proc print data=want;
run;
This gives you
data have;
input x $80.;
cards;
apple orange kiwi apple grapes strawberry peach kiwi peach
USA UK Australia Japan USA UK
FOOTBALL BasketBall basketball Hockey football
;
run;
data want;
set have;
array a{999} $ 100;
n=0;
do i=1 to countw(x,' ');
temp=lowcase(scan(x,i,' '));
if temp not in a then do; n+1;a{n}=temp;end;
end;
want=catx(' ',of a{*});
drop n i a: temp;
run;
data have;
input string :$200.;
infile datalines dlm=',';
datalines;
apple orange kiwi apple grapes strawberry peach kiwi peach
USA UK Australia Japan USA UK
FOOTBALL BasketBall basketball Hockey football
;
data want(keep=string newstring);
set have;
newstring=scan(string, 1, ' ');
do i=2 to countw(string,' ');
word=scan(string, i, ' ');
found=find(newstring, word, 'it');
if found=0 then newstring=catx(' ', newstring, word);
end;
run;
proc print data=want;
run;
This gives you
I think you should use FINDW(), would be better.
The below code is sometimes excluding the substring with a minor difference in the spelling . Tried with both FIND and FINDW but it is not excluding the exact matching word. In the below example Japan and Japanese are two different words but when using find function JAPAN is getting excluded because of JAPANESE , similarly in the last row when both FATESTCD and FATEST are present only FATESTCD is present , FATEST is getting excluded. Only the exact word should get removed ,but its not happening here.Any suggestions/updates on this.
How can we try this using PRXMATCH,?
data have;
input string :$200.;
infile datalines dlm=',';
datalines;
apple orange kiwi apple grapes strawberry peach kiwi peach
China USA UK Australia Japanese USA UK Australian Japan Chinase
FOOTBALL BasketBall basketball Hockey football
FACAT FATESTCD FATEST FAOBJ STDT STDTC VISIT VISITNUM
;
data want(keep=string newstring);
set have;
newstring=scan(string, 1, ' ');
do i=2 to countw(string,' ');
word=scan(string, i, ' ');
found=find(newstring, word, 'it');
*found=findw(newstring, word, 'it');;
if found=0 then newstring=catx(' ', newstring, word);
end;
run;
With perl regular expression matching:
data have;
input str $80.;
cards;
apple orange kiwi apple grapes strawberry peach kiwi peach
USA UK Australia Japan USA UK
FOOTBALL BasketBall basketball Hockey football
;
data want;
if not prxId then prxId + prxParse("s/\b(\w{2,})\b(.*)\b(\1\s*)\b/\1\2/io");
set have;
wantStr = str;
do i = 1 to 100 until (times=0);
call prxChange(prxId,1,wantStr,wantStr,len,trunc,times);
end;
drop i prxId len trunc times;
run;
proc print; run;
Obs str 1 apple orange kiwi apple grapes strawberry peach kiwi peach 2 USA UK Australia Japan USA UK 3 FOOTBALL BasketBall basketball Hockey football Obs wantStr 1 apple orange kiwi grapes strawberry peach 2 USA UK Australia Japan 3 FOOTBALL BasketBall Hockey
Thank you all for your responses, gained some new approach of programming.
You asked an interesting question and got some interesting answers.
You can help others further by marking one of the provided answers as the solution (not this one).
Regards,
Amir.
I am hoping someone can help with this. I have a similar problem and the code posted her worked but I have over 20 variables with this sort of repeated strings. Do I have to do this over and over for all 20+ variables? Does anyone know any shorter way to do this for multiple variables at once? Can we use this in an array to achieve the same result for many variables?
@dee7 Welcome to SAS Communities. I recommend you ask a new question, supplying the data you have and the data you want in the form of SAS data steps using the datalines statement to supply the data in each case. More people will see the question and so it should get answered more quickly and with a wider range of solutions.
I expect an answer could involve using one of the dictionary tables and macro variables.
Feel free to provide a link to this question also, if desired, but all necessary information should ideally be in your question.
Regards,
Amir.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.