Remove duplicates from a text

Super Contributor
Posts: 336

Hello!

I would like to remove the duplicate parts of a character variable:

Data Have;
  Input N $63.;
  Datalines;
208_01_460_03_461_02_469_01_46x_02_461_02
208_01_460_03_461_02_469_01_46x_03_460_03_461_02_469_01_461_02
208_01_460_03_461_02_469_02_46x_01_461_02                    
208_01_460_03_461_02_469_02_46x_02_461_02                    
208_01_460_03_461_02_469_02_46x_03_460_03_461_02_469_02_461_02
;
Run;

The desired output is shown below. Sorting is not required, but ideally the first occurrence of each entry stays and every later identical entry is removed:


208_01_460_03_461_02_469_01_46x_02 -> remove 1x 461_02
208_01_460_03_461_02_469_01_46x_03 -> remove 460_03, 461_02 (2x!), 469_01
208_01_460_03_461_02_469_02_46x_01 -> remove 461_02, etc.

Could somebody please help?


Accepted Solutions
Solution
‎10-29-2014 08:11 AM
Grand Advisor
Posts: 9,576

Re: Remove duplicates from a text

Is the token 460_03, rather than 460 or 03?



Data Have;
  Input N $63.;
  Datalines;
208_01_460_03_461_02_469_01_46x_02_461_02
208_01_460_03_461_02_469_01_46x_03_460_03_461_02_469_01_461_02
208_01_460_03_461_02_469_02_46x_01_461_02                     
208_01_460_03_461_02_469_02_46x_02_461_02                     
208_01_460_03_461_02_469_02_46x_03_460_03_461_02_469_02_461_02
;
Run;
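data want;
 set have;
 length new token $ 100;
 /* each token is a pair of words (e.g. 460_03), so step through the words two at a time */
 do i=1 to countw(n,'_') by 2;
  token=catx('_',scan(n,i,'_'),scan(n,i+1,'_'));
  put token=;  /* debugging aid: writes each token to the log */
  /* append the token only if it is not already in new; 't' trims trailing blanks */
  if not find(new,token,'t') then new=catx('_',new,token);
 end;
run;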
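
A possible variation, offered only as a sketch: FIND does a plain substring search, which is safe here because every token has the same 3+2 character layout. If tokens could vary in length, one token might be found inside another (for example 60_03 inside 460_03). The step below keeps a second list of accepted tokens separated by '|' and compares delimiter-wrapped strings, so only whole tokens match. The names want_exact and seen are made up for this illustration, and it assumes '|' never occurs in N.

```sas
data want_exact;
  set have;
  length new seen $ 100 token $ 20;
  /* walk the string two words at a time, as in the step above */
  do i = 1 to countw(n, '_') by 2;
    token = catx('_', scan(n, i, '_'), scan(n, i + 1, '_'));
    /* wrap both the list and the token in '|' so only whole tokens can match */
    if not find(cats('|', seen, '|'), cats('|', token, '|')) then do;
      new  = catx('_', new, token);   /* de-duplicated result */
      seen = catx('|', seen, token);  /* helper list used only for the lookup */
    end;
  end;
  drop i token seen;
run;
```

For the sample data this should give the same NEW values as the step above; the difference only matters when tokens have differing lengths.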

Xia Keshan



All Replies
Esteemed Advisor
Posts: 7,203

Re: Remove duplicates from a text

Perhaps something along the lines of:

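data want;
     set have;
     length new_val $ 100;
     new_val=scan(n,1,'_'); /* the first word cannot be a duplicate */
     i=2;
     do until (scan(n,i,'_')="");
          del=0;
          /* compare word i with every earlier word (not with itself) */
          do j=i-1 to 1 by -1;
               if scan(n,i,'_')=scan(n,j,'_') then del=1;
          end;
          if del=0 then new_val=catx('_',new_val,scan(n,i,'_'));
          i=i+1;
     end;
     /* note: this treats each single word (460, 03) as a token, not each pair (460_03) */
run;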

Note: not tested, as I am leaving now.

Super Contributor
Posts: 336

Re: Remove duplicates from a text

Many thanks (yes, the token is 460_03)! Works perfectly!
