Solved: Re: Sub-String Theory: Reducing duplicate substrings to single substri...

genemroz

I’m faced with the challenge to modify over 100K concatenated strings of the type:

“Los Lunas/Corona/Corona/White Mountain/White Mountain"

and it should be noted that some source strings may contain 2,3,or 4 each "Corona" or "White Mountain" sub-strings.

The desired string would reduce the number of instances of “White Mountain” and “Corona” to one each, and should look like this:

“Los Lunas/Corona/White Mountain”

I’ve attached the code I used to accomplish this (with a hat tip to blogs by Leonid Batkhan) that shows all the steps. But surely there must be a simpler, more efficient, more elegant way to achieve the desired result.

Thanks in advance for any helpful suggestions or insights.

Happy Holidays!

Gene

data want;

concat_location="Los Lunas/Corona/Corona/White Mountain/White Mountain";

Ccount=count(concat_location,"Corona");
if Ccount>1 then do
/*remove extra "Corona"s  */
mod1= prxchange('s/Corona//',Ccount-1,concat_location); 
/* translate // to blanks */
mod2=translate(mod1,' ','/');
/* compress blanks */
mod3=compbl(mod2);
/* translate blanks to '/' */
mod4=translate(mod3,'/',' ');
/* Trim leading '/' */
P1=findc(mod4,'/','K');
mod5=substr(mod4,P1);
/* Trim trailing '/' */
P2=findc(mod5,'/','K',-length(mod5));
mod6=substr(mod5,1,P2);
end;

if find(mod6,'White/Mountain')then
mod6=tranwrd(mod6,"White/Mountain","White Mountain");

Wcount=count(mod6,"White Mountain");
if Wcount>1 then do
/* remove extra "White Mountain"s */
mod7= prxchange('s/White Mountain//',Wcount-1,mod6); 
/* translate '//'' to blanks */
mod8=translate(mod7,' ','/');
/* compress blanks */
mod9=compbl(mod8);
/* translate blanks to '/' */
mod10=translate(mod9,'/',' ');
/* Trim leading '/' */
P1=findc(mod10,'/','K');
mod11=substr(mod10,P1);
/* Trim trailing '/' */
P2=findc(mod11,'/','K',-length(mod5));
mod12=substr(mod11,1,P2);
end;

if find(mod12,"White/Mountain") then
mod12=tranwrd(mod12,"White/Mountain","White Mountain");
if find(mod12,"Los/Lunas") then
want=tranwrd(mod12,"Los/Lunas","Los Lunas");
run;

Tom

Just make a NEW list.

data have;
 old="Los Lunas/Corona/Corona/White Mountain/White Mountain";
run;

data want ;
  set have;
* trick to force data step to define NEW as same length as OLD ;  
  new=old;
* Prime the pump with the first word ;
  new=scan(old,1,'/');
* when next word is not in NEW then append it ;
  do i=2 to countw(old);
    if not findw(new,scan(old,i,'/'),'/','it') then 
       new=catx('/',new,scan(old,i,'/'))
    ;
  end;
  drop i;
run;

Results

 90         data _null_;
 91           set want;
 92           put (_all_) (=/);
 93         run;
 
 old=Los Lunas/Corona/Corona/White Mountain/White Mountain
 new=Los Lunas/Corona/White Mountain
 NOTE: There were 1 observations read from the data set WORK.WANT.

View solution in original post

SASKiwi

How about this?

data want;
  keep word;
  concat_location = "Los Lunas/Corona/Corona/White Mountain/White Mountain";
  do i = 1 to countw(concat_location,"/");
    word = scan(concat_location, i, "/");
    output;
  end;
run;

proc sort data = want nodupkey;
  by word;
run;

data want;
  keep concat_location_dedup;
  set want end = eof;
  retain concat_location_dedup;
  length concat_location_dedup $ 1024;
  concat_location_dedup = catx("/", concat_location_dedup, word);
  if eof;
run;

Patrick

Below a possible approach.

data have;
  infile datalines truncover;
  input have_string $200.;
  datalines;
Los Lunas/Corona/Corona/White Mountain/White Mountain
;

data want(drop=_:);
  if _n_=1 then
    do;
      length _term $50;
      dcl hash h1();
      h1.defineKey('_term');
      h1.defineDone();
      
    end;
    
  set have;
  
  if 0 then want_string=have_string;
    
  do _i=1 to countc(have_string,'/')+1;
    _term=scan(have_string,_i,'/');
    if h1.check() ne 0 then
      do;
        want_string=catx('/',want_string,_term);
        _rc=h1.add();
      end;
  end;
  
  _rc=h1.clear();
 run;
 
 proc print data=want;
 run;

Tom

Just make a NEW list.

data have;
 old="Los Lunas/Corona/Corona/White Mountain/White Mountain";
run;

data want ;
  set have;
* trick to force data step to define NEW as same length as OLD ;  
  new=old;
* Prime the pump with the first word ;
  new=scan(old,1,'/');
* when next word is not in NEW then append it ;
  do i=2 to countw(old);
    if not findw(new,scan(old,i,'/'),'/','it') then 
       new=catx('/',new,scan(old,i,'/'))
    ;
  end;
  drop i;
run;

Results

 90         data _null_;
 91           set want;
 92           put (_all_) (=/);
 93         run;
 
 old=Los Lunas/Corona/Corona/White Mountain/White Mountain
 new=Los Lunas/Corona/White Mountain
 NOTE: There were 1 observations read from the data set WORK.WANT.

genemroz

Once again, after struggling for hours over what seemed an intractable puzzle, all I have to do is post a question here, and three better solutions appear in short order. I'll mark Tom's as the Accepted because it seems to be the cleverest and also because the ony thing I know about hash is off-topic.

Thanks and Happy Holidays to all!

Gene

Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

SAS Innovate 2026 Registration is Open

Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

Re: Sub-String Theory: Reducing duplicate substrings to single substrings

SAS Innovate 2026 Registration is Open

SAS Training: Just a Click Away