I’m faced with the challenge to modify over 100K concatenated strings of the type:
“Los Lunas/Corona/Corona/White Mountain/White Mountain"
and it should be noted that some source strings may contain 2,3,or 4 each "Corona" or "White Mountain" sub-strings.
The desired string would reduce the number of instances of “White Mountain” and “Corona” to one each, and should look like this:
“Los Lunas/Corona/White Mountain”
I’ve attached the code I used to accomplish this (with a hat tip to blogs by Leonid Batkhan) that shows all the steps. But surely there must be a simpler, more efficient, more elegant way to achieve the desired result.
Thanks in advance for any helpful suggestions or insights.
Happy Holidays!
Gene
data want;
concat_location="Los Lunas/Corona/Corona/White Mountain/White Mountain";
Ccount=count(concat_location,"Corona");
if Ccount>1 then do
/*remove extra "Corona"s */
mod1= prxchange('s/Corona//',Ccount-1,concat_location);
/* translate // to blanks */
mod2=translate(mod1,' ','/');
/* compress blanks */
mod3=compbl(mod2);
/* translate blanks to '/' */
mod4=translate(mod3,'/',' ');
/* Trim leading '/' */
P1=findc(mod4,'/','K');
mod5=substr(mod4,P1);
/* Trim trailing '/' */
P2=findc(mod5,'/','K',-length(mod5));
mod6=substr(mod5,1,P2);
end;
if find(mod6,'White/Mountain')then
mod6=tranwrd(mod6,"White/Mountain","White Mountain");
Wcount=count(mod6,"White Mountain");
if Wcount>1 then do
/* remove extra "White Mountain"s */
mod7= prxchange('s/White Mountain//',Wcount-1,mod6);
/* translate '//'' to blanks */
mod8=translate(mod7,' ','/');
/* compress blanks */
mod9=compbl(mod8);
/* translate blanks to '/' */
mod10=translate(mod9,'/',' ');
/* Trim leading '/' */
P1=findc(mod10,'/','K');
mod11=substr(mod10,P1);
/* Trim trailing '/' */
P2=findc(mod11,'/','K',-length(mod5));
mod12=substr(mod11,1,P2);
end;
if find(mod12,"White/Mountain") then
mod12=tranwrd(mod12,"White/Mountain","White Mountain");
if find(mod12,"Los/Lunas") then
want=tranwrd(mod12,"Los/Lunas","Los Lunas");
run;
Just make a NEW list.
data have;
old="Los Lunas/Corona/Corona/White Mountain/White Mountain";
run;
data want ;
set have;
* trick to force data step to define NEW as same length as OLD ;
new=old;
* Prime the pump with the first word ;
new=scan(old,1,'/');
* when next word is not in NEW then append it ;
do i=2 to countw(old);
if not findw(new,scan(old,i,'/'),'/','it') then
new=catx('/',new,scan(old,i,'/'))
;
end;
drop i;
run;
Results
90 data _null_; 91 set want; 92 put (_all_) (=/); 93 run; old=Los Lunas/Corona/Corona/White Mountain/White Mountain new=Los Lunas/Corona/White Mountain NOTE: There were 1 observations read from the data set WORK.WANT.
How about this?
data want;
keep word;
concat_location = "Los Lunas/Corona/Corona/White Mountain/White Mountain";
do i = 1 to countw(concat_location,"/");
word = scan(concat_location, i, "/");
output;
end;
run;
proc sort data = want nodupkey;
by word;
run;
data want;
keep concat_location_dedup;
set want end = eof;
retain concat_location_dedup;
length concat_location_dedup $ 1024;
concat_location_dedup = catx("/", concat_location_dedup, word);
if eof;
run;
Below a possible approach.
data have;
infile datalines truncover;
input have_string $200.;
datalines;
Los Lunas/Corona/Corona/White Mountain/White Mountain
;
data want(drop=_:);
if _n_=1 then
do;
length _term $50;
dcl hash h1();
h1.defineKey('_term');
h1.defineDone();
end;
set have;
if 0 then want_string=have_string;
do _i=1 to countc(have_string,'/')+1;
_term=scan(have_string,_i,'/');
if h1.check() ne 0 then
do;
want_string=catx('/',want_string,_term);
_rc=h1.add();
end;
end;
_rc=h1.clear();
run;
proc print data=want;
run;
Just make a NEW list.
data have;
old="Los Lunas/Corona/Corona/White Mountain/White Mountain";
run;
data want ;
set have;
* trick to force data step to define NEW as same length as OLD ;
new=old;
* Prime the pump with the first word ;
new=scan(old,1,'/');
* when next word is not in NEW then append it ;
do i=2 to countw(old);
if not findw(new,scan(old,i,'/'),'/','it') then
new=catx('/',new,scan(old,i,'/'))
;
end;
drop i;
run;
Results
90 data _null_; 91 set want; 92 put (_all_) (=/); 93 run; old=Los Lunas/Corona/Corona/White Mountain/White Mountain new=Los Lunas/Corona/White Mountain NOTE: There were 1 observations read from the data set WORK.WANT.
Once again, after struggling for hours over what seemed an intractable puzzle, all I have to do is post a question here, and three better solutions appear in short order. I'll mark Tom's as the Accepted because it seems to be the cleverest and also because the ony thing I know about hash is off-topic.
Thanks and Happy Holidays to all!
Gene
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.