I’m faced with the challenge to modify over 100K concatenated strings of the type:
“Los Lunas/Corona/Corona/White Mountain/White Mountain"
and it should be noted that some source strings may contain 2,3,or 4 each "Corona" or "White Mountain" sub-strings.
The desired string would reduce the number of instances of “White Mountain” and “Corona” to one each, and should look like this:
“Los Lunas/Corona/White Mountain”
I’ve attached the code I used to accomplish this (with a hat tip to blogs by Leonid Batkhan) that shows all the steps. But surely there must be a simpler, more efficient, more elegant way to achieve the desired result.
Thanks in advance for any helpful suggestions or insights.
Happy Holidays!
Gene
data want;
concat_location="Los Lunas/Corona/Corona/White Mountain/White Mountain";
Ccount=count(concat_location,"Corona");
if Ccount>1 then do
/*remove extra "Corona"s */
mod1= prxchange('s/Corona//',Ccount-1,concat_location);
/* translate // to blanks */
mod2=translate(mod1,' ','/');
/* compress blanks */
mod3=compbl(mod2);
/* translate blanks to '/' */
mod4=translate(mod3,'/',' ');
/* Trim leading '/' */
P1=findc(mod4,'/','K');
mod5=substr(mod4,P1);
/* Trim trailing '/' */
P2=findc(mod5,'/','K',-length(mod5));
mod6=substr(mod5,1,P2);
end;
if find(mod6,'White/Mountain')then
mod6=tranwrd(mod6,"White/Mountain","White Mountain");
Wcount=count(mod6,"White Mountain");
if Wcount>1 then do
/* remove extra "White Mountain"s */
mod7= prxchange('s/White Mountain//',Wcount-1,mod6);
/* translate '//'' to blanks */
mod8=translate(mod7,' ','/');
/* compress blanks */
mod9=compbl(mod8);
/* translate blanks to '/' */
mod10=translate(mod9,'/',' ');
/* Trim leading '/' */
P1=findc(mod10,'/','K');
mod11=substr(mod10,P1);
/* Trim trailing '/' */
P2=findc(mod11,'/','K',-length(mod5));
mod12=substr(mod11,1,P2);
end;
if find(mod12,"White/Mountain") then
mod12=tranwrd(mod12,"White/Mountain","White Mountain");
if find(mod12,"Los/Lunas") then
want=tranwrd(mod12,"Los/Lunas","Los Lunas");
run;
Just make a NEW list.
data have;
old="Los Lunas/Corona/Corona/White Mountain/White Mountain";
run;
data want ;
set have;
* trick to force data step to define NEW as same length as OLD ;
new=old;
* Prime the pump with the first word ;
new=scan(old,1,'/');
* when next word is not in NEW then append it ;
do i=2 to countw(old);
if not findw(new,scan(old,i,'/'),'/','it') then
new=catx('/',new,scan(old,i,'/'))
;
end;
drop i;
run;
Results
90 data _null_; 91 set want; 92 put (_all_) (=/); 93 run; old=Los Lunas/Corona/Corona/White Mountain/White Mountain new=Los Lunas/Corona/White Mountain NOTE: There were 1 observations read from the data set WORK.WANT.
How about this?
data want;
keep word;
concat_location = "Los Lunas/Corona/Corona/White Mountain/White Mountain";
do i = 1 to countw(concat_location,"/");
word = scan(concat_location, i, "/");
output;
end;
run;
proc sort data = want nodupkey;
by word;
run;
data want;
keep concat_location_dedup;
set want end = eof;
retain concat_location_dedup;
length concat_location_dedup $ 1024;
concat_location_dedup = catx("/", concat_location_dedup, word);
if eof;
run;
Below a possible approach.
data have;
infile datalines truncover;
input have_string $200.;
datalines;
Los Lunas/Corona/Corona/White Mountain/White Mountain
;
data want(drop=_:);
if _n_=1 then
do;
length _term $50;
dcl hash h1();
h1.defineKey('_term');
h1.defineDone();
end;
set have;
if 0 then want_string=have_string;
do _i=1 to countc(have_string,'/')+1;
_term=scan(have_string,_i,'/');
if h1.check() ne 0 then
do;
want_string=catx('/',want_string,_term);
_rc=h1.add();
end;
end;
_rc=h1.clear();
run;
proc print data=want;
run;
Just make a NEW list.
data have;
old="Los Lunas/Corona/Corona/White Mountain/White Mountain";
run;
data want ;
set have;
* trick to force data step to define NEW as same length as OLD ;
new=old;
* Prime the pump with the first word ;
new=scan(old,1,'/');
* when next word is not in NEW then append it ;
do i=2 to countw(old);
if not findw(new,scan(old,i,'/'),'/','it') then
new=catx('/',new,scan(old,i,'/'))
;
end;
drop i;
run;
Results
90 data _null_; 91 set want; 92 put (_all_) (=/); 93 run; old=Los Lunas/Corona/Corona/White Mountain/White Mountain new=Los Lunas/Corona/White Mountain NOTE: There were 1 observations read from the data set WORK.WANT.
Once again, after struggling for hours over what seemed an intractable puzzle, all I have to do is post a question here, and three better solutions appear in short order. I'll mark Tom's as the Accepted because it seems to be the cleverest and also because the ony thing I know about hash is off-topic.
Thanks and Happy Holidays to all!
Gene
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and save with the early bird rate—just $795!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.