Dear all,
How can I unify the strings below to 'UK'?
data have ;
infile datalines truncover;
input name $100.;
datalines;
ABC (U.K.)
ABC (U K )
ABC [U.K.]
ABCU. K. /*this one should not be unified*/
ABC {U. K.}
ABC 'U K'
AB C U K
;
run;
I expect to get
ABC (UK)
ABC (UK)
ABC [UK]
ABCU. K. /*this one should not be unified*/
ABC {UK}
ABC 'UK'
AB C UK
Could you please give me some suggestions?
Thanks in advance.
Please post what you expect as output for each observation.
thanks
I expect to get
ABC (UK)
ABC (UK)
ABC [UK]
ABCU. K. /*this one should not be unified*/
ABC {UK}
ABC 'UK'
AB C UK
Hello,
You might have to adapt a bit depending on the specific behavior you want.
data want;
set have;
unif=prxchange("s/\bU\b[. ]+\bK\b[. ]+/UK/i",-1,name);
run;
Edit: I slightly modified the regular expression after your reply to @andreas_lds
Dear @gamotte
Thanks for your code,
but the string
ABC 'U K'
is not processed.
Also, could you please point me to a manual that explains the meaning of the following code
s/\bU\b[. ]+\bK\b[. ]+/UK/i
and tell me what this kind of code is called?
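This is a Perl regular expression (regexp); look for "perl regexp" on Google.
Here \b means a word boundary.
[. ] means a space or a period
[. ]+ means one or more spaces/periods
U and K are the letters U and K
In ABC 'U K', K is followed by ' rather than a space or period, so the final [. ]+ fails and the string does not match the regexp.
You can make the final part optional by changing [. ]+ to [. ]*, which still removes a trailing period as in (U.K.):
s/\bU\b[. ]+\bK\b[. ]*/UK/i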
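As a minimal sketch of how such a pattern could be tried in a data step, assuming the have data set from the question (want_sketch and unif are illustrative names, not from the thread): the variant below makes the trailing [. ] class optional with *, so that a closing period or space after K is consumed when present but is not required.

```sas
/* Sketch only: want_sketch and unif are hypothetical names */
data want_sketch;
  set have;
  length unif $100;
  /* \bU\b   : U as a whole word (so ABCU. K. is skipped)     */
  /* [. ]+   : one or more periods/spaces between U and K     */
  /* \bK\b   : K as a whole word                              */
  /* [. ]*   : zero or more trailing periods/spaces           */
  /* -1      : replace every occurrence in the string         */
  unif = prxchange('s/\bU\b[. ]+\bK\b[. ]*/UK/i', -1, name);
run;

proc print data=want_sketch;
run;
```

Because the trailing class is optional, a value such as ABC 'U K' should now be rewritten, while ABCU. K. should stay unchanged since there is no word boundary between C and U.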
I would run this to cover more cases:
UNIF=prxchange("s/ [^\w\d]*U[^\w\d]+K[^\w\d]*$/ UK/i",1,NAME);
This cleans:
space
followed by optional non-alphanumerics
followed by U
followed by non-alphanumerics
followed by K
followed by optional non-alphanumerics
then end of string
The final i makes this case-insensitive, which may not be what you want.
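To see how this end-anchored pattern differs from a word-boundary pattern, both can be applied side by side. This is a rough sketch assuming the have data set from the question; compare, unif_wb and unif_end are illustrative names, and the word-boundary pattern shown is one possible variant rather than the only choice.

```sas
/* Sketch only: compare, unif_wb and unif_end are hypothetical names */
data compare;
  set have;
  length unif_wb unif_end $100;
  /* word-boundary variant: touches only U, K and the periods/spaces around them */
  unif_wb  = prxchange('s/\bU\b[. ]+\bK\b[. ]*/UK/i', -1, name);
  /* end-anchored variant: also consumes any non-alphanumerics around U and K */
  unif_end = prxchange('s/ [^\w\d]*U[^\w\d]+K[^\w\d]*$/ UK/i', 1, name);
run;

proc print data=compare;
run;
```

Note that the end-anchored pattern also matches the surrounding brackets and quotes, so it replaces them along with the letters; if values such as (UK) and [UK] must keep their delimiters, the word-boundary form is the closer fit.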
data have ;
infile datalines truncover;
input name $100.;
datalines;
ABC (U.K.)
ABC (U K )
ABC [U.K.]
ABCU. K. /*this one should not be unified*/
ABC {U. K.}
ABC 'U K'
AB C U K
;
run;
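data want;
set have;
/* compile the pattern: U, one or more non-word characters, then K (case-insensitive) */
pid=prxparse('/\bu\W+k\b/i');
/* p = start position of the match (0 if none), l = length of the match */
call prxsubstr(pid,name,p,l);
if p>0 then do;
/* also cover a single space or period that follows the K */
if substr(name,p+l,1) in (' ' '.') then l=l+1;
/* overwrite the matched span with 'UK'; SUBSTR on the left side pads with blanks to length l */
substr(name,p,l)= 'UK';
end;
run;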
proc print;run;