Solved: Re: unify strings to 'UK'?

Alexxxxxxx · Posted 03-27-2019 11:03 AM

Dear all,

How can I unify the strings below to 'UK'?

data have ;
  infile datalines truncover;
  input name $100.;
  datalines;
ABC (U.K.)
ABC (U K )
ABC [U.K.]
ABCU. K.  /*this one should not be unified*/
ABC {U. K.}
ABC 'U K'
AB C U K
;
run;

I expect to get

ABC (UK)
ABC (UK)
ABC [UK]
ABCU. K.  /*this one should not be unified*/
ABC {UK}
ABC 'UK'
AB C UK

Could you please give me some suggestions about this?

Thanks in advance.

andreas_lds · Posted 03-27-2019 11:15 AM

Please post what you expect as output for each observation.

Alexxxxxxx · Posted 03-27-2019 11:31 AM

thanks

I expect to get

ABC (UK)
ABC (UK)
ABC [UK]
ABCU. K.  /*this one should not be unified*/
ABC {UK}
ABC 'UK'
AB C UK

gamotte · Posted 03-27-2019 11:25 AM

Hello,

You might have to adapt a bit depending on the specific behavior you want.

data want;
set have;
unif=prxchange("s/\bU\b[. ]+\bK\b[. ]+/UK/i",-1,name);
run;

Edit: I slightly modified the regular expression after your reply to @andreas_lds.

Alexxxxxxx · Posted 03-27-2019 12:30 PM

Dear @gamotte

Thanks for your code, but this string is not processed:

ABC 'U K'

Also, could you point me to a manual that explains the meaning of the following code?

s/\bU\b[. ]+\bK\b[. ]+/UK/i

What is this kind of code called?

gamotte · Posted 03-27-2019 12:54 PM

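Search for "perl regexp" on Google. This is a Perl regular expression, written here as a substitution: s/pattern/replacement/flags.

\b means a word boundary.

[. ] means a space or a period.

[. ]+ means one or more spaces/periods.

U and K are the letters U and K. The trailing i makes the match case-insensitive.

In ABC 'U K', K is followed by ' rather than by a space or a period, so the final [. ]+ cannot match and the regexp fails.

You can drop the final [. ]+: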

s/\bU\b[. ]+\bK\b/UK/i
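
To see how the patterns differ, a minimal sketch can run them side by side on a few rows from the original post. The data set names `sample` and `compare` and the variables `v_trailing`, `v_dropped`, and `v_optional` are made up for illustration. The third pattern, which makes the trailing class optional with `[. ]*`, is an assumption of this sketch and not part of the posted answer. It is included because dropping the trailing class entirely leaves a period that follows K in place.

```sas
/* three rows from the original post, enough to show the differences */
data sample;
  infile datalines truncover;
  input name $100.;
  datalines;
ABC (U.K.)
ABC 'U K'
AB C U K
;
run;

data compare;
  set sample;
  length v_trailing v_dropped v_optional $100;
  /* first pattern: requires at least one space/period after K */
  v_trailing = prxchange("s/\bU\b[. ]+\bK\b[. ]+/UK/i", -1, name);
  /* trailing class dropped: matches ABC 'U K' but keeps a period after K */
  v_dropped  = prxchange("s/\bU\b[. ]+\bK\b/UK/i", -1, name);
  /* hypothetical variant: trailing spaces/periods are optional */
  v_optional = prxchange("s/\bU\b[. ]+\bK\b[. ]*/UK/i", -1, name);
run;

proc print data=compare noobs;
run;
```

On these rows, `v_trailing` leaves ABC 'U K' unchanged, `v_dropped` turns ABC (U.K.) into ABC (UK.), and `v_optional` gives ABC (UK), ABC 'UK', and AB C UK. Note that the optional form also swallows a space between K and a following word, so it may need adapting for other data.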

ChrisNZ · Posted 03-27-2019 06:35 PM

I would run this to cover more cases:

UNIF=prxchange("s/ [^\w\d]*U[^\w\d]+K[^\w\d]*$/ UK/i",1,NAME);

This cleans:

space

followed by optional non-alphanumerics

followed by U

followed by non-alphanumerics

followed by K

followed by optional non-alphanumerics

then end of string

The final i makes this case-insensitive, which may not be what you want.
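
A small sketch of this pattern on a few rows from the original post may help when deciding whether it fits. Two points are worth checking. First, `\w` already includes digits, so `[^\w\d]` matches the same characters as `\W`; the shorter spelling in `unif_short` is an assumption of this sketch, not part of the post. Second, the punctuation around U and K is part of the match, so it is replaced along with the letters. The names `sample2`, `check`, `unif_short`, and `same` are hypothetical.

```sas
data sample2;
  infile datalines truncover;
  input name $100.;
  datalines;
ABC (U.K.)
ABCU. K.
ABC 'U K'
;
run;

data check;
  set sample2;
  length unif unif_short $100;
  /* pattern as posted */
  unif       = prxchange("s/ [^\w\d]*U[^\w\d]+K[^\w\d]*$/ UK/i", 1, name);
  /* same pattern with [^\w\d] written as \W */
  unif_short = prxchange("s/ \W*U\W+K\W*$/ UK/i", 1, name);
  /* 1 when both spellings give the same result */
  same = (unif = unif_short);
run;

proc print data=check noobs;
run;
```

On these rows, ABC (U.K.) and ABC 'U K' both become ABC UK, without the brackets or quotes, and ABCU. K. is left unchanged. That differs from the expected output in the original post, which keeps the surrounding punctuation.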

Ksharp · Posted 03-28-2019 10:15 AM

data have ;
  infile datalines truncover;
  input name $100.;
  datalines;
ABC (U.K.)
ABC (U K )
ABC [U.K.]
ABCU. K.  /*this one should not be unified*/
ABC {U. K.}
ABC 'U K'
AB C U K
;
run;

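data want;
 set have;
 /* pattern: U at a word start, one or more non-word characters, K at a word end */
 pid=prxparse('/\bu\W+k\b/i');
 /* p = start of the first match (0 if none), l = its length */
 call prxsubstr(pid,name,p,l);
 if p>0 then do;
  /* also cover one space or period that follows K */
  if substr(name,p+l,1) in (' ' '.') then l=l+1;
  /* overwrite the matched span in place with 'UK' */
  substr(name,p,l)= 'UK';
 end;
run;
proc print;run;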
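
The same matching rule can also be written as a single PRXCHANGE call, sketched below as an alternative and not as the code posted above. SUBSTR on the left side of an assignment overwrites a fixed number of characters in place, whereas PRXCHANGE shortens the string when the replacement is shorter than the match. The optional `[. ]?` stands in for the "one following space or period" check; the names `sample3`, `want2`, and `unif` are hypothetical.

```sas
data sample3;
  infile datalines truncover;
  input name $100.;
  datalines;
ABC (U K )
ABCU. K.
ABC {U. K.}
ABC 'U K'
;
run;

data want2;
  set sample3;
  length unif $100;
  /* first match only: U, non-word characters, K, then at most one space or period */
  unif = prxchange("s/\bu\W+k\b[. ]?/UK/i", 1, name);
run;

proc print data=want2 noobs;
run;
```

On these rows the result is ABC (UK), ABCU. K. (unchanged), ABC {UK}, and ABC 'UK'.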
