Alexxxxxxx
Pyrite | Level 9

Dear all,

 

How can I unify the strings below so that each variant becomes 'UK'?

 

data have ;
  infile datalines truncover;
  input name $100.;
  datalines;
ABC (U.K.)
ABC (U K )
ABC [U.K.]
ABCU. K.  /*this one should not be unified*/
ABC {U. K.}
ABC 'U K'
AB C U K
;
run;

I expect to get 

ABC (UK)
ABC (UK)
ABC [UK]
ABCU. K.  /*this one should not be unified*/
ABC {UK}
ABC 'UK'
AB C UK

 

Could you please give me some suggestions about this?

thanks in advance

andreas_lds
Jade | Level 19

Please post what you expect as output for each observation.

Alexxxxxxx
Pyrite | Level 9

thanks 

I expect to get 

ABC (UK)
ABC (UK)
ABC [UK]
ABCU. K.  /*this one should not be unified*/
ABC {UK}
ABC 'UK'
AB C UK

 

 

gamotte
Rhodochrosite | Level 12

Hello,

 

You might have to adapt a bit depending on the specific behavior you want.

 

data want;
set have;
unif=prxchange("s/\bU\b[. ]+\bK\b[. ]+/UK/i",-1,name);
run;

Edit: I slightly modified the regular expression after your reply to @andreas_lds 

Alexxxxxxx
Pyrite | Level 9

Dear @gamotte 

thanks for your code,

 

but

ABC 'U K'

is not being processed.

 

Besides, could you please point me to a manual that explains the meaning of the following code

s/\bU\b[. ]+\bK\b[. ]+/UK/i

and tell me what this kind of code is called?

gamotte
Rhodochrosite | Level 12

Search for "perl regexp" on Google.

 

Here \b means a word boundary.

[. ] means a space or a period.

[. ]+ means one or more spaces/periods.

U and K are the letters U and K.

 

In ABC 'U K', the K is followed by ', which is neither a space nor a period, so the final [. ]+ cannot match and the string is left unchanged.

 

You can drop the final [. ]+:

s/\bU\b[. ]+\bK\b/UK/i
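
As a rough sketch of how the shortened pattern behaves on the sample data, the step below applies it with PRXCHANGE next to a variant that keeps the trailing class but makes it optional ([. ]* instead of [. ]+). The variant is an assumption added for illustration, not part of the suggestion above. It is shown because dropping the class entirely leaves the closing period of U.K. in place, so ABC (U.K.) comes out as ABC (UK.). The names want_cmp, unif_short and unif_opt are made up for this example.

```sas
data want_cmp;
  set have;
  length unif_short unif_opt $100;
  /* pattern as suggested: no trailing character class */
  unif_short = prxchange('s/\bU\b[. ]+\bK\b/UK/i', -1, name);
  /* hypothetical variant: trailing spaces/periods are optional but still consumed */
  unif_opt   = prxchange('s/\bU\b[. ]+\bK\b[. ]*/UK/i', -1, name);
run;

proc print data=want_cmp;
run;
```

One thing to watch with any trailing class: it also swallows a blank that separates the K from a following word, so check the result if UK can appear in the middle of a name.
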
ChrisNZ
Tourmaline | Level 20

I would run this to cover more cases:

UNIF=prxchange("s/ [^\w\d]*U[^\w\d]+K[^\w\d]*$/ UK/i",1,NAME);

This cleans:

a space

followed by optional non-alphanumerics

followed by U

followed by one or more non-alphanumerics

followed by K

followed by optional non-alphanumerics

then end of string

The final i makes the match case-insensitive, which may not be what you want.
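
A minimal sketch of the pattern described above, run against the have data set from the question. Two points it is meant to show, both hedged: the class [^\w\d] should behave like \W here, because \d is already a subset of \w; and because the punctuation around U and K is part of the match, brackets and quotes are removed along with the periods (ABC (U.K.) becomes ABC UK rather than ABC (UK)). The names want_chris, unif_w and same are made up for this example.

```sas
data want_chris;
  set have;
  length unif unif_w $100;
  /* pattern as posted */
  unif   = prxchange('s/ [^\w\d]*U[^\w\d]+K[^\w\d]*$/ UK/i', 1, name);
  /* same pattern written with \W (assumed equivalent, since \w already covers digits) */
  unif_w = prxchange('s/ \W*U\W+K\W*$/ UK/i', 1, name);
  /* flag is 1 when both versions agree for the observation */
  same = (unif = unif_w);
run;

proc print data=want_chris;
run;
```

If the brackets or quotes need to survive, a pattern that matches only the U and K part is the safer choice.
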

 

 

Ksharp
Super User

data have ;
  infile datalines truncover;
  input name $100.;
  datalines;
ABC (U.K.)
ABC (U K )
ABC [U.K.]
ABCU. K.  /*this one should not be unified*/
ABC {U. K.}
ABC 'U K'
AB C U K
;
run;

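data want;
 set have;
 /* compile the pattern once and keep its id across observations */
 if _n_=1 then pid=prxparse('/\bu\W+k\b/i');
 retain pid;
 call prxsubstr(pid,name,p,l);
 if p>0 then do; 
 /* also take one trailing blank or period, e.g. the last "." of "U.K." */
 if substr(name,p+l,1) in (' ' '.') then l=l+1;
 /* rebuild the string: SUBSTR on the left of = would pad 'UK' with blanks to length l */
 name=cat(substrn(name,1,p-1),'UK',substrn(name,p+l));
 end;
run;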
proc print;run;
