Regular expression for Unicode letters

Contributor
Posts: 35

Regular expression for Unicode letters

I want to match all English and international letters, such as áñéúáóíÁÑçâèôïéêëààñÉ, but I don't want to match the underscore or anything else that regex considers part of a "word" (\w).

I tried \p{L}, as suggested in the following article, but it doesn't work in SAS 9.2:

regex - How to match the international alphabet (English a-z, + non English) with a regular expressi...



All Replies
Respected Advisor
Posts: 3,799

Re: Regular expression for Unicode letters

Would ANYALPHA help?

data _null_;
   a='0áñéúáóíÁÑçâèôïéêëààñÉ';
   /* position of the first alphabetic character, or 0 if there is none */
   b=anyalpha(a);
   put _all_;
run;

a=0áñéúáóíÁÑçâèôïéêëààñÉ b=2 _ERROR_=0 _N_=1
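ANYALPHA reports where the first letter is. For checking that a string contains nothing but letters, its counterpart NOTALPHA may be closer to what is needed. The following is only a sketch: it assumes a single-byte session encoding (such as Latin1) in which the accented characters are treated as alphabetic, as the ANYALPHA result above suggests, and the two test strings and the length of 50 are chosen purely for illustration.

```sas
data _null_;
   length a $50;
   do a = 'áñéúáóíÁÑçâèôïéêëààñÉ', '0áñéúáóíÁÑçâèôïéêëààñÉ';
      /* strip() drops trailing blanks, which would otherwise count as non-letters */
      b = notalpha(strip(a));
      put a= b=;
   end;
run;
```

Here b should be 0 when every character is a letter, and otherwise the position of the first character that is not.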

Trusted Advisor
Posts: 1,301

Re: Regular expression for Unicode letters

Posted in reply to data_null__

I am not sure from the way you asked your question, but I think you want to remove all non-alphabetic characters from a string that contains both English and non-English letters (such as French, Spanish, or German, i.e. non-DBCS languages). \p is not a valid metacharacter in SAS, even though it is in Perl. Nothing directly equivalent comes to mind; however, I believe [[:alpha:]] will work for your needs. This could also be accomplished with COMPRESS.

data _null_;
   a='0áñéúáóí-ÁÑçâè _ôïéêëààñÉ';
   /* delete every character that is not in the POSIX alpha class */
   b=prxchange('s/[[:^alpha:]]//o',-1,a);
   /* 'k' keeps the listed characters; 'a' adds alphabetic characters to the list */
   c=compress(a,,'ka');
   put (a--c) (=/);
run;

a=0áñéúáóí-ÁÑçâè _ôïéêëààñÉ
b=áñéúáóíÁÑçâèôïéêëààñÉ
c=áñéúáóíÁÑçâèôïéêëààñÉ

If you do care about DBCS languages, look into KCOMPRESS.

Contributor
Posts: 35

Re: Regular expression for Unicode letters

Thank you for the replies, FriedEgg and Ksharp, but I don't want to change any characters. I'm validating that a person's name contains only valid characters (explained in more detail in post #4).

Contributor
Posts: 35

Re: Regular expression for Unicode letters

Posted in reply to data_null__

Thank you for the reply, data_null_. ANYALPHA is interesting but not helpful here. To be more specific, I want to make sure a person's name (as it is recorded) contains only valid characters: alphabetic characters (including international characters), space, period, and hyphen. For example:

Good: Martha Jones-Smith

Invalid: Martha Jones=Smith

Good: Robert Smith Jr.

Invalid: Robert Smith Jr. (Bob)

Invalid: Robert Smith Jr,

Solution
‎05-09-2012 12:12 PM
Respected Advisor
Posts: 3,799

Re: Regular expression for Unicode letters

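data _null_;
   infile cards dsd dlm=':';
   input x $ name:$26.;
   /* remove space, period, hyphen and (modifier 'A') all letters; anything left over is invalid */
   valid = not lengthn(compress(name,' .-','A'));
   put (_all_)(=);
   /* the log does not echo data lines; those below are reconstructed from the output */
   cards;
Good:Martha Jones-Smith
Invalid:Martha Jones=Smith
Good:Robert Smith Jr.
Invalid:Robert Smith Jr. (Bob)
Invalid:Robert Smith Jr,
Invalid:0áñéúáóíÁÑçâèôïéêëààñÉ
Good:áñéúáóíÁÑçâèôïéêëààñÉ
;

x=Good name=Martha Jones-Smith valid=1
x=Invalid name=Martha Jones=Smith valid=0
x=Good name=Robert Smith Jr. valid=1
x=Invalid name=Robert Smith Jr. (Bob) valid=0
x=Invalid name=Robert Smith Jr, valid=0
x=Invalid name=0áñéúáóíÁÑçâèôïéêëààñÉ valid=0
x=Good name=áñéúáóíÁÑçâèôïéêëààñÉ valid=1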
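For readers who would rather stay with a regular expression, the same rule can probably be written with PRXMATCH and the [[:alpha:]] class suggested earlier in the thread. This is a sketch under assumptions, not a tested replacement for the COMPRESS approach: it assumes a single-byte session encoding in which [[:alpha:]] matches accented letters (as the PRXCHANGE output earlier in the thread suggests), and the three sample names are taken from the examples above purely for illustration.

```sas
data _null_;
   length name $26;
   do name = 'Martha Jones-Smith', 'Martha Jones=Smith', 'Robert Smith Jr.';
      /* valid=1 only when the whole value consists of letters, spaces, periods and hyphens */
      /* strip() removes the trailing blanks that pad the variable to its full length       */
      valid = prxmatch('/^[[:alpha:] .-]+$/o', strip(name)) > 0;
      put name= valid=;
   end;
run;
```

Unlike the COMPRESS version, this pattern requires at least one character, so a completely blank name would be flagged as invalid; change the + to * if blank names should pass.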

Super User
Posts: 10,041

Re: Regular expression for Unicode letters

I would use KSUBSTR() to pull out one character at a time and compare it with the values you don't want. Some dummy code:

/* loop over the characters by position so that embedded blanks do not end the scan */
do i=1 to klength(name);
   _temp=ksubstr(name,i,1);
   /* the list here is only a placeholder */
   if _temp not in ('_' '0' '1' 'a' 'b') then put 'Found:' _temp;
end;

But I think this is just a beginning. This problem has annoyed me for a long time.

Ksharp
