Problem with accented letters in the prx matching functions

Frequent Contributor
Posts: 148

I have PRX functions that I am using to validate strings. Here are a couple of examples:

 

/^[A-ZÑ&]{3}\d{6}[A-Z0-9]{3}$/

/^[A-ZÑ‘&]{3}\d{6}[A-Z0-9]{3}$/

/^[A-ZÃÑ‘&]{3}\d{6}[A-Z0-9]{3}$/

 

The difference is that sometimes only à is allowed in addition to the upper case letters, sometimes only Ñ, and sometimes both ÃÑ. I can easily handle those variations. It appears the prxmatch function is counting the accented characters as two characters. For example, the following string:

 

LÑL17010ZZZ

 

fails (i.e., returns a 0) when using prxmatch. However, if I change the {3} to {3,4}, it returns a 1. So my speculation is that it sees LÑL as four characters. I have run other tests and combinations that seem to confirm this speculation.
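
One way to check this speculation is to compare the byte count with the character count of the leading letters. This is a minimal sketch, assuming the SAS session runs with a UTF-8 encoding (the thread does not state the session encoding):

```sas
/* Assumption: session encoding is UTF-8; the thread does not confirm this. */
data _null_;
  s = 'LÑL';
  n_bytes = length(s);    /* LENGTH counts bytes */
  n_chars = klength(s);   /* KLENGTH counts characters */
  put n_bytes= n_chars=;
  put s= $hex8.;          /* shows the underlying bytes of the value */
run;
```

If the session is UTF-8, Ñ is stored as the two bytes C3 91, so the byte count exceeds the character count by one, which would be consistent with {3} failing and {3,4} succeeding.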

So my question is whether there is a way to specify the à and the Ñ characters in the pattern so they are treated as a single character.

 

And note that it is only these two accented letters that are allowed.

 

TIA

 


Accepted Solutions
Solution
‎01-05-2017 11:17 AM
Respected Advisor
Posts: 3,907

Re: Problem with accented letters in the prx matching functions

So far, the SAS implementation of regular expressions (the prx... functions) only supports single-byte character sets.

 

Here is the list of which string functions support which character sets (SBCS, DBCS, MBCS):

http://support.sas.com/documentation/cdl/en/nlsref/69741/HTML/default/viewer.htm#p1pca7vwjjwucin178l...

 

 

 



All Replies
Frequent Contributor
Posts: 148

Re: Problem with accented letters in the prx matching functions

I meant to add that I did try specifying these two accented characters in the pattern in hex, e.g.:

 

/^[\xD1\xC3A-Z&]{3}\d{6}[A-Z0-9]{3}$/

 

that also failed.
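
A possible reason the hex form fails, assuming a UTF-8 session: \xD1 and \xC3 are the single-byte (Latin-1) codes for Ñ and Ã, whereas UTF-8 stores these letters as the byte pairs C3 91 and C3 83. Because the PRX functions work byte by byte, one hedged sketch is to spell out each byte pair as its own alternative, so that it counts as a single repetition. The sample value below is hypothetical and this has not been confirmed by anyone in the thread:

```sas
/* Assumption: UTF-8 session, where Ñ = C3 91 and à = C3 83. */
data _null_;
  length s $ 20;
  s = 'LÑL170101ZZZ';   /* hypothetical value: 3 letters, 6 digits, 3 alphanumerics */
  /* Each alternative (one ASCII byte or one two-byte sequence) counts once toward {3} */
  rc = prxmatch('/^(?:[A-Z&]|\xC3\x91|\xC3\x83){3}\d{6}[A-Z0-9]{3}$/', strip(s));
  put rc=;
run;
```

In a single-byte (Latin-1) session, the original \xD1\xC3 character class would be the appropriate form instead.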

Trusted Advisor
Posts: 1,405

Re: Problem with accented letters in the prx matching functions

If ignoring accented letters is an option, then see this solution:

 

https://communities.sas.com/t5/Base-SAS-Programming/How-to-ignore-accented-text/td-p/140883


Frequent Contributor
Posts: 148

Re: Problem with accented letters in the prx matching functions

Thanks Patrick, that is what I was afraid the answer would be.

 

And to answer Shmuel's question, ignoring them is not an option.

It is only these characters, and accepting these characters is a key requirement for a large application.

 

So I am going to look into a workaround for this (I have a few ideas).

Trusted Advisor
Posts: 1,405

Re: Problem with accented letters in the prx matching functions

One more try:

 

I'm not familiar with prxmatch, but can you use hexadecimal in the expression?

If YES then 

1) use tranwrd to replace the accented letter with a non-printable hex value (like 'FA'x)

     to create a temporary variable

2) use prxmatch with the hex expression to validate the temporary variable
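
The two steps above could look roughly like the following. This is a sketch under assumptions: the placeholder bytes 'FA'x and 'FB'x are arbitrary choices that are assumed never to occur in the real data, and the variable names and sample value are hypothetical.

```sas
data _null_;
  length value tmp $ 20;
  value = 'LÑL170101ZZZ';                 /* hypothetical input */
  /* Step 1: swap each accented letter for a single placeholder byte */
  tmp = tranwrd(value, 'Ñ', 'FA'x);
  tmp = tranwrd(tmp,   'Ã', 'FB'x);
  /* Step 2: validate the temporary variable, naming the placeholders in hex */
  rc = prxmatch('/^[A-Z\xFA\xFB&]{3}\d{6}[A-Z0-9]{3}$/', strip(tmp));
  put rc=;
run;
```

Since tranwrd replaces the whole byte sequence of each letter, the temporary value holds one byte per letter regardless of how the session encodes Ñ and Ã.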

 

Frequent Contributor
Posts: 148

Re: Problem with accented letters in the prx matching functions

Thanks Shmuel,

 

That is exactly the approach that I was considering. This is part of a large application and we have parameterized most of the checks, including the PRX checks.

 

At the point in the process where the prxmatch function is used, I have a data set that has the values to be validated along with the pattern. So I will add logic to detect if there are accented A or accented N characters in the string to be validated. If so, I will add code to convert them to lower case a and n in both the data value and the pattern - by using temp variables. The lower case letters work for this since another part of the requirements is that only upper case characters are allowed for any text/string variable.
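
A rough sketch of that plan, for illustration only: the data set names (checks, results) and variable names are hypothetical, since the thread does not show the real ones. The same substitution is applied to the value and to its pattern, so the pattern's character class ends up containing a and n in place of à and Ñ.

```sas
/* Hypothetical data set and variable names; the thread does not show the real ones. */
data checks;
  length value $ 20 pattern $ 60;
  value   = 'LÑL170101ZZZ';
  pattern = '/^[A-ZÃÑ&]{3}\d{6}[A-Z0-9]{3}$/';
run;

data results;
  set checks;
  length tmp_value $ 20 tmp_pattern $ 60;
  /* Lower case a/n stand in for Ã/Ñ in both the value and the pattern */
  tmp_value   = tranwrd(tranwrd(value,   'Ã', 'a'), 'Ñ', 'n');
  tmp_pattern = tranwrd(tranwrd(pattern, 'Ã', 'a'), 'Ñ', 'n');
  /* 1 when the converted value matches the converted pattern, else 0 */
  valid = prxmatch(strip(tmp_pattern), strip(tmp_value)) > 0;
  drop tmp_:;
run;
```

This relies on the separate upper-case-only check mentioned above: without it, a value containing a genuine lower case a or n would also pass the converted pattern.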

 

So thanks for suggesting this approach as it gives me a bit more confidence that I am going down the right path.

☑ This topic is solved.
