I need to validate data values before generating an XML file. The XSD schema specifies patterns for a number of the fields, but they are not standard Perl regexes. XSD patterns are apparently a limited subset, and the syntax is slightly different.
If you plug an XSD pattern into prxmatch, it rejects the syntax.
I did find a reference that said you could take an XSD pattern and prefix it with a ^ and suffix it with a $ to convert it to a Perl regex. But prxparse rejects that too. If you remove the $, it is rejected because it wants a closing ^. Adding a ^ as both the first and last character results in (for the examples I have checked) a valid pattern, but it then seems to accept values that are not valid.
So does anybody have any advice on how to convert an XSD pattern to something that can be used in SAS?
TIA.
Seems like all you need to add is a pair of delimiters:
\d{5} --> /\d{5}/ will match a substring of 5 digits
or
\d{5} --> /^\d{5}$/ will match if the whole string is 5 digits
Give examples of XSD patterns.
Meant to include a representative set. Thanks for the reminder. Here is a subset from just one of the XSD files.
[A-Z\d\._'\-]+@[A-Z\d_'\-]+\.[A-Z\d\._'\-]+
[A-Z]{2}
[A-Z]{4}\d{6}[MH][A-Z]{5}[0-9]{2}
[A-ZÑ ]{1,200}
[A-ZÑ&]{3,4}\d{6}[A-Z0-9]{3}
[A-ZÑ&]{3}\d{6}[A-Z0-9]{3}
[A-ZÑ&]{4}\d{6}[A-Z0-9]{3}
[A-ZÑ0-9]{1,14}
[A-ZÑ\d #\-\.&,_@'()]{1,254}
[A-ZÑ\d \-\.,':/$]{1,3000}
[A-ZÑ\d \-\.,:/]{1,100}
[A-ZÑ\d \-_\.&,'#@]{1,200}
\d{1,14}\.\d{2}
\d{1,2}
\d{4}[0|1]\d{1}
\d{4}\-\d{1,9}
\d{5}
Found this site that discusses the differences. I tried the suggestion to prefix the pattern with a ^ and suffix it with a $. But that did not create an expression that prxparse accepted.
I also found a few sites that decode the pattern into a description, from which I could presumably create a valid Perl regex. But given how many of these I have to create, I would prefer to avoid that approach if at all possible.
Very helpful. Thx. So let me first acknowledge that to say I am a novice on regex patterns would give me too much credit.
So, am I correct in assuming that prefixing with /^ and suffixing with $/ will check for an exact match? So, for example, 12345x will fail because it is not an exact match?
Right! "12345 " will not match either because of the trailing spaces, but trim("12345 ") will match.
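A minimal sketch of that check in a DATA step, in case it helps. The variable names (xsd, rx, value, ok) are just illustrative. The (?: ) group around the pattern is my addition: it makes ^ and $ apply to the whole pattern even if it contains a top-level | alternation.

```sas
data _null_;
  length xsd $200 value $20;
  xsd = '\d{5}';                      /* XSD pattern as written in the schema */

  /* Add delimiters and anchors; (?:...) keeps ^ and $ bound to the whole pattern */
  rx = prxparse(cats('/^(?:', xsd, ')$/'));

  do value = '12345', '12345x', '1234';
    /* strip() drops the trailing blanks that SAS pads character values with */
    ok = prxmatch(rx, strip(value)) > 0;
    put value= ok=;
  end;

  call prxfree(rx);                   /* release the compiled pattern */
run;
```

With these three test values, only 12345 should come back with ok=1.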
Thanks. I had already thought about that and was doing a strip of the string.
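For a long list of patterns, the wrapping can be applied in bulk rather than by hand. The sketch below is only an outline under some assumptions: the data set names (patterns, checked) and variables are made up, the last sample pattern is invented to exercise the slash handling, and I am assuming that an unescaped / inside a pattern would otherwise be read as the closing delimiter, so it is escaped with tranwrd first. As I understand the documentation, prxparse returns a missing value when a pattern fails to compile, which is what the compiles flag relies on.

```sas
/* Hypothetical input: one XSD pattern per row */
data patterns;
  infile datalines truncover;
  input xsd $200.;
  datalines;
[A-Z]{2}
\d{4}\-\d{1,9}
\d{1,14}\.\d{2}
\d{2}/\d{4}
;
run;

data checked;
  set patterns;
  length prx $220;
  /* Escape / (assumed to clash with the delimiter), then add delimiters and anchors */
  prx = cats('/^(?:', tranwrd(xsd, '/', '\/'), ')$/');
  rx = prxparse(prx);
  compiles = not missing(rx);         /* 0 flags a pattern that needs a manual look */
  if compiles then call prxfree(rx);
  drop rx;
run;
```

Any row with compiles=0 would still need converting by hand, but that should be a much shorter list than the full set.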
Note that if you want to match accented letters, the pattern
[A-ZÑ0-9]{1,14}
can be extended using a POSIX character class
[[:upper:]0-9]{1,14}
if you need to catch other accented letters.
The letters matched depend on the encoding. For example, under wlatin1 the class matches most Western European accented capitals, such as Ñ (Spanish) or Ø (Danish and Norwegian).
Taken from http://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490
Thanks. For now this project is using UTF-8 encoding and we only need to support Spanish. But this is a good tip.