BookmarkSubscribeRSS Feed

How do I use regular expressions in SAS®? Q&A, slides, and on-demand recording

Started ‎01-25-2022 by
Modified ‎02-25-2022 by
Views 5,686

Watch this Ask the Expert session to learn the universal pattern language for regular expressions, the SAS specific syntax, and the pros and cons of this tool

 

Watch the webinar

 

You will learn:

  • The regular expression language (pattern matching).
  • How and when to use regular expressions within SAS.
  • Pros and cons of regular expressions versus other SAS functions.

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

Q&A

What if the data has no apparent patterns? What do you think is the best way to deal with free texts?

The regular expression is really about patterns, so if you don't have any patterns, then it really becomes another question and not one of regular expression per say. It really becomes more of a word mining or data mining issue than a regular expression one. Maybe if you can't identify a pattern that describes the whole string, maybe you can identify, for example that in a given free form text, you always have a 30-character long word somewhere that means something. And, you can have garbage before and garbage after. It would still be a match; it would still be a pattern. It just means that it would be kind of a loose pattern, but there could still be value to extract some components from.

 

Is there any character class specifically for non-printing classes other than making your own list (and would it recognize hex for things like tabs, etc.)?

You have to make your own. And I believe tab is \t, but it could also \s, the space, might also catch tabs, but it's something you could easily test at regex101.com.

 

Does it behave differently from SAS 9.4 to SAS Viya?

It shouldn't, but that actually makes me think about something I didn't cover. If you look at the two-page SAS tip sheet, you're going to see some more arguments to functions, and I believe those might be taken out the version in Viya. It should be the same syntax, but I've seen different documentation for Viya than the one in the SAS tip sheet.

 

Can you use Python Expressions in the same manner as these Perl expressions?

Yes, you should be able to. I'm pretty sure Python's regular expressions are the same syntax as Perl regular expressions and therefore SAS regular expressions.

 

Does \s* means 0 or any number of spaces?

Exactly. So, let's say I have a phone number field which has 14 numbers, and there's a way of filling all 14 spaces using parentheses for the error code and everything. So, in this situation, the very first character of that string would be the opening parentheses, and the very last character of the string would be the last digit of the line number. So, in this case, if we use a regular expression that matches a phone number and we have \s* at the end before the $, it will still match this because there's actually no space after the last character. But, if you have another in that same variable, you have another phone number where there's no parentheses and it's just dashes. So, you have two extra spaces that are free that are going to be at the end of your string. And, that's where it becomes important to match with that, while allowing any number of spaces at the end of your regular expression. And if it doesn't, if it still doesn't tell you why, just remember that \s* at the end of your string and when you're going to test regular expressions and you're going to bound that regular expression with the ^ and $, at some point, you're probably are going to be missing some matches that you feel like they should be matching. And then, you're going to understand what that character role plays.

 

Do you have suggestions on how to debug these effectively? Those regexp are very cryptic and complex, which seemingly would be hard to find bugs.

Debugging errors is fine. Let's say you have a regular expression that doesn't match anything. Use regex101.com to test these regular expressions. When you have one that matches, then it should work with SAS. If it doesn't, maybe you're missing a \s* at the end, so you might not be accounting for extra spaces at the end. The worst problem you can have is cases where you're not strict enough and you're actually replacing more stuff than you'd like to replace or, in a find statement, you're actually catching things that you should not be catching. So, that's the biggest danger I would say to regular expressions. Make sure you're as strict as you can, unless you really want to pick up anything, even if it's garbage and in some cases you're going to want to do this. But, if you're doing something important and you want to, for example, extract first name and last name, you really want to make sure you're good on it, and if you're not sure about your match, you just leave it like it is.

 

What did you mention about including the statement 'drop re_:' ? Is that to remove the variable defining the regular expression?

No, actually that trick does not have anything to do with regular expressions, I've just used it there. Whenever you want to put a list of variables, such as in a drop statement where you want to drop several variables, there's a wild card you can use, and that's column. So, if you do drop space then some variable prefix column, it's going to drop from your output dataset all of the variables that started with that prefix. So, that's why I've used drop re_: column because I always name my regular expressions tags. I always call them re_, because I know I have no valid variable that is called re_.

 

Is Pro Reg Expression similar to Regex in SQL?

I am not familiar with these.  Sounds like a proprietary function or group of functions that are supported by a specific database (outside of SAS).  Could very well be similar to Perl Regex but I can’t say for sure.

 

Is SAS Regular Expression syntax very similar to others, such as Postgres, SQL Server, etc.?

It will be really like any other language that uses Perl regular expressions language.  Could defer if that other platform uses a different language.

 

How does telling something what you don't want .... telling or describing what you do want?

Not sure I understood the question right.  Matching patterns is all about looking for some key components and excluding others.  You could very likely explicit lists with square brackets and explicit exclude lists in the same regular expression depending on your specific needs.  You really should always tackle these problems by starting with a specific example just like I did with the phone number.  Then, while building that regular expression slowly, you’ll see what the best approach for you is (exclude lists for some characters, explicit lists or both).

 

How about telephone numbers that have letters? In the 60's and 70's I remember my parents phone number as; 914-EL6-6518?

For the exchange, instead of \d{3}, you could use [\dA-Z]{3} (assuming that the letter-number will always be upper case and can range from A to Z).

 

I had a character variable with leading blanks that I wanted to match text at the start of the string. I thought \A\W*?\w would work, but that did not. I ended up having to use the strip() function first on the variable. Is there a metacharacter that would have worked with the unstripped variable?

If you wish to match something with a regular expression while ignoring any number of spaces at the beginning, your regular expression should start with something like that: ^\s* .

The “^” sign outside of a list means that this is the start of your string and the \s* will match any number of leading spaces.  You can then add the rest of your regular expression.  For example, if you want to find strings of 10 numbers, your regular expression could be: /^\s*\d{10}/.

 

Is it possible that in the future that SAS will provide a way for us to enter the pattern of phone # such as 1-000-111-2222 and get script of regular expression code (in a reverse way) or a pattern of regular expression that you may use for modification?

I would doubt it because it becomes a question of regular expression at that point and not a SAS specific question.  Also, that would be a problem because given a specific expression, it is not clear what restrictions needs to be considered (example: dashes only for delimiters or not).

 

When you use the substr function on the left side, you need to tell it what the length of the replacement string is (and make sure the length for the new variable is long enough). Is there a need to do this with Perl regular expressions (i.e. specify the length of the replacement string)?

Yes, in my examples, I have set the length of my new variables with a length statement.  If you don’t, it most likely either picks the same length as the variable on which you are getting the information from or sets it to a default value.  I haven’t tested that as I always like to be explicit on the length of my new variables.

 

What if I need to capture a grouping with parenthesis surrounding each value in a Phone number For example:  (903) -(567) -(3456) ?

You can match for parentheses in regular expressions.  You simply have to use the escape character.  So for this example, if you 3 groups of numbers always have these parentheses, then you could use (\(\d{3}\)) for the area code instead of (\d{3}) that I used in my examples.

 

Does SAS indicate if a regular expression is not properly compiled - i.e. you did not form it correctly?

It should generate an error if your regular expression is not ok.  Keep in mind that if you make a mistake with the regular expression and it is still a valid expression, it won’t generate an error, you simply won’t have the matches that you are expecting.

 

For the second line in the table, I think it's better non-matching group: ^(?:nurse|doctor)*\s*(\w*),\s(\w*)\s* and replace $2 $1. Do you agree?

I am not familiar with that syntax but a quick test confirmed to me that it wouldn’t work.  The problem is that you could have “nursedoctor” as a title and it would go through.  In some cases, you might be ok with possible invalid matches but if you have a strict context, you will most likely want to leave the original data as is when the information might be invalid.

 

Instead of ending the regular expression with \s* to deal with the right padded spaces, I like to use the strip() function around the string to make sure that I only have the string to deal with. Would you prefer the use of \s*?

I like the \s* better but both are probably pretty much the same from a performance standpoint.  It then becomes really a question of personal taste at that point.

 

The greedy regex you were saying can be used where you can use "where" clauses in PROC SQL, do I have this correct?

No, the greedy vs non greedy have nothing to do with being able to use a regular expression (through a prxmatch statement) in a where clause.  You can use prxmatch in where clauses, you just have to put an explicit regular expression in the call as you cannot have a prxparse executed before (this is not a data step).

 

Which would be the best book to learn lookahead and lookbehing regexps?

I don’t know of regular expressions book.  There are obviously some out there but all that I learned, I’ve learned by trying things and looking at examples online.

 

Does compiling "on the fly" mean compiling for each row? Do you recommend using the prxparse with retain to avoid compiling for each row?

Really good question.  SAS possibly does some time saving steps behind the scenes.  It is possible that when we use prxparse with an explicit regex that it does still compile it only once.  However, as you might refer to a given regex more than once in a DATA Step, you should always use prxparse to ensure that you only write it down one time.  It will be easier to maintain as well.

 

What if there is more than one match after prxmatch function, what will be the output?

If you have more than one match, the function will only return the first match.  As the returned value is the position of the match, you could derive a substring of your original string from which you could run the match again (if you want to check for the same pattern farther in the string).

 

Recommended Resources

SAS 9 Regular Expressions Tip Sheet

Regular Expressions 101

SAS 9.4/Viya3.5 Documentation

 

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.  

Version history
Last update:
‎02-25-2022 08:32 AM
Updated by:

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Article Tags