raivester
Quartz | Level 8

Hi All,

 

I am currently working my way through someone else's code and have come across lines that identify whether a given string is present in two different ways:

 

psp=(prxmatch('/\bPSP\b/',charge)>0 | index(charge,'P.S.P.')>0);
punsh=(index(charge,"PUNISH")>0);

The use of index makes sense to me, but I am not sure what is going on with prxmatch, particularly the slashes and the use of '\b'. Can someone shed some light on this?

5 REPLIES
novinosrin
Tourmaline | Level 20
psp=(prxmatch('/\bPSP\b/',charge)>0 | index(charge,'P.S.P.')>0);

is a Boolean expression that evaluates to 0 or 1. The pipe | is the OR operator, so there are two separate conditions. If PSP is found in charge as a whole word (\b is the regex metacharacter for a word boundary), the first condition is true. If the string P.S.P. is found anywhere in charge, the second condition is true. If at least one of the two is true, the whole expression is true (1); otherwise it is false (0). HTH

 

Also, the >0 is not strictly required when the two function calls are joined by an OR operator: any nonzero position counts as true, so the expression evaluates to 1 or 0 no matter where in charge the match is found. Therefore,

psp=(prxmatch('/\bPSP\b/',charge) | index(charge,'P.S.P.'));

will suffice. However, for reading ease, it's probably good to have >0. 

raivester
Quartz | Level 8

Thanks for the reply. I am still having trouble understanding the specifics of the /\b . . . \b/ notation. Also, a 0 or 1 is not necessarily returned, though, right? Since index and prxmatch return a position in the string?

novinosrin
Tourmaline | Level 20

"I am still having trouble understanding the specifics of the /\b . . . \b/ notation." The slashes are the delimiters that mark the start and end of the regular expression. \b is a regex metacharacter that matches a word boundary, so wrapping PSP in \b ... \b matches it only as a whole word rather than as part of a longer string. If you are new to regex, it's a bit of a learning curve. For example,

Look at the 7th record in the result of the test below and compare it with the 1st or 3rd one:


data have;
 input charge $20.;
 cards;
 bjbj PSP
 P.S.P
 dbsfjbwe PSP h
 v dhw P.S.P
 P.S.P. BJHB
 huhbk
 PSPppppppp
;
data want;
 set have;
 psp=(prxmatch('/\bPSP\b/',charge) | index(charge,'P.S.P.'));
run;
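
To isolate what each piece of the pattern contributes, here is a minimal sketch using made-up test strings (they are not from the original data). It compares the pattern with and without \b.

```sas
data _null_;
  /* The slashes only delimit the pattern; they are not matched themselves. */
  no_boundary   = prxmatch('/PSP/',     'TOPSPIN');   /* PSP anywhere in the string */
  with_boundary = prxmatch('/\bPSP\b/', 'TOPSPIN');   /* PSP only as a whole word   */
  whole_word    = prxmatch('/\bPSP\b/', 'THE PSP.');  /* blank before, period after */
  put no_boundary= with_boundary= whole_word=;
run;
```

The log should show no_boundary=3 with_boundary=0 whole_word=5: without \b the match is found inside TOPSPIN, with \b it is not, and a blank or a period counts as a word boundary.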

"Since index and prxmatch return a position in the string?" Yes, that is what those functions return on their own. However, when they are used in a Boolean expression with AND/OR/NOT, it is the result of that expression (1 or 0) that is assigned to the variable on the left-hand side.

 

My apologies if I am not explaining well enough. I am useless when it comes to this.  Requesting @Tom / @PaigeMiller  or Mr Kolmogorov @FreelanceReinh  their time for a more neat explanation. Thank you gentlemen in advance.

 

FreelanceReinh
Jade | Level 19

Hi @raivester,

 

As @novinosrin has already mentioned, the "\b" is a PRX metacharacter and represents a "word boundary," i.e., unlike index(charge,'PSP') the PRXMATCH criterion would not be satisfied if variable charge contained text like "TOPSPIN." A naive suggestion to find "PSP" only as a word may be to use index(charge,' PSP '). But this would fail in various circumstances, e.g., for texts in variable charge like

PSP
The Party of SAS Programmers (PSP) won the election.
The majority voted for the PSP.

However, for more specialized tools such as the FINDW function or PRXMATCH using the "\b" metacharacter these (and similar) examples are no problem because they recognize the beginning and the end of a string, parentheses and punctuation marks as word delimiters.
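
As a rough sketch, the three approaches can be compared on the example texts above. The variable names naive, fw and prx are only illustrative, and FINDW is called with its default delimiters.

```sas
data _null_;
  input charge $60.;
  naive = index(charge, ' PSP ');         /* requires a blank on both sides  */
  fw    = findw(charge, 'PSP');           /* FINDW with default delimiters   */
  prx   = prxmatch('/\bPSP\b/', charge);  /* regex word boundaries           */
  put naive= fw= prx= charge=;
  datalines;
PSP
The Party of SAS Programmers (PSP) won the election.
The majority voted for the PSP.
;
run;
```

For all three records naive should be 0, whereas fw and prx should both be positive (the position where the word PSP starts).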

 

A Boolean expression equals 1 for TRUE and 0 for FALSE, e.g., if it is assigned to a numeric variable.

 

Not in your code, but in @novinosrin's simplified expression another fact comes into play: Numeric values can be used as Boolean expressions and are evaluated as FALSE (i.e. 0) if they are 0 or missing. In all other cases (i.e., positive or negative numbers) they are evaluated as TRUE (i.e. 1).

 

Another example (SAS log):

402  data _null_;
403  if (4 | 0) = 1 and (4 & 5) = 1 and (. & 5) = 0 then put 'OK';
404  run;

OK

Note, however, that a single numeric expression would not be interpreted as a Boolean expression: In your second example (punsh=...) the ">0" cannot be removed (in general) without changing the values assigned to punsh. Without the ">0", values (character positions) >1 would be possible.
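
A small sketch of that last point, using a made-up value for charge:

```sas
data _null_;
  charge   = 'CRUEL PUNISHMENT';
  position = index(charge, 'PUNISH');        /* character position of the substring */
  punsh    = (index(charge, 'PUNISH') > 0);  /* the comparison yields 1 or 0        */
  put position= punsh=;
run;
```

The log should show position=7 punsh=1, so dropping the ">0" here would assign 7 rather than 1.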

ChrisNZ
Tourmaline | Level 20

For such a simple regular expression, and if maintenance is an issue (as seems to be the case), you can probably use the FINDW function with similar results.

\b identifies a word boundary, just like FINDW looks for word delimiters. If your program is old enough, maybe FINDW was unavailable when it was written.

So in your case, they are probably interchangeable, and switching to FINDW may make you more comfortable with that code.
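
A possible rewrite, as a sketch only: it assumes the have data set posted earlier in this thread, and want_findw is just an illustrative name. FINDW is called with its default delimiters, which are not exactly the same set of characters that \b treats as word boundaries, so check the results against your real data before swapping it in.

```sas
data want_findw;
  set have;
  /* FINDW returns the position of the word, or 0 if it is not found */
  psp = (findw(charge, 'PSP') > 0 | index(charge, 'P.S.P.') > 0);
run;
```

On the test data above this should flag the same records as the PRXMATCH version.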

An alternative is to use this opportunity to familiarise yourself with regular expressions. They look ugly but are invaluable when manipulating text.

 

 
