Hi All,
I am currently working my way through someone else's code and have come across lines that identify whether a given string is present in two different ways:
psp=(prxmatch('/\bPSP\b/',charge)>0 | index(charge,'P.S.P.')>0);
punsh=(index(charge,"PUNISH")>0);
The use of index makes sense to me, but I am not sure what is going on with prxmatch--particularly the slashes and the uses of 'b'. Can someone shed some light on this?
psp=(prxmatch('/\bPSP\b/',charge)>0 | index(charge,'P.S.P.')>0);
is a Boolean expression that evaluates to 0 or 1. The pipe | is the OR operator, so there are two separate conditions. If the word PSP is found in charge (\b is the escape metacharacter for a word boundary), the first condition is true. If the string P.S.P. is found in charge, the second is true. If at least one of them is true, the whole expression is true, i.e. 1; otherwise it is false, i.e. 0. HTH
Also, >0 is not really required in a Boolean expression with an OR operator, because the true/false evaluation results in 1 or 0 anyway, even when the string or word is found beyond position 1. Therefore,
psp=(prxmatch('/\bPSP\b/',charge) | index(charge,'P.S.P.'));
will suffice. However, for ease of reading, it's probably good to keep the >0.
Thanks for the reply. I am still having trouble understanding the specifics of the /\b . . . \b/ notation. Also, a 0 or 1 is not necessarily returned, though, right? Since index and prxmatch return a position in the string?
"I am still having trouble understanding the specifics of the /\b . . . \b/ notation."- This is Regex metacharacter used to encapsulate a string to make that a word , rather than a string. If you are new to ReGeX, it's a bit of a learning curve. For example,
Look at the 7th record in the result of the below test and compare with 1st or 3rd one-
data have;
 input charge $20.;
 cards;
 bjbj PSP
 P.S.P
 dbsfjbwe PSP h
 v dhw P.S.P
 P.S.P. BJHB
 huhbk
 PSPppppppp
;
data want;
 set have;
 psp=(prxmatch('/\bPSP\b/',charge) | index(charge,'P.S.P.'));
run;
"Since index and prxmatch return a position in the string?"- Yes, however that's the definition of those functions though when used in an "boolean expression that uses AND/OR/NOT" the result of the expression is what is assigned to the assignment variable on the left hand side.
My apologies if I am not explaining this well enough. I am useless when it comes to this. Requesting @Tom / @PaigeMiller or Mr Kolmogorov @FreelanceReinh to give their time for a neater explanation. Thank you, gentlemen, in advance.
Hi @raivester,
As @novinosrin has already mentioned, the "\b" is a PRX metacharacter and represents a "word boundary," i.e., unlike index(charge,'PSP') the PRXMATCH criterion would not be satisfied if variable charge contained text like "TOPSPIN." A naive suggestion to find "PSP" only as a word may be to use index(charge,' PSP '). But this would fail in various circumstances, e.g., for texts in variable charge like
PSP
The Party of SAS Programmers (PSP) won the election.
The majority voted for the PSP.
However, for more specialized tools such as the FINDW function or PRXMATCH using the "\b" metacharacter these (and similar) examples are no problem because they recognize the beginning and the end of a string, parentheses and punctuation marks as word delimiters.
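The comparison above can be tried out with a short test. This is only a sketch: the variable names idx_naive, idx_plain, fw and prx are mine, and FINDW is called with its default delimiters.

```sas
data _null_;
   length charge $60;
   do charge = 'PSP',
               'The Party of SAS Programmers (PSP) won the election.',
               'The majority voted for the PSP.',
               'TOPSPIN';
      idx_naive = index(charge, ' PSP ') > 0;          /* needs a blank on both sides */
      idx_plain = index(charge, 'PSP') > 0;            /* plain substring search */
      fw        = findw(charge, 'PSP') > 0;            /* whole word, default delimiters */
      prx       = prxmatch('/\bPSP\b/', charge) > 0;   /* whole word via word boundaries */
      put charge= idx_naive= idx_plain= fw= prx=;
   end;
run;
```

I would expect idx_naive to be 0 for all four values and idx_plain to be 1 for all four (including TOPSPIN), while fw and prx should be 1 for the first three texts and 0 for TOPSPIN.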
A Boolean expression equals 1 for TRUE and 0 for FALSE, e.g., if it is assigned to a numeric variable.
Not in your code, but in @novinosrin's simplified expression another fact comes into play: Numeric values can be used as Boolean expressions and are evaluated as FALSE (i.e. 0) if they are 0 or missing. In all other cases (i.e., positive or negative numbers) they are evaluated as TRUE (i.e. 1).
Another example (the PUT statement writes OK to the log):
data _null_;
if (4 | 0) = 1 and (4 & 5) = 1 and (. & 5) = 0 then put 'OK';
run;
Note, however, that a single numeric expression would not be interpreted as a Boolean expression: In your second example (punsh=...) the ">0" cannot be removed (in general) without changing the values assigned to punsh. Without the ">0", values (character positions) >1 would be possible.
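A quick way to see this for the punsh assignment is the sketch below. The charge value is a made-up example, and punsh_flag and punsh_pos are illustrative names.

```sas
data _null_;
   charge     = 'CRUEL PUNISHMENT';             /* hypothetical example value */
   punsh_flag = (index(charge, 'PUNISH') > 0);  /* comparison, so the result is 0 or 1 */
   punsh_pos  = (index(charge, 'PUNISH'));      /* no comparison, so the position is kept */
   put punsh_flag= punsh_pos=;
run;
```

Here the log should show punsh_flag=1 punsh_pos=7, because the parentheses alone do not turn the position into a Boolean value.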
For such a simple regular expression, and if maintenance is an issue (as seems to be the case), you can probably use the FINDW function with similar results.
\b identifies a word boundary, just like FINDW looks for word delimiters. If your program is old enough, maybe FINDW was unavailable when it was written.
So in your case, the two are probably interchangeable, and switching to FINDW might make you more comfortable with that code.
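If you want to check that before switching, something along these lines could be used. It is a sketch with a few made-up charge values; in practice you would run the comparison on your real data, since the default FINDW delimiter list is not identical to what \b treats as a non-word character, so edge cases may differ.

```sas
data have;
   input charge $40.;                  /* hypothetical test values */
   datalines;
POSS OF PSP
TOPSPIN
P.S.P. POSSESSION
PSP, 2ND OFFENSE
;

data compare;
   set have;
   /* original logic */
   psp_prx = (prxmatch('/\bPSP\b/', charge) > 0 | index(charge, 'P.S.P.') > 0);
   /* same logic with FINDW in place of PRXMATCH */
   psp_fw  = (findw(charge, 'PSP') > 0 | index(charge, 'P.S.P.') > 0);
   agree   = (psp_prx = psp_fw);       /* 1 when both versions give the same flag */
run;

proc print data=compare;
run;
```

For these four values both flags should agree on every row; any row with agree=0 in your own data would show where the two approaches part ways.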
An alternative is to use the opportunity to familiarise yourself with regular expressions. They look ugly but are invaluable when manipulating text.
