I'm searching through medical notes to capture all instances of a phrase, in particular 'carbapenemase producing'. At times this phrase can occur more than once in a string. I've been working with PRXNEXT, which I think is the most applicable function. As an example, take this string:
If amikacin results are needed, please notify
Microbiology Lab at ext.xxxx for further testing.
The organism will be held until x/xx/xx.
Presumptive Carbapenemase Producing CRE
See SPMI34 for Carba-R PCR Results
Not Confirmed Carbapenemase Producing CRE
From the comment above, I'd like to extract the phrases
Presumptive Carbapenemase Producing
and
Not Confirmed Carbapenemase Producing
The code I've been using is here, and still in development:
data chk_one;
    set a01;
    prx = prxparse('/((not confirmed\s*)?(ca[bepr]\w+ prod\w+))/');
    _start_inout = 1;
    do hitnum = 1 by 1 until (pos=0);
        call prxnext (prx, _start_inout, length(as_comments), as_comments, pos, len);
        if len then do;
            content = substr(as_comments,pos,len);
        end;
    end;
run;
I'm able to generate the 2nd phrase "Not Confirmed Carbapenemase Producing" but the 1st one is a work in progress. Any help/advice would be appreciated.
I would use:
'/(not confirmed|\w+)\s+carbapenemases? producing/i'
Notes: The ? makes the plural s optional. The i at the end makes the match case-insensitive.
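To show how that pattern could be plugged into a PRXNEXT loop, here is a minimal sketch. The input data set a01 and the variable as_comments come from the question; the sample text is a shortened copy of the example comment joined with spaces, and the output data set name hits and the variables start and stop are placeholders chosen for illustration.

```sas
/* Sample input, assumed: one row holding a shortened copy of the comment */
data a01;
    length as_comments $400;
    as_comments = catx(' ',
        'The organism will be held until x/xx/xx.',
        'Presumptive Carbapenemase Producing CRE',
        'See SPMI34 for Carba-R PCR Results',
        'Not Confirmed Carbapenemase Producing CRE');
run;

data hits (keep=hitnum content);
    set a01;
    length content $100;
    /* One leading word, or the two-word prefix "not confirmed" */
    prx = prxparse('/(not confirmed|\w+)\s+carbapenemases? producing/i');
    start = 1;                      /* PRXNEXT moves this past each match */
    stop  = length(as_comments);
    call prxnext(prx, start, stop, as_comments, pos, len);
    do hitnum = 1 by 1 while (pos > 0);
        content = substr(as_comments, pos, len);
        output;                     /* one output row per match */
        call prxnext(prx, start, stop, as_comments, pos, len);
    end;
run;
```

For this sample text the step should write two rows, "Presumptive Carbapenemase Producing" and "Not Confirmed Carbapenemase Producing". The literal "carbapenemases?" can be swapped for ca[bepr]\w+ if misspellings need to be caught.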
Thanks so much for everyone's reply - it's genuinely appreciated.
PGStats - thank you for your explanation regarding the ? and i. The reason I'm using ca[bepr]\w+ is that there are 15+ variations of the word 'carbapenemase', i.e., it's frequently misspelled. I'm particularly interested in (not confirmed|\w+) - can I look back 2 or even 3 words from 'carbapenemase'?
Thanks again, Brian
If you know what words you are looking for:
'/(three word prefix|not confirmed|\w+)\s+carbapenemases? producing/i'
If you want up to three preceding words, whatever they are:
'/(\w+\W+){1,3}carbapenemases? producing/i'
Note: using \W instead of \s will allow words separated by any non-word characters, including spaces or punctuation.
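One thing to keep in mind with the {1,3} version is that it is greedy: it takes as many preceding words as it can, up to three, whether or not they belong to the phrase. A small sketch, using a made-up test string and CALL PRXSUBSTR to grab the first match:

```sas
data _null_;
    length text $200 content $100;
    /* Made-up test string, not taken from the real notes */
    text = 'Results pending. Not Confirmed Carbapenemase Producing CRE';
    prx = prxparse('/(\w+\W+){1,3}carbapenemases? producing/i');
    call prxsubstr(prx, text, pos, len);   /* position and length of first match */
    if pos > 0 then do;
        content = substr(text, pos, len);
        put content=;
    end;
run;
```

Here the log should show content=pending. Not Confirmed Carbapenemase Producing, i.e. the unrelated word "pending" comes along because \W+ also matches the period. If that is a problem, listing the known prefixes explicitly, as in the first pattern, is the safer choice.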
On top of what @PGStats writes.
1. Compile the RegEx only once. If you don't then you'll compile a separate RegEx in every iteration of your data step (variable PRX will then have a different value in every iteration - the "pointer" to the compiled RegEx stored in memory).
...
retain prx;
if _n_=1 then do;
    prx=prxparse(....
end;
....
2. Given that you are using PRXNEXT(): wouldn't you need an OUTPUT statement somewhere in your loop?
@Patrick wrote:
1. Compile the RegEx only once.
PRXPARSE expressions are not recompiled unless they contain variable strings. In this case the string is constant, and the PRX pointer will remain the same.
If you use a variable in the expression, but only want to compile it once, you can use the 'o' directive at the end of the string, e.g.:
prx = prxparse(cats('/((not confirmed\s*)?(ca[bepr]\w+ prod\w+',testString,'))/o'));
but even that is not necessary if the string is not variable.
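A quick way to check this on your own system is to print the value returned by PRXPARSE on each iteration. This is only a sketch; the data set three_rows is a throwaway created just to force three DATA step iterations.

```sas
/* Throwaway three-row input, for illustration only */
data three_rows;
    do obs = 1 to 3;
        output;
    end;
run;

data _null_;
    set three_rows;
    /* Constant pattern: no RETAIN and no /o option */
    prx = prxparse('/ca[bepr]\w+ prod\w+/i');
    put obs= prx=;   /* compare the identifier across iterations */
run;
```

If the pattern is compiled only once, the log should show the same prx value on all three lines.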
You are right! It definitely doesn't get recompiled and clog up memory.
Out of curiosity, I ran the code below. It looks like using a retained variable still provides a performance gain (about 3.4 seconds of real time over 100 million rows).
options fullstimer;

data have;
    do obs=1 to 100000000;
        output;
    end;
    stop;
run;

data _null_;
    set have;
    prx = prxparse(cats('/((not confirmed\s*)?(ca[bepr]\w+ prod\w+))/o'));
    output;
run;

data _null_;
    set have;
    retain prx;
    if _n_=1 then
        prx = prxparse(cats('/((not confirmed\s*)?(ca[bepr]\w+ prod\w+))/o'));
    output;
run;
Without RETAIN:
real time 6.61 seconds
user cpu time 6.44 seconds
system cpu time 0.16 seconds
memory 1324.57k
OS Memory 21664.00k

With RETAIN:
real time 3.21 seconds
user cpu time 3.07 seconds
system cpu time 0.15 seconds
memory 1304.21k
OS Memory 21664.00k