BrianB4233
Obsidian | Level 7

I'm searching through medical notes to capture all instances of a phrase, in particular 'carbapenemase producing'. This phrase can occur more than once in a string. I've been working with PRXNEXT, which I think is the most applicable function. As an example, take this string:

If amikacin results are needed, please notify
Microbiology Lab at ext.xxxx for further testing.
The organism will be held until x/xx/xx.
Presumptive Carbapenemase Producing CRE
See SPMI34 for Carba-R PCR Results
Not Confirmed Carbapenemase Producing CRE

From the comment above, I'd like to extract the phrases

Presumptive Carbapenemase Producing 

and

Not Confirmed Carbapenemase Producing

The code I've been using is here, and still in development:

data chk_one;
  set a01;
  prx = prxparse('/((not confirmed\s*)?(ca[bepr]\w+ prod\w+))/');
  _start_inout = 1;
  do hitnum = 1 by 1 until (pos=0);
    call prxnext (prx, _start_inout, length(as_comments), as_comments, pos, len);
    if len then do;
      content = substr(as_comments,pos,len);
    end;
  end;
run;

 

I'm able to generate the 2nd phrase "Not Confirmed Carbapenemase Producing" but the 1st one is a work in progress. Any help/advice would be appreciated.

6 REPLIES
PGStats
Opal | Level 21

I would use :

 

'/(not confirmed|\w+)\s+carbapenemases? producing/i'

 

Notes: The ? makes the plural "s" optional. The i at the end makes the match case-insensitive.
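
A quick way to try this pattern is a throwaway step that loops with CALL PRXNEXT and writes every hit to the log. This is only a sketch: the variable names (text, found, start, stop) and the condensed one-line test string are assumptions for illustration, not the real data.

```sas
data _null_;
   length text $200 found $60;
   /* Hypothetical test string, condensed from the comment in the question */
   text = 'The organism will be held. Presumptive Carbapenemase Producing CRE '
       || 'See SPMI34 for Carba-R PCR Results Not Confirmed Carbapenemase Producing CRE';
   prx = prxparse('/(not confirmed|\w+)\s+carbapenemases? producing/i');
   start = 1;
   stop  = length(text);
   /* First search; PRXNEXT moves START past each match it finds */
   call prxnext(prx, start, stop, text, pos, len);
   do while (pos > 0);
      found = substr(text, pos, len);
      put found=;
      call prxnext(prx, start, stop, text, pos, len);
   end;
run;
```

Against this test string the step should write both "Presumptive Carbapenemase Producing" and "Not Confirmed Carbapenemase Producing" to the log, because each hit is reported inside the loop rather than overwritten by the next one.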

PG
BrianB4233
Obsidian | Level 7

Thanks so much for everyone's reply - it's genuinely appreciated. 

 

PGStats - thank you for your explanation regarding the ? and i. The reason I'm using ca[bepr]\w+ is that there are 15+ variations of the word 'carbapenemase' in the notes, i.e., it's prone to being misspelled. I'm particularly interested in (not confirmed|\w+) - can I look back 2 or even 3 words from 'carbapenemase'?

 

Thanks again, Brian

PGStats
Opal | Level 21

If you know what words you are looking for:

 

'/(three word prefix|not confirmed|\w+)\s+carbapenemases? producing/i'

 

If you want any three words :

 

'/(\w+\W+){1,3}carbapenemases? producing/i'

 

Note: using \W instead of \s will allow words separated by any non-word characters, including spaces or punctuation.
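
One thing to keep in mind with a repeated group: a capture buffer on (\w+\W+){1,3} only keeps the last repetition. If the prefix is wanted as its own value, one option is to wrap the repetition in an outer group and read it with PRXPOSN. The sketch below assumes a hypothetical one-line test string and made-up variable names (text, prefix, phrase).

```sas
data _null_;
   length text $200 prefix $80 phrase $80;
   /* Hypothetical test string, condensed from the comment in the question */
   text = 'See SPMI34 for Carba-R PCR Results Not Confirmed Carbapenemase Producing CRE';
   /* Group 1 = whole prefix (one to three words); group 2 = the key phrase */
   prx = prxparse('/((?:\w+\W+){1,3})(carbapenemases? producing)/i');
   start = 1;
   stop  = length(text);
   call prxnext(prx, start, stop, text, pos, len);
   do while (pos > 0);
      /* PRXPOSN reads the capture buffers of the most recent match */
      prefix = prxposn(prx, 1, text);
      phrase = prxposn(prx, 2, text);
      put prefix= phrase=;
      call prxnext(prx, start, stop, text, pos, len);
   end;
run;
```

Because the prefix takes any words at all, it can absorb an unrelated word from the preceding text (here "Results" ahead of "Not Confirmed"), so the captured prefix may need cleaning up afterwards. Changing {1,3} to {0,3} would also allow the phrase to match with no prefix word.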

PG
Patrick
Opal | Level 21

On top of what @PGStats writes.

1. Compile the RegEx only once. If you don't then you'll compile a separate RegEx in every iteration of your data step (variable PRX will then have a different value in every iteration - the "pointer" to the compiled RegEx stored in memory).

...
retain prx;
if _n_=1 then 
  do;
    prx=prxparse(....
  end;
....

2. Given that you are using PRXNEXT(): wouldn't you need an OUTPUT statement somewhere in your loop?
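
Put together, a version of the original step with both points applied might look like the sketch below. The data set a01 and variable as_comments are from the question. The combined pattern, the $200 length for content and the helper variables start and stop are assumptions.

```sas
data chk_one;
   set a01;
   length content $200;   /* assumed width; adjust to the data */
   retain prx;
   /* Assumed combination: prefix from the first reply, spelling-tolerant
      core from the question, /i for case-insensitive matching */
   if _n_ = 1 then
      prx = prxparse('/(not confirmed|\w+)\s+ca[bepr]\w+\s+prod\w+/i');
   start  = 1;
   stop   = length(as_comments);
   hitnum = 0;
   call prxnext(prx, start, stop, as_comments, pos, len);
   do while (pos > 0);
      hitnum  = hitnum + 1;
      content = substr(as_comments, pos, len);
      output;             /* one row per match, not just the last one */
      call prxnext(prx, start, stop, as_comments, pos, len);
   end;
   drop prx start stop pos len;
run;
```

With the explicit OUTPUT inside the loop, each comment produces one row per match, and comments with no match produce no rows at all. If unmatched comments should be kept, an extra OUTPUT for the hitnum = 0 case would be needed after the loop.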

s_lassen
Meteorite | Level 14

@Patrick wrote:

1. Compile the RegEx only once.

 

PRXPARSE expressions are not recompiled unless they contain variable strings. In this case the string is constant, and the PRX pointer will remain the same.

 

If you use a variable in the expression, but only want to compile it once, you can use the 'o' directive at the end of the string, e.g.:

prx = prxparse(cats('/((not confirmed\s*)?(ca[bepr]\w+ prod\w+',testString,'))/o'));

but even that is not necessary if the string is not variable.

 

Patrick
Opal | Level 21

@s_lassen 

You are right! It definitely doesn't get recompiled, so it doesn't clog up memory.

Out of curiosity I ran the code below. It looks like using a retained variable still provides a small performance gain.

options fullstimer;
data have;
  do obs=1 to 100000000;
    output;
  end;
  stop;
run;

data _null_;
  set have;
  prx = prxparse(cats('/((not confirmed\s*)?(ca[bepr]\w+ prod\w+))/o'));
  output;
run;

data _null_;
  set have;
  retain prx;
  if _n_=1 then
    prx = prxparse(cats('/((not confirmed\s*)?(ca[bepr]\w+ prod\w+))/o'));
  output;
run;

First step (PRXPARSE called on every iteration):
      real time           6.61 seconds
      user cpu time       6.44 seconds
      system cpu time     0.16 seconds
      memory              1324.57k
      OS Memory           21664.00k

Second step (PRXPARSE called once, PRX retained):
      real time           3.21 seconds
      user cpu time       3.07 seconds
      system cpu time     0.15 seconds
      memory              1304.21k
      OS Memory           21664.00k

 

