PharmlyDoc
Quartz | Level 8

Are there alternatives to variable-length negative lookbehind in regular expressions?

I want no match when "no" or "negative" appears before one of the keywords (infarct)|(MI)|(myocardial infarction).

SAS returns the following error: Variable length lookbehind not implemented before HERE mark in regex

Thanks.

data text;
input string $60.;
cards;
Acute infarct.
Acute MI. 
Acute myocardial infarction. 
Extensive infarct found in the.
Massive infarct found in the.
No infarct. 
Negative infarct.
Negative for infarct.
No CT evidence of infarct.
No apparent evidence of infarct.
No apparent CT evidence of infarct. 
No evidence of infarct.
No evidence of infarct.
Evidence of infarct.
Evidence for infarct. 
;

/* Only Match: 
Acute infarct
Acute MI
Acute myocardial infarction
Extensive infarct found in the
Massive infarct found in the
evidence of infarct  
evidence for infarct
*/

data text_prx;
 set text;
 if _n_=1 then do; 
 retain re; 
 /* the variable-length lookbehind below is what triggers the error */
 re = prxparse('/(?<!\bno\b.{0,225}|\bnegative\b.{0,225})((infarct)|(MI)|(myocardial infarction))/i'); 
 if missing(re) then do; 
 putlog 'ERROR: regex is malformed'; 
 stop; 
 end; 
 end;

 if prxmatch(re,string) then infarct=1; 
 else infarct=0;
 run;

proc print data=text_prx(drop=re);
 run;

Accepted Solution
FreelanceReinh
Jade | Level 19

Hello @PharmlyDoc,

 

How about searching for the positive and negative words separately and then comparing character positions?

data text_prx(drop=re_: pos_:);
set text;
if _n_=1 then do; 
  re_i + prxparse('/(infarct)|(\bMI\b)|(myocardial infarction)/i'); 
  re_n + prxparse('/\bno\b|\bnegative\b/i'); 
end;
pos_i = prxmatch(re_i,string);
pos_n = prxmatch(re_n,string);
infarct = pos_i & not (0<pos_n<pos_i); 
run;

(Of course, "infarct" is a substring of "myocardial infarction", but I assume that you may want to evaluate the matches, e.g., with PRXPOSN.)
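
If a single pattern is preferred, one possible alternative is a negative lookahead anchored at the start of the string, since lookahead (unlike lookbehind) is not restricted to a fixed length in Perl regular expressions. The following is only a sketch under that assumption, reusing the text data set from the question; the output data set name text_la is made up for illustration. Note that it is slightly stricter than the position comparison above: it rejects a string if any "no"/"negative" precedes any keyword, not only the first occurrence of each.

```sas
data text_la(drop=re);
  set text;
  /* Sum statement keeps the compiled pattern id across iterations */
  if _n_=1 then
    re + prxparse('/^(?!.*\b(?:no|negative)\b.*(?:infarct|\bMI\b|myocardial infarction)).*(?:infarct|\bMI\b|myocardial infarction)/i');
  /* 1 if a keyword occurs and no "no"/"negative" precedes a keyword, else 0 */
  infarct = (prxmatch(re, string) > 0);
run;
```

Whether this is easier to maintain than two separate patterns is a matter of taste; the two-pattern approach also makes the match positions available for further checks.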


PGStats
Opal | Level 21

Note: if you need to know which of several possible substrings was matched (using PRXPOSN), you should list the longer substrings first:

data text_prx(drop=re_: pos_:);
set text;
if _n_=1 then do; 
  re_i + prxparse('/\b(myocardial infarction|infarct|MI)\b/i'); 
  re_n + prxparse('/\bno\b|\bnegative\b/i'); 
end;
pos_i = prxmatch(re_i,string);
pos_n = prxmatch(re_n,string);
infarct = pos_i & not (0<pos_n<pos_i); 
/* capture buffer 1 holds the keyword that was matched */
if infarct then word = prxposn(re_i, 1, string);
run;

 

PG
FreelanceReinh
Jade | Level 19

Thanks, @PGStats, for chiming in. I was under the impression that "myocardial infarction" is matched first in strings like "Acute myocardial infarction" even if listed after "infarct" in the regular expression because it starts earlier in the string. However, if the regex was /(infarction|infarct)/i, then the order of the two words would make a difference for PRXPOSN. (My guess that PRXPOSN might be involved somewhere else in the OP's code was just based on the way the regex was written.)
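
To illustrate the point about alternation order, here is a small sketch (variable names re_a, re_b, word_a, word_b are invented for this example). Both patterns start matching at the same position in the string, so the first alternative that succeeds determines what PRXPOSN returns.

```sas
data _null_;
  length word_a word_b $20;
  re_a = prxparse('/(infarct|infarction)/i');   /* shorter alternative first */
  re_b = prxparse('/(infarction|infarct)/i');   /* longer alternative first */
  string = 'Acute myocardial infarction';
  /* PRXMATCH must run before PRXPOSN can return a capture buffer */
  if prxmatch(re_a, string) then word_a = prxposn(re_a, 1, string);
  if prxmatch(re_b, string) then word_b = prxposn(re_b, 1, string);
  put word_a= word_b=;
run;
```

With Perl-style leftmost, first-alternative matching, the log should show word_a=infarct and word_b=infarction.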

PharmlyDoc
Quartz | Level 8

@FreelanceReinh 

Thanks for helping with this.

Why use

re_i + prxparse('/ /');

instead of

re_i = prxparse('/ /');

?

 

I would still use your method even if SAS provided a flavor of regex that allows for variable length negative lookbehind. This is an excellent workaround!!

Yes, I've found the prxposn function helpful for seeing what is being captured/matched. 

FreelanceReinh
Jade | Level 19

You're welcome.


@PharmlyDoc wrote:

Why use 

re_i + prxparse('/ /');

 instead of 

re_i = prxparse('/ /');

?


Just to save the RETAIN statement. The sum statement implies RETAIN and the result in re_i and re_n, respectively, is the same as with an assignment statement. (I learned this application of the sum statement to regex definitions from PGStats, years ago.)
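
For comparison, a minimal sketch of the two variants side by side, assuming the text data set from the question (the pattern is just a placeholder). A sum-statement variable is initialized to 0 and automatically retained, so adding the pattern id once on the first iteration leaves it available on all later iterations.

```sas
/* Variant 1: explicit RETAIN plus assignment */
data _null_;
  set text;
  retain re;
  if _n_=1 then re = prxparse('/\bno\b/i');
  if _n_=2 then put 'RETAIN + assignment: ' re=;
run;

/* Variant 2: sum statement, RETAIN is implied */
data _null_;
  set text;
  if _n_=1 then re + prxparse('/\bno\b/i');
  if _n_=2 then put 'Sum statement: ' re=;
run;
```

In both steps the value of re written on the second iteration should be the nonmissing pattern id assigned on the first one.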
