Solved: Re: Limiting # of words in a Regex

BrianB4233 · Posted 06-21-2021 05:13 PM

Hello community,

I've been implementing some regex code to capture BI-RADS scores (scores that assess breast cancer risk and range from 0 to 6), and some of them, unfortunately, are buried within free-text notes. While I have regex code (bi.?rads?\D*(.|\D+)?\d*) working reasonably well, I'm having difficulty limiting which digit is returned after the BIRAD keyword. Some examples:

1) BI-RAD category: 1 -- Regex code will capture entire string

2) BI-RADs 3 -- Regex code will capture entire string

3) BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018 -- Regex captures up to the "9" (the month, September, in the date)

I obviously do not want to capture the '9' in the 3rd example. My initial thought was to limit the number of words that can occur after the BIRAD keyword using some kind of word-boundary count, but I've had difficulty operationalizing that, and I may be overlooking a simpler approach. Any advice? Code example is attached.

Thanks, Brian.

rudfaden · Posted 06-22-2021 05:58 AM

You could probably use a lookaround. Here I just required a number followed by a period.

data birad_q1;
	input report_text $1-130;
	cards;
BI-RADS: Post Procedure Mammograms for Marker Placement  COMMENTS:  None.  Addendum: DATE ADDENDUM DICTATED:  12/16/2025
Overall Final Assessment: BI-RADS Category 3.  Probably Benign.
Clinical assessment: BI-RADS score 4.  Suspicious.
;
run;

data birad_q2;
	set birad_q1;

	/* Create pattern using PERL to grab the keyword and subsequent score */
	text_out=prxparse("/bi.?rad(?=\s|s).+(\d{1,3})(?=\.)/i");

	if prxmatch(text_out, report_text) then
		do;
			score=input(prxposn(text_out, 1, report_text),8.);
		end;
run;

ChrisNZ · Posted 06-22-2021 02:22 AM

Why not check for (and capture) a space after the digit?
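A minimal sketch of that idea, under a few assumptions: it uses a lookahead instead of actually capturing the space, and it also accepts sentence punctuation or end of string after the digit so that text like "Category 3." still matches. The data set name birad_space_check and the variable rx are hypothetical, and the test lines are taken from the examples in this thread.

```sas
data birad_space_check;
    infile datalines truncover;
    input report_text $char130.;

    /* Assumed pattern: a single digit that is followed by whitespace,    */
    /* sentence punctuation, or end of string. The "9" in "9/24/2018" is  */
    /* followed by a slash, so it is rejected.                            */
    rx = prxparse('/bi.?rads?\D*?(\d)(?=[\s.,;]|$)/i');

    if prxmatch(rx, report_text) then
        score = input(prxposn(rx, 1, report_text), 8.);

    drop rx;
    datalines;
BI-RAD category: 1
BI-RADs 3
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
Overall Final Assessment: BI-RADS Category 3.  Probably Benign.
;
run;
```

With these four lines, score should come out as 1, 3, missing, and 3. Because \D* cannot cross a digit, only the first digit after the keyword is ever tested, so a date later in the note is not picked up by accident.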


ChrisNZ · Posted 06-22-2021 06:32 AM

Another way if you really want to limit the word count:

(bi.?rads?(\s[^\s\d]){1,3}\d+)

but you still depend on the final digit(s).
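As posted, [^\s\d] matches only a single character per "word", and {1,3} requires at least one word, so a line like "BI-RADs 3" would not match. The sketch below is one possible variant, not necessarily what was intended: it allows zero to three non-numeric words, tolerates punctuation stuck to the keyword (as in "BI-RADS:"), and captures one digit. The data set name birad_word_limit and the variable rx are hypothetical.

```sas
data birad_word_limit;
    infile datalines truncover;
    input report_text $char130.;

    /* Assumed variant of the pattern above: optional punctuation after  */
    /* the keyword, then at most 3 words containing no digits, then a    */
    /* single captured digit that ends at a word boundary.               */
    rx = prxparse('/bi.?rads?[^\s\d]*(?:\s+[^\s\d]+){0,3}\s*(\d)\b/i');

    if prxmatch(rx, report_text) then
        score = input(prxposn(rx, 1, report_text), 8.);

    drop rx;
    datalines;
BI-RAD category: 1
BI-RADs 3
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
Clinical assessment: BI-RADS score 4.  Suspicious.
;
run;
```

Here score should be 1, 3, missing, and 4: in the third line the "9" sits five words after the keyword, so it is out of reach. The word limit only bounds the distance, though. A date that appears within three words of the keyword would still be captured.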

Simpler would be keeping group 1 in something like:

(bi-?rads?.*?)(--|date).*
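To illustrate "keeping group 1", here is a rough sketch that uses PRXCHANGE with a substitution so that only the first capture group survives. The /i flag, the data set name birad_trim, and the variable trimmed are assumptions added for the example.

```sas
data birad_trim;
    infile datalines truncover;
    input report_text $char130.;
    length trimmed $130;

    /* s/.../$1/ replaces the whole match with capture group 1, which    */
    /* cuts the text at the first "--" or "date" after the keyword.      */
    /* The second argument (1) limits the change to the first match.     */
    trimmed = prxchange('s/(bi-?rads?.*?)(--|date).*/$1/i', 1, strip(report_text));

    datalines;
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
BI-RADs 3
;
run;
```

The first line should be cut down to "BI-RADS CATEGORY EXPLANATION density", while the second has no "--" or "date" and is left unchanged. The score can then be searched for in trimmed instead of report_text. One caveat: "date" also matches inside longer words such as "update", so a stricter stop word list may be needed on real notes.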


BrianB4233 · Posted 06-23-2021 12:19 PM

Thank you all for your input, it's greatly appreciated.
