BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
BrianB4233
Obsidian | Level 7

Hello community,

I've been implementing some regex code to capture BIRAD scores (scores that assess risk of breast cancer that can range from 0-6) and some of them, unfortunately, are buried within free text notes. While I have regex code (bi.?rads?\D*(.|\D+)?\d*) working reasonably well I'm having difficulty limiting the return of the digit after the BIRAD keyword. Some examples:

 

1) BI-RAD category: 1 -- Regex code will capture entire string

 

2) BI-RADs 3 -- Regex code will capture entire string

 

3) BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018 -- Regex captures up until the "9", or the September

 

I obviously do not want to capture the '9' in the 3rd example. My initial thought was to limit the # of words that occur after the BIRAD keyword using some kind of word boundary count but I've difficulty operationalizing that, and I'm probably not thinking of a simpler approach. Any advice? Code example is attached.

 

Thanks, Brian.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
rudfaden
Lapis Lazuli | Level 10

You could probaely use lookaorund. Here i just said a number followd by a periode.

 

data birad_q1;
	input report_text $1-130;
	cards;
BI-RADS: Post Procedure Mammograms for Marker Placement  COMMENTS:  None.  Addendum: DATE ADDENDUM DICTATED:  12/16/2025
Overall Final Assessment: BI-RADS Category 3.  Probably Benign.
Clinical assessment: BI-RADS score 4.  Suspicious.
;
run;

data birad_q2;
	set birad_q1;

	/* Create pattern using PERL to grab the keyword and subsequent score */
	text_out=prxparse("/bi.?rad(?=\s|s).+(\d{1,3})(?=\.)/i");

	if prxmatch(text_out, report_text) then
		do;
			score=input(prxposn(text_out, 1, report_text),8.);
		end;
run;

View solution in original post

4 REPLIES 4
ChrisNZ
Tourmaline | Level 20

Why not check for (and capture) a space after the digit?

rudfaden
Lapis Lazuli | Level 10

You could probaely use lookaorund. Here i just said a number followd by a periode.

 

data birad_q1;
	input report_text $1-130;
	cards;
BI-RADS: Post Procedure Mammograms for Marker Placement  COMMENTS:  None.  Addendum: DATE ADDENDUM DICTATED:  12/16/2025
Overall Final Assessment: BI-RADS Category 3.  Probably Benign.
Clinical assessment: BI-RADS score 4.  Suspicious.
;
run;

data birad_q2;
	set birad_q1;

	/* Create pattern using PERL to grab the keyword and subsequent score */
	text_out=prxparse("/bi.?rad(?=\s|s).+(\d{1,3})(?=\.)/i");

	if prxmatch(text_out, report_text) then
		do;
			score=input(prxposn(text_out, 1, report_text),8.);
		end;
run;
ChrisNZ
Tourmaline | Level 20

Another way if you really want to limit the word count:

(bi.?rads?(\s[^\s\d]){1,3}\d+)

but you still spend on the final digit(s).

 

Simpler would be keeping group 1 in something like:

(bi-?rads?.*?)(--|date).*

 

BrianB4233
Obsidian | Level 7
Thank you all for your input, it's greatly appreciated.

sas-innovate-2026-white.png



April 27 – 30 | Gaylord Texan | Grapevine, Texas

Registration is open

Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1662 views
  • 0 likes
  • 3 in conversation