BrianB4233
Obsidian | Level 7

Hello community,

I've been implementing some regex code to capture BI-RADS scores (scores that assess breast cancer risk and range from 0 to 6), some of which, unfortunately, are buried within free-text notes. While my regex (bi.?rads?\D*(.|\D+)?\d*) works reasonably well, I'm having difficulty restricting the match to the digit that directly follows the BIRAD keyword. Some examples:

 

1) BI-RAD category: 1 -- Regex code will capture entire string

2) BI-RADs 3 -- Regex code will capture entire string

3) BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018 -- Regex captures up to the "9", i.e. the month (September)

 

I obviously do not want to capture the '9' in the 3rd example. My initial thought was to limit the number of words that can occur after the BIRAD keyword using some kind of word boundary count, but I've had difficulty operationalizing that, and I'm probably overlooking a simpler approach. Any advice? Code example is attached.

 

Thanks, Brian.

 

 

Accepted Solutions
rudfaden
Lapis Lazuli | Level 10

You could probably use lookaround. Here I just looked for a number followed by a period.

 

data birad_q1;
	input report_text $1-130;
	cards;
BI-RADS: Post Procedure Mammograms for Marker Placement  COMMENTS:  None.  Addendum: DATE ADDENDUM DICTATED:  12/16/2025
Overall Final Assessment: BI-RADS Category 3.  Probably Benign.
Clinical assessment: BI-RADS score 4.  Suspicious.
;
run;

data birad_q2;
	set birad_q1;

	/* Create pattern using PERL to grab the keyword and subsequent score */
	text_out=prxparse("/bi.?rad(?=\s|s).+(\d{1,3})(?=\.)/i");

	if prxmatch(text_out, report_text) then
		do;
			score=input(prxposn(text_out, 1, report_text),8.);
		end;
run;
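
One thing to watch with the pattern above: `.+` is greedy, so if a line holds more than one digit followed by a period, the last one wins, and a two-digit number would lose its first digit. A minimal sketch of a lazy variant (`.+?`) that takes the first such number after the keyword instead. The data set name `birad_lazy`, the variables `re_greedy`, `re_lazy`, `score_greedy`, `score_lazy`, and the first sample line are made up for illustration; the patterns are also compiled only once rather than on every row.

```sas
data birad_lazy;
	infile datalines truncover;
	input report_text $1-130;

	/* Compile both patterns once and keep the pattern ids across rows */
	retain re_greedy re_lazy;
	if _n_ = 1 then do;
		/* Pattern from the post above (greedy .+) */
		re_greedy = prxparse("/bi.?rad(?=\s|s).+(\d{1,3})(?=\.)/i");
		/* Same pattern with a lazy .+? (assumption: first hit is the score) */
		re_lazy   = prxparse("/bi.?rad(?=\s|s).+?(\d{1,3})(?=\.)/i");
	end;

	if prxmatch(re_greedy, report_text) then
		score_greedy = input(prxposn(re_greedy, 1, report_text), 8.);
	if prxmatch(re_lazy, report_text) then
		score_lazy = input(prxposn(re_lazy, 1, report_text), 8.);
	datalines;
BI-RADS Category 3.  Probably Benign.  Recommend follow-up at month 6.
Clinical assessment: BI-RADS score 4.  Suspicious.
;
run;
```

On the first line the greedy pattern should return 6 and the lazy one 3; on the second line both should return 4. Whether "first number before a period" is the right rule depends on how the notes are written, so treat this as a starting point.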

ChrisNZ
Tourmaline | Level 20

Why not check for (and capture) a space after the digit?
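
A rough sketch of that idea, run against the three examples from the question. Instead of capturing the space, it uses a lookahead that accepts whitespace, common punctuation, or end of string after the digit, which is a small extension of the suggestion rather than exactly what was proposed. The names `birad_space`, `re`, and `score` are placeholders.

```sas
data birad_space;
	infile datalines truncover;
	input report_text $1-130;

	/* Compile once; \D*? cannot skip over digits, so only the first
	   digit after the keyword is ever considered */
	retain re;
	if _n_ = 1 then
		re = prxparse("/bi.?rads?\D*?(\d)(?=[\s.,;]|$)/i");

	/* score stays missing when the digit is followed by something
	   else, such as the slash in a date */
	if prxmatch(re, report_text) then
		score = input(prxposn(re, 1, report_text), 8.);
	datalines;
BI-RAD category: 1
BI-RADs 3
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
;
run;
```

The first two lines should give 1 and 3. The third should give a missing score, because the "9" is followed by "/" and the lookahead rejects it.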

ChrisNZ
Tourmaline | Level 20

Another way if you really want to limit the word count:

(bi.?rads?(\s[^\s\d]+){0,3}\s*\d+)

but you still depend on the final digit(s).

 

Simpler would be keeping group 1 in something like:

(bi-?rads?.*?)(--|date).*
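
For the word-count route, here is a hedged sketch that allows at most three non-numeric words between the keyword and the score and captures only the digits. The pattern is a variation on the one above, not a tested production rule, and the names `birad_words`, `re`, and `score` are placeholders.

```sas
data birad_words;
	infile datalines truncover;
	input report_text $1-130;

	/* (?:\s+[^\s\d]+){0,3} = up to three words containing no digits;
	   group 1 is the run of digits that follows them */
	retain re;
	if _n_ = 1 then
		re = prxparse("/bi.?rads?(?:\s+[^\s\d]+){0,3}\s*(\d+)/i");

	if prxmatch(re, report_text) then
		score = input(prxposn(re, 1, report_text), 8.);
	datalines;
BI-RAD category: 1
BI-RADs 3
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
;
run;
```

The first two lines should give 1 and 3. The third should give no match, since five words sit between the keyword and the date. One known gap: punctuation glued to the keyword, as in "BI-RADS: 4", is not handled and would need something like an optional `:?` after `rads?`.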

 

BrianB4233
Obsidian | Level 7
Thank you all for your input, it's greatly appreciated.
