BrianB4233
Obsidian | Level 7

Hello community,

I've been implementing some regex code to capture BIRAD scores (scores that assess risk of breast cancer that can range from 0-6), and some of them, unfortunately, are buried within free text notes. While I have regex code (bi.?rads?\D*(.|\D+)?\d*) working reasonably well, I'm having difficulty limiting the match to the digit that follows the BIRAD keyword. Some examples:

 

1) BI-RAD category: 1 -- Regex code will capture entire string
2) BI-RADs 3 -- Regex code will capture entire string
3) BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018 -- Regex captures up until the "9", or the September

 

I obviously do not want to capture the '9' in the 3rd example. My initial thought was to limit the number of words that can occur after the BIRAD keyword using some kind of word boundary count, but I've had difficulty operationalizing that, and I'm probably not thinking of a simpler approach. Any advice? Code example is attached.

 

Thanks, Brian.
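
For readers who want to reproduce the behaviour described in the question, here is a minimal sketch. It uses Python's re module rather than SAS (an assumption made only so the pattern is easy to run; the pattern is the one from the question, and the helper name matched_text is hypothetical). It prints what the original pattern matches on the three examples.

```python
import re

# Pattern copied from the question; IGNORECASE plays the role of the /i flag.
ORIGINAL = re.compile(r"bi.?rads?\D*(.|\D+)?\d*", re.IGNORECASE)

samples = [
    "BI-RAD category: 1",
    "BI-RADs 3",
    "BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018",
]

def matched_text(text):
    """Return the full substring matched by the original pattern, or None."""
    m = ORIGINAL.search(text)
    return m.group(0) if m else None

for s in samples:
    # \D* runs across every non-digit, so the match always reaches the next digit.
    print(repr(matched_text(s)))
```

In the third sample the match stops right after the "9" of the date, because \D* is free to cross any number of words before the first digit.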

 

 

Accepted Solution
rudfaden
Pyrite | Level 9

You could probably use lookaround. Here I just matched a number followed by a period.

 

data birad_q1;
	input report_text $1-130;
	cards;
BI-RADS: Post Procedure Mammograms for Marker Placement  COMMENTS:  None.  Addendum: DATE ADDENDUM DICTATED:  12/16/2025
Overall Final Assessment: BI-RADS Category 3.  Probably Benign.
Clinical assessment: BI-RADS score 4.  Suspicious.
;
run;

data birad_q2;
	set birad_q1;

	/* Create pattern using PERL to grab the keyword and subsequent score */
	text_out=prxparse("/bi.?rad(?=\s|s).+(\d{1,3})(?=\.)/i");

	if prxmatch(text_out, report_text) then
		do;
			score=input(prxposn(text_out, 1, report_text),8.);
		end;
run;
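
As a quick cross-check outside SAS, the same pattern can be tried in Python's re module (an assumption for illustration; the helper name score_before_period is hypothetical). This sketch runs it on the three sample lines above and on one of the question's examples that has no trailing period.

```python
import re

# Same pattern as in the prxparse call above; IGNORECASE stands in for /i.
ACCEPTED = re.compile(r"bi.?rad(?=\s|s).+(\d{1,3})(?=\.)", re.IGNORECASE)

def score_before_period(text):
    """Return the captured number as an int, or None when nothing matches."""
    m = ACCEPTED.search(text)
    return int(m.group(1)) if m else None

lines = [
    "BI-RADS: Post Procedure Mammograms for Marker Placement  COMMENTS:  None.  "
    "Addendum: DATE ADDENDUM DICTATED:  12/16/2025",
    "Overall Final Assessment: BI-RADS Category 3.  Probably Benign.",
    "Clinical assessment: BI-RADS score 4.  Suspicious.",
    "BI-RADs 3",  # from the question: no period after the digit
]

for line in lines:
    print(score_before_period(line))
```

The date line yields no score because no digit there is directly followed by a period. The same requirement means a score at the very end of a note, such as "BI-RADs 3", is not picked up either, so the lookahead may need widening for notes written that way.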


4 Replies
ChrisNZ
Tourmaline | Level 20

Why not check for (and capture) a space after the digit?
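
One possible reading of this suggestion, sketched in Python's re module (an assumption for illustration; the pattern and the name score_before_space are hypothetical, not taken from the thread). It uses a lookahead instead of capturing the space, and also accepts end of string, so that a score at the very end of a note still counts.

```python
import re

# A single digit right after the non-digit run, followed by whitespace or end of text.
SPACE_AFTER = re.compile(r"bi.?rads?\D*(\d)(?=\s|$)", re.IGNORECASE)

def score_before_space(text):
    """Return the digit as an int if it is followed by whitespace or end of text."""
    m = SPACE_AFTER.search(text)
    return int(m.group(1)) if m else None

print(score_before_space("BI-RAD category: 1"))
print(score_before_space("BI-RADs 3"))
# The "9" of the date is followed by "/", so the lookahead rejects it.
print(score_before_space("BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018"))
```

A score followed directly by a period (for example "Category 3.") would not satisfy this lookahead as written, so punctuation would have to be added to it if the notes use that style.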

ChrisNZ
Tourmaline | Level 20

Another way if you really want to limit the word count:

(bi.?rads?(\s[^\s\d]){1,3}\d+)

but you still depend on the final digit(s).
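
A sketch of the word-count idea in Python's re module (an assumption for illustration). As posted, [^\s\d] matches only one character per word, so this hypothetical variant assumes a + was intended and allows zero to three words before the digits. The name score_within_three_words is made up for the example.

```python
import re

# Hypothetical variant of the pattern above: "+" added so each group matches a
# whole word, and {0,3} so that a digit directly after the keyword still matches.
WORD_LIMITED = re.compile(r"bi.?rads?(?:\s+[^\s\d]+){0,3}\s*(\d+)", re.IGNORECASE)

def score_within_three_words(text):
    """Return the digits found within three words of the keyword, or None."""
    m = WORD_LIMITED.search(text)
    return int(m.group(1)) if m else None

print(score_within_three_words("BI-RAD category: 1"))
print(score_within_three_words("BI-RADs 3"))
# Five words sit between the keyword and the date, so nothing is returned.
print(score_within_three_words("BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018"))
# A date that is close to the keyword is still picked up.
print(score_within_three_words("BI-RADS date: 9/24/2018"))
```

The last call shows the limit of this approach: whatever digits come within the word limit are returned, whether or not they are a score.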

 

Simpler would be keeping group 1 in something like:

(bi-?rads?.*?)(--|date).*
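
A sketch of the "keep group 1" idea in Python's re module (an assumption for illustration; the helper name score_from_prefix and the follow-up search for a digit are hypothetical additions). Group 1 holds everything from the keyword up to the first "--" or "date", and the score is then looked for only inside that prefix.

```python
import re

# Pattern as posted above; group 1 is the text before the first "--" or "date".
PREFIX = re.compile(r"(bi-?rads?.*?)(--|date).*", re.IGNORECASE)

def score_from_prefix(text):
    """Return the first digit found in group 1, or None."""
    m = PREFIX.search(text)
    if not m:
        return None  # neither delimiter is present in the text
    digit = re.search(r"\d", m.group(1))
    return int(digit.group()) if digit else None

print(score_from_prefix("BI-RAD category: 1 -- trailing comment"))
# Group 1 ends before "date", so the 9 of the date is never seen.
print(score_from_prefix("BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018"))
```

Note that text containing neither "--" nor "date" does not match at all, so this works best when one of those delimiters reliably follows the score.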

 

BrianB4233
Obsidian | Level 7
Thank you all for your input, it's greatly appreciated.
