Hello community,
I've been implementing some regex code to capture BIRAD scores (scores that assess the risk of breast cancer and range from 0 to 6), and some of them, unfortunately, are buried within free-text notes. While I have regex code (bi.?rads?\D*(.|\D+)?\d*) working reasonably well, I'm having difficulty limiting the match to the digit that directly follows the BIRAD keyword. Some examples:
1) BI-RAD category: 1 -- Regex code will capture entire string
2) BI-RADs 3 -- Regex code will capture entire string
3) BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018 -- Regex captures everything up to the "9" (the month, September)
I obviously do not want to capture the '9' in the 3rd example. My initial thought was to limit the number of words that can occur after the BIRAD keyword using some kind of word-boundary count, but I've had difficulty operationalizing that, and I'm probably overlooking a simpler approach. Any advice? Code example is attached.
Thanks, Brian.
You could probably use a lookaround. Here I just required a number followed by a period.
data birad_q1;
    /* TRUNCOVER keeps INPUT from flowing onto the next line when a record is shorter than 130 columns */
    infile cards truncover;
    input report_text $1-130;
    cards;
BI-RADS: Post Procedure Mammograms for Marker Placement COMMENTS: None. Addendum: DATE ADDENDUM DICTATED: 12/16/2025
Overall Final Assessment: BI-RADS Category 3. Probably Benign.
Clinical assessment: BI-RADS score 4. Suspicious.
;
run;
data birad_q2;
    set birad_q1;
    /* Create pattern using PERL to grab the keyword and subsequent score:
       group 1 holds the digit(s) that sit immediately before a period */
    text_out = prxparse("/bi.?rad(?=\s|s).+(\d{1,3})(?=\.)/i");
    /* On a match, convert capture group 1 to a numeric score */
    if prxmatch(text_out, report_text) then
    do;
        score = input(prxposn(text_out, 1, report_text), 8.);
    end;
run;
Why not check for (and capture) a space after the digit?
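A minimal sketch of that suggestion, assuming the score is a single digit and that "a space after the digit" can be expressed as a lookahead for whitespace or end of string. The dataset name birad_space and the variable pattern_id are hypothetical; the three test lines are the examples from the question.

```sas
data birad_space;
    infile datalines truncover;
    input report_text $char130.;
    /* Assumed pattern: keyword, any run of non-digits, then one digit
       (group 1) that must be followed by whitespace or end of string */
    pattern_id = prxparse("/bi.?rads?\D*(\d)(?=\s|$)/i");
    if prxmatch(pattern_id, report_text) then
        score = input(prxposn(pattern_id, 1, report_text), 8.);
    drop pattern_id;
    datalines;
BI-RAD category: 1
BI-RADs 3
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
;
run;
```

Under these assumptions the first two rows should yield scores 1 and 3, while the date row leaves score missing because the "9" is followed by a slash. A score written as "3." would also be rejected by this lookahead, so the period would need to be allowed if that form occurs in the notes.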
Another way, if you really want to limit the word count:
(bi.?rads?(\s+[^\s\d]+){0,3}\s+\d+)
but you still depend on the final digit(s).
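The word-count limit can be sketched in a SAS data step roughly as follows. This assumes each intervening word is a run of non-space, non-digit characters, that zero to three such words may sit between the keyword and the score, and that the score is a single digit. The dataset name birad_words and the variable pattern_id are hypothetical.

```sas
data birad_words;
    infile datalines truncover;
    input report_text $char130.;
    /* Assumed pattern: keyword, at most three digit-free words, then
       whitespace and a single digit captured in group 1 */
    pattern_id = prxparse("/bi.?rads?(?:\s+[^\s\d]+){0,3}\s+(\d)/i");
    if prxmatch(pattern_id, report_text) then
        score = input(prxposn(pattern_id, 1, report_text), 8.);
    drop pattern_id;
    datalines;
BI-RAD category: 1
BI-RADs 3
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
;
run;
```

With these three lines the date row should not match, because five words separate the keyword from the "9". A date that appears within three words of the keyword would still be picked up, so the limit is a heuristic rather than a guarantee.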
Simpler would be keeping group 1 in something like:
(bi-?rads?.*?)(--|date).*