Hello community,
I've been writing regex code to capture BI-RADS scores (scores that assess breast cancer risk, ranging from 0 to 6), some of which, unfortunately, are buried in free-text notes. My current regex, (bi.?rads?\D*(.|\D+)?\d*), works reasonably well, but I'm having difficulty limiting which digit is returned after the BIRAD keyword. Some examples:
1) BI-RAD category: 1 -- Regex code will capture entire string
2) BI-RADs 3 -- Regex code will capture entire string
3) BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018 -- Regex captures everything up through the "9" (the month, September) in the date
I obviously do not want to capture the '9' in the 3rd example. My initial thought was to limit the number of words that can occur after the BIRAD keyword using some kind of word-boundary count, but I've had difficulty operationalizing that, and I'm probably overlooking a simpler approach. Any advice? A code example is attached.
Thanks, Brian.
You could probably use a lookaround. Here I just required a number followed by a period.
data birad_q1;
    input report_text $1-130;
    cards;
BI-RADS: Post Procedure Mammograms for Marker Placement COMMENTS: None. Addendum: DATE ADDENDUM DICTATED: 12/16/2025
Overall Final Assessment: BI-RADS Category 3. Probably Benign.
Clinical assessment: BI-RADS score 4. Suspicious.
;
run;

data birad_q2;
    set birad_q1;
    /* Create pattern using PERL to grab the keyword and subsequent score */
    text_out = prxparse("/bi.?rad(?=\s|s).+(\d{1,3})(?=\.)/i");
    if prxmatch(text_out, report_text) then do;
        /* Capture group 1 holds the digits that precede the period */
        score = input(prxposn(text_out, 1, report_text), 8.);
    end;
run;
Why not check for (and capture) a space after the digit?
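To make that suggestion concrete, here is a minimal sketch that reuses the three examples from the question. It is only one way to do it and rests on assumptions of mine: the score is a single digit 0-6, only non-digit characters sit between the keyword and the score, and the dataset name birads_space_check and the variable rx are made up for illustration.

```sas
data birads_space_check;
    infile datalines truncover;
    input report_text $char130.;
    /* Assumed pattern: keyword, any run of non-digits, then one digit 0-6
       that must be followed by whitespace or the end of the string */
    rx = prxparse('/bi.?rads?\D*?([0-6])(?=\s|$)/i');
    if prxmatch(rx, report_text) then
        score = input(prxposn(rx, 1, report_text), 8.);
    drop rx;
    datalines;
BI-RAD category: 1
BI-RADs 3
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
;
run;
```

Because \D cannot step over a digit, the first digit after the keyword has to be the score. In the third row that digit is the "9" of the date: it is outside 0-6 and is followed by "/", so score should stay missing there. A score followed directly by punctuation (for example "3.") would also be skipped by this version unless the lookahead is widened.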
Another way, if you really want to limit the word count (here to at most three digit-free words between the keyword and the score):
(bi.?rads?(\s+[^\s\d]+){0,3}\s+\d+)
but you still depend on the final digit(s).
Simpler would be keeping group 1 in something like:
(bi-?rads?.*?)(--|date).*
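For the word-limit idea, a rough SAS sketch might look like the following. The specifics are my assumptions rather than anything from the posts above: at most three digit-free words between the keyword and the score, a single-digit score 0-6, and a lookahead that accepts whitespace, common punctuation, or the end of the string after the digit. The names birads_word_limit and rx are hypothetical.

```sas
data birads_word_limit;
    infile datalines truncover;
    input report_text $char130.;
    /* Assumed pattern: keyword, optional colon, at most three digit-free
       words, then one digit 0-6 followed by whitespace, punctuation,
       or the end of the string */
    rx = prxparse('/bi.?rads?:?(?:\s+[^\s\d]+){0,3}\s+([0-6])(?=[\s.,;]|$)/i');
    if prxmatch(rx, report_text) then
        score = input(prxposn(rx, 1, report_text), 8.);
    drop rx;
    datalines;
BI-RAD category: 1
Overall Final Assessment: BI-RADS Category 3. Probably Benign.
BI-RADS CATEGORY EXPLANATION density date assessed: 9/24/2018
;
run;
```

The first two rows should yield scores 1 and 3. In the third row more than three words separate the keyword from the date, so no match is expected and score should stay missing. The three-word limit is arbitrary and would need tuning against real notes.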