Hi all,
I've written what I thought was some solid code to extract multiple dates, and the preceding 18 words/non-words, from a free text field. I'm using PRXNEXT b/c there are often multiple dates within the text field and I'd like to extract all of them. However, the SAS output doesn't match what I see when I test the same pattern at https://regex101.com/. SAS is correctly identifying, and outputting, the date using PRXPOSN but it's not including all of the words/non-words preceding the date.
What is being output in the temp dataset is this:
year: 2.9 %..........[average
woman <1.67%]
NCI Lifetime: 15.1 %..........[average
woman <10%]
A
Whereas in regex101 it's showing this: https://regex101.com/r/LWRcqN/1
data data_chk1;
    length dt_1-dt_12 $150
           dt_out1-dt_out12 $30
           imp_rep_concat $11000
    ;
    set work.birad_score_0_3(drop=cht_in impressiontext reporttext obs=max);

    /* Combine impression & report text together to search as one */
    imp_rep_concat = catx(' REPORT_TEXT ',impression_copy,report_copy);

    *** Identifies ddOctdd or dOctdddd or ddOctdddd as well if there is a
        space/hyphen/whatever between the day & month or month & year;
    if _n_ = 1 then do;
        retain dt_pattern;
        dt_pattern = prxparse("/(?:\w+\W+){0,18}(\d{1,2}(\.|\/|-)\d{1,2}(\.|\/|-)\d{2,4})/i");
    end;

    /*if prxmatch(dt_pattern,impression_copy) then do;*/
    /*match = 1;*/
    /* date_out = prxposn(dt_pattern,1,impression_copy);*/
    /*end;*/

    start = 1;
    stop = length(imp_rep_concat);
    call prxnext(dt_pattern,start,stop,imp_rep_concat,pos,len);

    array comm[12] $dt_1-dt_12;
    array comm1[12] $dt_out1-dt_out12;
    do i = 1 to 12 while (pos > 0);
        comm(i) = substr(imp_rep_concat,pos,len);
        comm1(i) = prxPosn(dt_pattern, 1, imp_rep_concat);
        call prxnext(dt_pattern,start,stop,imp_rep_concat,pos,len);
    end;

    *drop dt_1-dt_12 dt_pattern: start: stop: pos len i;
run;
Any ideas what is causing the inconsistency? Thank you.
1. The link you sent uses the flags gmi. SAS does not support the g flag, and you only use i in your code.
2. In any case, using only i gives the same result on regex101.com.
3. The result is the same in SAS and regex101 except that the SAS result is truncated at length 200. Try lengthening the variables that receive the matched text (dt_1-dt_12 in your code).
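To illustrate point 3, here is a minimal sketch of the truncation effect. It reuses the pattern from the question, but the data set name (dt_len_demo), the sample sentence, and the variables txt, short_match, full_match and dt are made up for the demo and are not from the original code. The matched text is longer than 20 characters, so the $20 variable silently loses the end of the match (including the date), while the longer variable keeps all of it.

```sas
/* Sketch only: hypothetical names and sample text, same regex as the question. */
data dt_len_demo;
    length txt $400 short_match $20 full_match $400 dt $30;
    retain dt_pattern;
    if _n_ = 1 then
        dt_pattern = prxparse("/(?:\w+\W+){0,18}(\d{1,2}(\.|\/|-)\d{1,2}(\.|\/|-)\d{2,4})/i");

    /* Made-up sample text containing a single date */
    txt = "Compared with the prior screening mammogram the findings are stable since 01/02/2020.";

    start = 1;
    stop  = length(txt);
    call prxnext(dt_pattern, start, stop, txt, pos, len);

    /* CALL PRXNEXT in a loop plays the role of the g flag on regex101 */
    do while (pos > 0);
        short_match = substr(txt, pos, len);  /* cut off after 20 characters, no warning */
        full_match  = substr(txt, pos, len);  /* long enough to hold the whole match */
        dt          = prxposn(dt_pattern, 1, txt);  /* capture group 1: the date itself */
        output;
        call prxnext(dt_pattern, start, stop, txt, pos, len);
    end;

    keep short_match full_match dt len;
run;

proc print data=dt_len_demo;
run;
```

Comparing len with the declared length of the receiving variable is a quick way to check whether a match fits. In the posted step, that would mean raising the length of dt_1-dt_12 until it covers the longest len you see.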