Hi all,

I've written what I thought was solid code to extract multiple dates, along with the preceding 18 words/non-words, from a free-text field. I'm using PRXNEXT because there are often multiple dates within the text field and I'd like to extract all of them. However, when I test the pattern at https://regex101.com/ and then compare it with the SAS results, the two don't match. SAS is correctly identifying and outputting the date using PRXPOSN, but it's not including all of the words/non-words preceding the date.

What is being output in the temp dataset is this:

year: 2.9 %..........[average woman <1.67%] NCI Lifetime: 15.1 %..........[average woman <10%] A

Whereas regex101 shows this: https://regex101.com/r/LWRcqN/1

data data_chk1;
length dt_1-dt_12 $150
dt_out1-dt_out12 $30
imp_rep_concat $11000
;
set work.birad_score_0_3(drop=cht_in impressiontext reporttext obs=max);
/* Combine impression & report text together to search as one */
imp_rep_concat = catx(' REPORT_TEXT ',impression_copy,report_copy);
*** Matches numeric dates such as d/d/dd or dd-dd-dddd or dd.dd.dddd, plus up to
18 preceding word/non-word pairs;
if _n_ = 1 then do;
  retain dt_pattern;
  dt_pattern = prxparse("/(?:\w+\W+){0,18}(\d{1,2}(\.|\/|-)\d{1,2}(\.|\/|-)\d{2,4})/i");
end;
/*if prxmatch(dt_pattern,impression_copy) then do;*/
/*match = 1;*/
/* date_out = prxposn(dt_pattern,1,impression_copy);*/
/*end;*/
start = 1;
stop = length(imp_rep_concat);
call prxnext(dt_pattern,start,stop,imp_rep_concat,pos,len);
array comm[12] $dt_1-dt_12;
array comm1[12] $dt_out1-dt_out12;
do i = 1 to 12 while (pos > 0);
  comm(i) = substr(imp_rep_concat,pos,len);
  comm1(i) = prxPosn(dt_pattern, 1, imp_rep_concat);
  call prxnext(dt_pattern,start,stop,imp_rep_concat,pos,len);
end;
*drop dt_1-dt_12 dt_pattern: start: stop: pos len i;
run;

Any ideas what is causing the inconsistency? Thank you.