BrianB4233
Obsidian | Level 7

Hi all,

I've written what I thought was some solid code to extract multiple dates, and the preceding 18 words/non-words, from a free text field. I'm using PRXNEXT because there are often multiple dates within the text field and I'd like to extract all of them. However, the results I get in SAS don't match what I see when I test the same pattern at https://regex101.com/. SAS correctly identifies and outputs the date using PRXPOSN, but the extracted text doesn't include all of the words/non-words preceding the date.

 

What is being output in the temp dataset is this:

year: 2.9 %..........[average
woman <1.67%]
NCI Lifetime: 15.1 %..........[average
woman <10%]

A

Whereas in regex101 it's showing this: https://regex101.com/r/LWRcqN/1

 

data data_chk1;

length dt_1-dt_12 $150 
dt_out1-dt_out12 $30
imp_rep_concat $11000
;

set work.birad_score_0_3(drop=cht_in impressiontext reporttext obs=max);

/* Combine impression & report text together to search as one */
imp_rep_concat = catx(' REPORT_TEXT ',impression_copy,report_copy);

*** Identifies numeric dates (1-2 digits, separator, 1-2 digits, separator, 2-4 digits,
where the separator is . / or -) plus up to 18 preceding words/non-words;
if _n_ = 1 then do;
retain dt_pattern;
 dt_pattern = prxparse("/(?:\w+\W+){0,18}(\d{1,2}(\.|\/|-)\d{1,2}(\.|\/|-)\d{2,4})/i");
end;

/*if prxmatch(dt_pattern,impression_copy) then do;*/
/*match = 1;*/
/* date_out = prxposn(dt_pattern,1,impression_copy);*/
/*end;*/

start = 1;
stop = length(imp_rep_concat);

call prxnext(dt_pattern,start,stop,imp_rep_concat,pos,len);
array comm[12] $ dt_1-dt_12;
array comm1[12] $ dt_out1-dt_out12;
do i = 1 to 12 while (pos > 0);
	comm(i) = substr(imp_rep_concat,pos,len);
	comm1(i) = prxposn(dt_pattern, 1, imp_rep_concat);
	call prxnext(dt_pattern,start,stop,imp_rep_concat,pos,len);
end;


*drop dt_1-dt_12 dt_pattern: start: stop: pos len i;
run;

Any ideas what is causing the inconsistency? Thank you.

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

1. The link you sent uses the flags gmi. SAS does not support the g option, and your code only uses i.

2. In any case, using only i gives the same result on regex101.com.

3. The result is the same in SAS and regex101, except that the SAS result is truncated at length 200. Try lengthening the variable.
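
The truncation described in point 3 can be reproduced in a small standalone DATA step. This is only a sketch under stated assumptions: the sample text and the variable names short_buf, long_buf and re are made up for illustration, and only the pattern and the PRXNEXT loop are taken from the posted code. It stores each match in a variable that is deliberately too short and in one that is long enough, and writes the match length to the log for comparison.

```sas
data _null_;
   /* short_buf is deliberately too small; long_buf is large enough (both hypothetical) */
   length text $200 short_buf $20 long_buf $200;

   /* Made-up sample text containing two dates */
   text = 'Compared with the prior study dated 03/14/2019. Follow-up performed on 9-2-2020 was stable.';

   /* Same pattern as in the question */
   re = prxparse('/(?:\w+\W+){0,18}(\d{1,2}(\.|\/|-)\d{1,2}(\.|\/|-)\d{2,4})/i');

   start = 1;
   stop  = length(text);

   /* Looping on CALL PRXNEXT walks through every match in the text */
   call prxnext(re, start, stop, text, pos, len);
   do i = 1 to 12 while (pos > 0);
      short_buf = substr(text, pos, len);   /* cut off at 20 characters, no warning */
      long_buf  = substr(text, pos, len);   /* holds the whole match */
      put i= len=;
      put short_buf=;
      put long_buf=;
      call prxnext(re, start, stop, text, pos, len);
   end;
run;
```

Whenever len in the log is larger than the declared length of the receiving variable, the stored value has been cut off even though the regular expression matched the full span. In the posted code the receiving variables are declared in the LENGTH statement (dt_1-dt_12), so raising that length should be enough, assuming the match itself is what regex101 shows.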

 


BrianB4233
Obsidian | Level 7
Thank you ChrisNZ - not sure how I missed this but it was indeed #3.
