Solved: Extract multiple substrings from string using Regex

Matos · Posted 05-03-2020 06:02 AM

Dear all,

I am new with Regex and I would like to use it to extract from a string all substrings that match a specific pattern. The output should display, in a new column (called extraction), all matching substrings together separated by a space. Below you can find the initial dataset.

data have;

input text & $200.;

cards;

{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}

DATA:Y(A_12.00)<=Z(Y_99.99)

A(W_99.99.a)

{?xxZ_99.00} {Z(A_15.07.n)}

;

run;

The matching string could have the following pattern:

- one uppercase letter, followed by underscore and two digits, a dot plus another two digits (ex. Z_99.00) or

- one uppercase letter, followed by underscore and two digits, a dot plus another two digits plus another dot and letter (ex: A_15.00.n)

Desired result:

text	Extraction
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}	A_15.01.n Z_18.10.c Y_18.04.a X_20.99.d
DATA:Y(A_12.00)<=Z(Y_99.99)	A_12.00 Y_99.99
A(W_99.99.a)	W_99.99.a
{?xxZ_99.00} {Z(A_15.07.n)}	Z_99.00 A_15.07.n

Any help on this topic appreciated.

Many thanks.

ed_sas_member · Posted 05-03-2020 06:33 AM

Hi @Matos

Here is a way to achieve this:

data have;
	input text & $200.;
	cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;

data want;
	set have;
	length extraction $ 200;
	array _extraction (200) $ 200;
	expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
	start = 1;
	stop = length(text);
	
	call prxnext(ExpressionID, start, stop, text, position, length);
      do while (position > 0);
         _extraction(position) = substr(text, position, length);
         call prxnext(ExpressionID, start, stop, text, position, length);
      end;	
     
    extraction = catx(" ",of _extraction(*));
      
	drop _: expressionID start	stop position length ;
run;

Output:

You can find interesting documentation here:

- CALL PRXNEXT Routine

https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n1obc9u7z3225mn1npwnassehff0.htm...

- SAS REGEX TIP SHEET https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

Hope this helps!

All the best,

View solution in original post

ed_sas_member · Posted 05-03-2020 06:33 AM

Hi @Matos

Here is a way to achieve this:

data have;
	input text & $200.;
	cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;

data want;
	set have;
	length extraction $ 200;
	array _extraction (200) $ 200;
	expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
	start = 1;
	stop = length(text);
	
	call prxnext(ExpressionID, start, stop, text, position, length);
      do while (position > 0);
         _extraction(position) = substr(text, position, length);
         call prxnext(ExpressionID, start, stop, text, position, length);
      end;	
     
    extraction = catx(" ",of _extraction(*));
      
	drop _: expressionID start	stop position length ;
run;

Output:

You can find interesting documentation here:

- CALL PRXNEXT Routine

https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n1obc9u7z3225mn1npwnassehff0.htm...

- SAS REGEX TIP SHEET https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

Hope this helps!

All the best,

Matos · Posted 05-03-2020 06:46 AM

Thank you very much, also for the documentation! Really useful.

ed_sas_member · Posted 05-03-2020 07:22 AM

You're welcome @Matos !

Matos · Posted 05-03-2020 10:54 AM

Just a follow-up question. In case there are duplicated matches per row, is it possible to adapt the code in a simple way so that is doesn't show those duplicates? Basically the matches inside the array should be unique. I guess there was a similar discussion already, but maybe you have different solutions.

https://communities.sas.com/t5/SAS-Programming/Removing-repeated-substring-in-String-Observation-fro...

https://communities.sas.com/t5/General-SAS-Programming/remove-duplicate-numbers-in-a-string-of-numbe...

Thanks.

ed_sas_member · Posted 05-03-2020 12:03 PM

Hi @Matos

I have modified the code as follows so as to suppress the array and add a proc transpose after managing duplicates. Does that answer your question?

Best,

data have;
	input text & $200.;
	cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99) DATA:Y(A_12.00)
A(W_99.99.a) A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;

data have2;
	set have;
	length extraction $ 200;
	expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
	start = 1;
	stop = length(text);
	
	call prxnext(ExpressionID, start, stop, text, position, length);
      do while (position > 0);
         extraction = strip(substr(text, position, length));
         call prxnext(ExpressionID, start, stop, text, position, length);
         output;
      end;	
      
	drop expressionID start	stop position length ;
run;

/* With "duplicates" */

proc sort data=have2 ;
	by text extract:;
run;

proc transpose data=have2 out=have_tr (drop=_:) prefix=extract;
	var extraction;
	by text notsorted;
run;

data want1;
	set have_tr;
	length extraction $ 200;
	extraction = catx(" ", of extract:);
run;

proc print;
	title "With potential duplicate matches";
run;

/* Without "duplicates" */

proc sort data=have2 nodupkey;
	by text extract:;
run;

proc transpose data=have2 out=have_tr (drop=_:) prefix=extract;
	var extraction;
	by text notsorted;
run;

data want2;
	set have_tr;
	length extraction $ 200;
	extraction = catx(" ", of extract:);
run;

proc print data=want2;
	title "Without duplicate matches";
run;

title;

Matos · Posted 05-03-2020 12:20 PM

Thanks Ed! You offered several interesting options, very useful. We could also take have2 (first time it was generated) and apply proc sql disctinct to exlcude duplicates.

create table want2 
SELECT DISTINCT text, extraction
from have2;
quit;

Jagadishkatam · Posted 05-03-2020 07:07 AM

please try the below code

data have;
input text & :$200.;
cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;

data want;  
set have;   
start=1;   
stop=length(text);   
pattern_id=prxparse('/(\w\_\d{2,2}\.\d{2,2}\.\w)|(\w\_\d{2,2}\.\d{2,2})/');   
call prxnext(pattern_id,start,stop,text,position,length);    
do while(position>0);    
new_text=substr(text,position,length);
call prxnext(pattern_id,start,stop,text,position,length);  
output; 
end; 
run;

data want2;
set want;
by text notsorted;
retain new_text2;
if first.text then new_text2=new_text;
else new_text2=catx(' ',new_text2,new_text);
if last.text;
run;

Thanks,
Jag

Matos · Posted 05-03-2020 07:50 AM

Amazing Jag! That's also a nice way to achieve it.

Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Re: Extract multiple substrings from string using Regex

Registration is open