Dear all,
I am new with Regex and I would like to use it to extract from a string all substrings that match a specific pattern. The output should display, in a new column (called extraction), all matching substrings together separated by a space. Below you can find the initial dataset.
data have;
input text & $200.;
cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;
The matching string could have the following pattern:
- one uppercase letter, followed by underscore and two digits, a dot plus another two digits (ex. Z_99.00) or
- one uppercase letter, followed by underscore and two digits, a dot plus another two digits plus another dot and letter (ex: A_15.00.n)
Desired result:
text | Extraction |
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)} | A_15.01.n Z_18.10.c Y_18.04.a X_20.99.d |
DATA:Y(A_12.00)<=Z(Y_99.99) | A_12.00 Y_99.99 |
A(W_99.99.a) | W_99.99.a |
{?xxZ_99.00} {Z(A_15.07.n)} | Z_99.00 A_15.07.n |
Any help on this topic appreciated.
Many thanks.
Hi @Matos
Here is a way to achieve this:
data have;
input text & $200.;
cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;
data want;
set have;
length extraction $ 200;
array _extraction (200) $ 200;
expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
start = 1;
stop = length(text);
call prxnext(ExpressionID, start, stop, text, position, length);
do while (position > 0);
_extraction(position) = substr(text, position, length);
call prxnext(ExpressionID, start, stop, text, position, length);
end;
extraction = catx(" ",of _extraction(*));
drop _: expressionID start stop position length ;
run;
Output:
You can find interesting documentation here:
- CALL PRXNEXT Routine
- SAS REGEX TIP SHEET https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf
Hope this helps!
All the best,
Hi @Matos
Here is a way to achieve this:
data have;
input text & $200.;
cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;
data want;
set have;
length extraction $ 200;
array _extraction (200) $ 200;
expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
start = 1;
stop = length(text);
call prxnext(ExpressionID, start, stop, text, position, length);
do while (position > 0);
_extraction(position) = substr(text, position, length);
call prxnext(ExpressionID, start, stop, text, position, length);
end;
extraction = catx(" ",of _extraction(*));
drop _: expressionID start stop position length ;
run;
Output:
You can find interesting documentation here:
- CALL PRXNEXT Routine
- SAS REGEX TIP SHEET https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf
Hope this helps!
All the best,
Just a follow-up question. In case there are duplicated matches per row, is it possible to adapt the code in a simple way so that is doesn't show those duplicates? Basically the matches inside the array should be unique. I guess there was a similar discussion already, but maybe you have different solutions.
Thanks.
Hi @Matos
I have modified the code as follows so as to suppress the array and add a proc transpose after managing duplicates. Does that answer your question?
Best,
data have;
input text & $200.;
cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99) DATA:Y(A_12.00)
A(W_99.99.a) A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;
data have2;
set have;
length extraction $ 200;
expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
start = 1;
stop = length(text);
call prxnext(ExpressionID, start, stop, text, position, length);
do while (position > 0);
extraction = strip(substr(text, position, length));
call prxnext(ExpressionID, start, stop, text, position, length);
output;
end;
drop expressionID start stop position length ;
run;
/* With "duplicates" */
proc sort data=have2 ;
by text extract:;
run;
proc transpose data=have2 out=have_tr (drop=_:) prefix=extract;
var extraction;
by text notsorted;
run;
data want1;
set have_tr;
length extraction $ 200;
extraction = catx(" ", of extract:);
run;
proc print;
title "With potential duplicate matches";
run;
/* Without "duplicates" */
proc sort data=have2 nodupkey;
by text extract:;
run;
proc transpose data=have2 out=have_tr (drop=_:) prefix=extract;
var extraction;
by text notsorted;
run;
data want2;
set have_tr;
length extraction $ 200;
extraction = catx(" ", of extract:);
run;
proc print data=want2;
title "Without duplicate matches";
run;
title;
Thanks Ed! You offered several interesting options, very useful. We could also take have2 (first time it was generated) and apply proc sql disctinct to exlcude duplicates.
create table want2
SELECT DISTINCT text, extraction
from have2;
quit;
please try the below code
data have;
input text & :$200.;
cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;
data want;
set have;
start=1;
stop=length(text);
pattern_id=prxparse('/(\w\_\d{2,2}\.\d{2,2}\.\w)|(\w\_\d{2,2}\.\d{2,2})/');
call prxnext(pattern_id,start,stop,text,position,length);
do while(position>0);
new_text=substr(text,position,length);
call prxnext(pattern_id,start,stop,text,position,length);
output;
end;
run;
data want2;
set want;
by text notsorted;
retain new_text2;
if first.text then new_text2=new_text;
else new_text2=catx(' ',new_text2,new_text);
if last.text;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.