BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Matos
Obsidian | Level 7

Dear all,

I am new with Regex and I would like to use it to extract from a string all substrings that match a specific pattern. The output should display, in a new column (called extraction), all matching substrings together separated by a space. Below you can find the initial dataset.

 

 

data have;

input text & $200.;

cards;

{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}

DATA:Y(A_12.00)<=Z(Y_99.99)

A(W_99.99.a)

{?xxZ_99.00} {Z(A_15.07.n)}

;

run;

 

 

The matching string could have the following pattern:

 

- one uppercase letter, followed by underscore and two digits, a dot plus another two digits (ex. Z_99.00) or

- one uppercase letter, followed by underscore and two digits, a dot plus another two digits plus another dot and letter (ex: A_15.00.n)

 

Desired result:

 
textExtraction
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}A_15.01.n Z_18.10.c Y_18.04.a X_20.99.d
DATA:Y(A_12.00)<=Z(Y_99.99)A_12.00 Y_99.99
A(W_99.99.a)W_99.99.a
{?xxZ_99.00} {Z(A_15.07.n)}Z_99.00 A_15.07.n
 

Any help on this topic appreciated.

 

Many thanks.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ed_sas_member
Meteorite | Level 14

Hi @Matos 

 

Here is a way to achieve this:

 

data have;
	input text & $200.;
	cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;

data want;
	set have;
	length extraction $ 200;
	array _extraction (200) $ 200;
	expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
	start = 1;
	stop = length(text);
	
	call prxnext(ExpressionID, start, stop, text, position, length);
      do while (position > 0);
         _extraction(position) = substr(text, position, length);
         call prxnext(ExpressionID, start, stop, text, position, length);
      end;	
     
    extraction = catx(" ",of _extraction(*));
      
	drop _: expressionID start	stop position length ;
run;

Output:

Capture d’écran 2020-05-03 à 12.31.12.png

 

You can find interesting documentation here:

- CALL PRXNEXT Routine

 https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n1obc9u7z3225mn1npwnassehff0.htm...

- SAS REGEX TIP SHEET https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf 

 

Hope this helps!

 

All the best,

View solution in original post

8 REPLIES 8
ed_sas_member
Meteorite | Level 14

Hi @Matos 

 

Here is a way to achieve this:

 

data have;
	input text & $200.;
	cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;

data want;
	set have;
	length extraction $ 200;
	array _extraction (200) $ 200;
	expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
	start = 1;
	stop = length(text);
	
	call prxnext(ExpressionID, start, stop, text, position, length);
      do while (position > 0);
         _extraction(position) = substr(text, position, length);
         call prxnext(ExpressionID, start, stop, text, position, length);
      end;	
     
    extraction = catx(" ",of _extraction(*));
      
	drop _: expressionID start	stop position length ;
run;

Output:

Capture d’écran 2020-05-03 à 12.31.12.png

 

You can find interesting documentation here:

- CALL PRXNEXT Routine

 https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n1obc9u7z3225mn1npwnassehff0.htm...

- SAS REGEX TIP SHEET https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf 

 

Hope this helps!

 

All the best,

Matos
Obsidian | Level 7
Thank you very much, also for the documentation! Really useful.
ed_sas_member
Meteorite | Level 14
You're welcome @Matos !
Matos
Obsidian | Level 7

Just a follow-up question. In case there are duplicated matches per row, is it possible to adapt the code in a simple way so that is doesn't show those duplicates? Basically the matches inside the array should be unique. I guess there was a similar discussion already, but maybe you have different solutions.

 

https://communities.sas.com/t5/SAS-Programming/Removing-repeated-substring-in-String-Observation-fro...

https://communities.sas.com/t5/General-SAS-Programming/remove-duplicate-numbers-in-a-string-of-numbe...

 

Thanks.

ed_sas_member
Meteorite | Level 14

Hi @Matos 

 

I have modified the code as follows so as to suppress the array and add a proc transpose after managing duplicates. Does that answer your question?

Best,

data have;
	input text & $200.;
	cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99) DATA:Y(A_12.00)
A(W_99.99.a) A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;

data have2;
	set have;
	length extraction $ 200;
	expressionID = prxparse('/[A-Z]_\d{2}\.\d{2}(\.[a-z])?/');
	start = 1;
	stop = length(text);
	
	call prxnext(ExpressionID, start, stop, text, position, length);
      do while (position > 0);
         extraction = strip(substr(text, position, length));
         call prxnext(ExpressionID, start, stop, text, position, length);
         output;
      end;	
      
	drop expressionID start	stop position length ;
run;

/* With "duplicates" */

proc sort data=have2 ;
	by text extract:;
run;

proc transpose data=have2 out=have_tr (drop=_:) prefix=extract;
	var extraction;
	by text notsorted;
run;

data want1;
	set have_tr;
	length extraction $ 200;
	extraction = catx(" ", of extract:);
run;

proc print;
	title "With potential duplicate matches";
run;

/* Without "duplicates" */

proc sort data=have2 nodupkey;
	by text extract:;
run;

proc transpose data=have2 out=have_tr (drop=_:) prefix=extract;
	var extraction;
	by text notsorted;
run;

data want2;
	set have_tr;
	length extraction $ 200;
	extraction = catx(" ", of extract:);
run;

proc print data=want2;
	title "Without duplicate matches";
run;

title;

Capture d’écran 2020-05-03 à 18.00.00.png

Matos
Obsidian | Level 7

Thanks Ed! You offered several interesting options, very useful. We could also take have2 (first time it was generated) and apply proc sql disctinct to exlcude duplicates.

 

create table want2 
SELECT DISTINCT text, extraction
from have2;
quit;

 

Jagadishkatam
Amethyst | Level 16

please try the below code

 

data have;
input text & :$200.;
cards;
{Z(A_15.01.n)+{P(Z_18.10.c)999(Y_18.04.a)B(999)} + {T(X_20.99.d)}
DATA:Y(A_12.00)<=Z(Y_99.99)
A(W_99.99.a)
{?xxZ_99.00} {Z(A_15.07.n)}
;
run;

data want;  
set have;   
start=1;   
stop=length(text);   
pattern_id=prxparse('/(\w\_\d{2,2}\.\d{2,2}\.\w)|(\w\_\d{2,2}\.\d{2,2})/');   
call prxnext(pattern_id,start,stop,text,position,length);    
do while(position>0);    
new_text=substr(text,position,length);
call prxnext(pattern_id,start,stop,text,position,length);  
output; 
end; 
run;

data want2;
set want;
by text notsorted;
retain new_text2;
if first.text then new_text2=new_text;
else new_text2=catx(' ',new_text2,new_text);
if last.text;
run;
Thanks,
Jag
Matos
Obsidian | Level 7
Amazing Jag! That's also a nice way to achieve it.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 4994 views
  • 6 likes
  • 3 in conversation