sas504
Fluorite | Level 6

Hi there! 

I've been able to get data with nested parentheses to the point of now only having balanced parentheses, but I'm struggling with removing the text (and ideally parentheses also) when there is nesting. Would ideally like to use prxchange and felt like I kept getting close but not quite there. 

 

Have:                   Want:
TTCCH((AA)A)D           TTCCHD
TTCCH(TT(AA)A)D         TTCCHD
ABCDEFGH(TTTT)YYYABT    ABCDEFGHYYYABT
CCGHT()TTCA             CCGHTTTCA
CHATTTT(A)              CHATTTT
TATTTT(A)               TATTTT
CCGG()TTT               CCGGTTT
CGGAAAA(AA)             CGGAAAA
CGGAAAA(AA)T            CGGAAAAT
CGGAAAA((AA)T)          CGGAAAA

 

One approach I already tried works great for me in other situations, but not with nesting: https://communities.sas.com/t5/SAS-Programming/remove-text-inside-brackets-including-brackets/m-p/47...

 

Novice here with prxchange, but I was trying something along the lines of: new= prxchange('s/[\(+\w+\)+]*//i',-1,old)

 

Would appreciate suggestions, thanks so much! 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

It depends on how many levels of nested parentheses you have.

 

data have;
  infile datalines truncover dsd;
  input (have_str want_str) (:$20.);
  datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC 
AAAAA,AAAAA
;

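data want;
  set have;
  length derived_str $20;
  derived_str=have_str;
  /* Each pass removes the innermost (...) groups, so 99 passes cover up to 99 nesting levels */
  do i=1 to 99;
    derived_str=prxchange('s/\([^()]*\)//',-1,strip(derived_str));
  end;
  /* 1 when the derived string equals the expected string, otherwise 0 */
  check_flg= derived_str=want_str;
  drop i;
run;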

proc print data=want;
run;


18 REPLIES
SASKiwi
PROC Star

Is this an exercise to learn PRX? Otherwise it's a whole lot easier just to do a word scan:

data Want;
  set Have;
  want = scan(have,1,'(');
run;
Patrick
Opal | Level 21

For your sample data, the RegEx below will do the job. It works because .* is greedy, so the match runs from the first opening parenthesis to the last closing one.
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0s9ilagexmjl8n1u7e1t1jfnzlk.h... 

data have;
  infile datalines truncover dsd;
  input (have_str want_str) (:$20.);
  datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(AA)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
;

data want;
  set have;
  length derived_str $20;
  derived_str=prxchange('s/\(.*\)//',1,strip(have_str));
  check_flg= derived_str=want_str;
run;

proc print data=want;
run;


sas504
Fluorite | Level 6

Thank you so much, Patrick. This is so close to a complete fix. Is there a way where this won't remove the letters between closed parentheses also? It is throwing out everything between closed parentheses as well now. I have long strings that may have many sets of closed parentheses.

 

Have:

AAT(BB)ADAA(CCDC)AXC 

 

Want:

AATADAAAXC 

Patrick
Opal | Level 21

As long as you've got balanced brackets, the code below should work.

I've played around with some variations of the RegEx, and I'm still trying to understand why the RegEx below works even without a repeated search/replace (-1 as the 2nd argument to prxchange), and why it also works with a greedy search for anything but a closing bracket: [^\)]*

data have;
  infile datalines truncover dsd;
  input (have_str want_str) (:$20.);
  datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC 
AAAAA,AAAAA
;

data want;
  set have;
  length derived_str $20;
  derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));
  check_flg= derived_str=want_str;
run;

proc print data=want;
run;


I found an interesting discussion around this topic here - but as long as your brackets are balanced and you don't want to pick anything between matching brackets, things should work. 

 


s_lassen
Meteorite | Level 14

Take another look at your code:

derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));

You are modifying the WANT_STR variable, not the HAVE_STR. Using HAVE_STR gives different results.
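To illustrate, here is a minimal sketch of what that pattern does when it is pointed at HAVE_STR instead. The data set name demo and the variable one_pass are made up for this example; the string comes from the sample data. Because [^\)]* also matches an opening parenthesis, the match runs from the first ( to the first ), so a nested value should keep a stray A) (I would expect one_pass=TTCCHA)D in the log).

```sas
data demo;
  length have_str one_pass $20;
  have_str='TTCCH(TT(AA)A)D';
  /* [^\)]* also matches "(", so the match spans "(TT(AA)" */
  one_pass=prxchange('s/\([^\)]*\)//',1,strip(have_str));
  /* write both values to the log for comparison */
  put have_str= one_pass=;
run;
```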

Patrick
Opal | Level 21

@Ksharp The code you're proposing is pretty much how I did it initially.


But then, "playing" with both the code and additional sample data, I found out that the do loop isn't required, that the RegEx only needs the closing parenthesis, and that I don't even need to set -1 as the parameter (fully working code in my previous post).


WHY this returns the desired result I still don't fully understand but feel once I do I will have gained deeper insight into the SAS RegEx implementation and how function prxchange() really works.

 

 

 

sas504
Fluorite | Level 6

Thank you so much for the resources and help, Patrick. I wasn't able to get the latest solution you posted to work, but I think it was due to this part as I don't in reality have the 'want' string: derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));

Patrick
Opal | Level 21

@sas504 🤣 Thanks for the feedback. That explains why the RegEx "worked" even though I couldn't understand why. 

Below is some code that actually works.

data have;
  infile datalines truncover dsd;
  input (have_str want_str) (:$20.);
  datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC 
AAAAA,AAAAA
;

data want;
  set have;
  derived_str=have_str;
  do _i=1 to 99;
    _derived_str=derived_str;
    derived_str=prxchange('s/\([^()]*\)//',1,strip(derived_str));
    if _derived_str=derived_str then leave;
  end;
  check_flg= derived_str=want_str;
run;

proc print data=want;
run;
Tom
Super User

If you allow the pattern to stop at the first right parenthesis, then you can remove the wrong text from nested groups.

Take for example:

TTCCH(TT(AA)A)D

Going from the first ( to the first ) would remove (TT(AA) and leave A).

Excluding the ( as well means that it finds only (AA). But that leaves (TTA), which is why the loop is needed.

 

You can, however, use another regex to know when to stop looping. (Let me rename the variables so that STRING is the original value and WANT is the output variable.)

So first copy the original string. Then keep removing groups as long as there are any more groups to remove.

 want=string;
 do while(prxmatch('/\([^()]*\)/',want));
   want = prxchange('s/\([^()]*\)//',-1,want);
 end;

You can use the -1 to make it remove disjoint groups in a single pass. So stripping something like CG(GA)AAA(AA)T takes only one pass through the loop.
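As a rough sketch, the loop can be wrapped in a complete step that also counts the passes. The data set name demo and the counter passes are made up for this example, and the three test strings are taken from earlier in the thread. If I trace it correctly, the disjoint example needs one pass, the three-level nested one needs three, and a string without parentheses needs none.

```sas
data demo;
  length string want $20;
  input string $;
  want=string;
  passes=0;  /* hypothetical counter, reset for every observation */
  /* keep going while an innermost group (no parentheses inside) remains */
  do while(prxmatch('/\([^()]*\)/',want));
    want=prxchange('s/\([^()]*\)//',-1,want);
    passes+1;
  end;
  put string= want= passes=;
  datalines;
CG(GA)AAA(AA)T
TTCCH(TT(A(xx)A)A)D
AAAAA
;
run;
```

Because the loop condition is the same pattern as the replacement, the loop stops as soon as there is nothing left to remove, however deep the nesting is.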

Ksharp
Super User
Patrick,
Yeah, I stole most of your code.
My code removes nested parentheses via a DO loop, one nesting level per pass. For example:
TTCCH(TT(A(xx)A)A)D,TTCCHD
first pass --> TTCCH(TT(AA)A)D,TTCCHD
second pass --> TTCCH(TTA)D,TTCCHD
third pass --> TTCCHD,TTCCHD

As for 1 versus -1: it doesn't matter which you use, as long as 99 is big enough (with -1 the loop needs at least as many passes as the deepest nesting level; with 1 it needs one pass per pair of parentheses).
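A small sketch to compare the two settings side by side. The variable names s_one and s_all and the test string A(B)C(D(E)F)G are made up for this illustration and are not from the sample data. With -1 the first pass should remove both (B) and (E), while with 1 it removes only (B); after enough passes both variables should end up as ACG.

```sas
data _null_;
  length s_one s_all $20;
  s_one='A(B)C(D(E)F)G';  /* processed with times=1  */
  s_all=s_one;            /* processed with times=-1 */
  do i=1 to 99;
    s_one=prxchange('s/\([^()]*\)//',1,strip(s_one));
    s_all=prxchange('s/\([^()]*\)//',-1,strip(s_all));
    /* show only the first three passes in the log */
    if i<=3 then put i= s_all= s_one=;
  end;
  /* 1 if both settings end with the same string */
  same=(s_one=s_all);
  put same=;
run;
```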
sas504
Fluorite | Level 6

Thank you so much for the solution, @Ksharp . This worked very well for every situation I threw at it! I greatly appreciate it!

mkeintz
PROC Star

I see this topic has already been solved, but it may be worth expanding on @SASKiwi's suggestion of the SCAN function (assuming PRX is not required).

 

Just take the first "word" (using "(" as a word-separator) and the last "word" (using ")" as the separator), and concatenate:

 

data want;
  set have;
  new =scan(have_str,1,'(') || scan(have_str,-1,')');
run;

 

The above assumes:

  1. There is at least one "(" and one ")".
  2. The left parenthesis does not appear in the first character of the string.

No loops needed, regardless of how many pairs of parentheses in the string.

SASKiwi
PROC Star

@mkeintz - Nice correction, but then I realised my attempt doesn't cater for multiple sets of parentheses, so I think that a DO UNTIL (or WHILE) loop is still needed for that case.
