Solved: Re: remove text inside nested parentheses using PRX

sas504 · Posted 12-30-2022 07:19 PM

Hi there!

I've been able to get data with nested parentheses to the point of now only having balanced parentheses, but I'm struggling with removing the text (and ideally parentheses also) when there is nesting. Would ideally like to use prxchange and felt like I kept getting close but not quite there.

Have:	Want:
TTCCH((AA)A)D	TTCCHD
TTCCH(TT(AA)A)D	TTCCHD
ABCDEFGH(TTTT)YYYABT	ABCDEFGHYYYABT
CCGHT()TTCA	CCGHTTTCA
CHATTTT(A)	CHATTTT
TATTTT(A)	TATTTT
CCGG()TTT	CCGGTTT
CGGAAAA(AA)	CGGAAAA
CGGAAAA(AA)T	CGGAAAAT
CGGAAAA((AA)T)	CGGAAAA

An example already tried but did not work with nesting... although works great with other situations for me: https://communities.sas.com/t5/SAS-Programming/remove-text-inside-brackets-including-brackets/m-p/47...

Novice here with prxchange, but I was trying something along the lines of new= prxchange('s/[\(+\w+\)+]*//i',-1,old)

Would appreciate suggestions, thanks so much!

Ksharp · Posted 12-31-2022 04:33 AM

It depends on how many levels of nested parentheses you have.

data have;
  infile datalines truncover dsd;
  input (have_str want_str) (:$20.);
  datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC 
AAAAA,AAAAA
;

data want;
  set have;
  length derived_str $20;
  derived_str=have_str;
  do i=1 to 99;
    derived_str=prxchange('s/\([^()]*\)//',-1,strip(derived_str));
  end;
  check_flg= derived_str=want_str;
  drop i;
run;

proc print data=want;
run;

SASKiwi · Posted 12-30-2022 08:24 PM

Is this an exercise to learn PRX? Otherwise it's a whole lot easier just to do a word scan:

data Want;
  set Have;
  want = scan(have,1,'(');
run;

Patrick · Posted 12-30-2022 09:01 PM

For your sample data below RegEx will do the job. It works because .* is greedy.
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0s9ilagexmjl8n1u7e1t1jfnzlk.h...

data have;
  infile datalines truncover dsd;
  input (have_str want_str) (:$20.);
  datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(AA)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
;

data want;
  set have;
  length derived_str $20;
  derived_str=prxchange('s/\(.*\)//',1,strip(have_str));
  check_flg= derived_str=want_str;
run;

proc print data=want;
run;

sas504 · Posted 12-31-2022 12:00 AM

Thank you so much, Patrick. This is so close to a complete fix. Is there a way where this won't remove the letters between closed parentheses also? It is throwing out everything between closed parentheses as well now. I have long strings that may have many sets of closed parentheses.

Have:

AAT(BB)ADAA(CCDC)AXC

Want:

AATADAAAXC

Patrick · Posted 12-31-2022 01:54 AM

As long as you've got balanced brackets, the code below should work.

I've played around with some variations of the RegEx, and I'm still trying to understand why the RegEx below works even without a repeated search/replace (-1 as the 2nd parameter to prxchange) and why it also works with a greedy search for anything but a closing bracket, [^\)]*

data have;
  infile datalines truncover dsd;
  input (have_str want_str) (:$20.);
  datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC 
AAAAA,AAAAA
;

data want;
  set have;
  length derived_str $20;
  derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));
  check_flg= derived_str=want_str;
run;

proc print data=want;
run;

I found an interesting discussion around this topic here - but as long as your brackets are balanced and you don't want to pick anything between matching brackets, things should work.

s_lassen · Posted 12-31-2022 06:29 AM

Take another look at your code:

derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));

You are modifying the WANT_STR variable, not the HAVE_STR. Using HAVE_STR gives different results.
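A minimal sketch of the difference, using one nested value from the sample data (the DATA _NULL_ step and the PUT statement are only for illustration, not part of the posted code):

```sas
data _null_;
  length have_str derived_str $20;
  have_str='TTCCH((AA)A)D';
  /* [^\)]* is allowed to run across an opening parenthesis,   */
  /* so the first match is "((AA)" rather than a balanced group */
  derived_str=prxchange('s/\([^\)]*\)//',1,strip(have_str));
  put have_str= derived_str=;
run;
```

With HAVE_STR as the source, the log should show an unmatched closing parenthesis left behind (TTCCHA)D), which is why this pattern only appeared to work when it was applied to WANT_STR.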

Patrick · Posted 12-31-2022 04:54 AM

@Ksharp The code you're proposing was pretty much how I've done it initially.

But then, "playing" with both the code and additional sample data, I found out that the do loop isn't required, that the character class in the RegEx only needs the closing parenthesis, and that I don't even need to set -1 as the parameter (fully working code in my previous post).

WHY this returns the desired result I still don't fully understand but feel once I do I will have gained deeper insight into the SAS RegEx implementation and how function prxchange() really works.

sas504 · Posted 12-31-2022 10:44 AM

Thank you so much for the resources and help, Patrick. I wasn't able to get the latest solution you posted to work, but I think it was due to this part as I don't in reality have the 'want' string: derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));

Patrick · Posted 12-31-2022 05:23 PM

@sas504 🤣 Thanks for the feedback. That explains why the RegEx "worked" even though I couldn't understand why.

Below is some actually working code.

data have;
  infile datalines truncover dsd;
  input (have_str want_str) (:$20.);
  datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC 
AAAAA,AAAAA
;

data want;
  set have;
  derived_str=have_str;
  do _i=1 to 99;
    _derived_str=derived_str;
    derived_str=prxchange('s/\([^()]*\)//',1,strip(derived_str));
    if _derived_str=derived_str then leave;
  end;
  check_flg= derived_str=want_str;
run;

proc print data=want;
run;

Tom · Posted 12-31-2022 12:44 PM

If you allow the pattern to stop at the first right parenthesis then you can take the wrong string from nested groups.

Take for example:

TTCCH(TT(AA)A)D

Going from the first ( to the first ) would remove (TT(AA) and leave A).

Excluding the ( as well means that it finds the (AA). But that then leaves (TTA), which is why the loop is needed.

You can however use another regex to know when to stop looping. (Let me rename the variables so that STRING is the original value and WANT is the output variable.)

So first copy the original string. Then keep removing groups as long as there are any more groups to remove.

 want=string;
 do while(prxmatch('/\([^()]*\)/',want));
   want = prxchange('s/\([^()]*\)//',-1,want);
 end;

You can use the -1 to make it remove disjoint groups in a single pass. So stripping something like CG(GA)AAA(AA)T will only take one pass through the loop.
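Put together as a complete step, the loop above might look like the sketch below. The three sample values and the PASSES counter are assumptions added only to show how often the loop runs; they are not part of the original snippet.

```sas
data have;
  infile datalines truncover;
  input string :$20.;
  datalines;
TTCCH(TT(A(xx)A)A)D
CG(GA)AAA(AA)T
AAAAA
;

data want;
  set have;
  length want $20;
  want=string;
  passes=0;  /* hypothetical counter, for illustration only */
  /* keep going while an innermost group (no parentheses inside) exists */
  do while(prxmatch('/\([^()]*\)/',want));
    want=prxchange('s/\([^()]*\)//',-1,strip(want));
    passes=passes+1;
  end;
run;

proc print data=want;
run;
```

The triple-nested value should need three passes, the value with two disjoint groups one pass, and the value without parentheses none. Because every pass removes at least one group, the loop always terminates.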

Ksharp · Posted 01-01-2023 03:16 AM

Patrick,
Yeah, I stole most of your code.
My code removes nested parentheses via a DO loop, stripping the innermost groups on each pass. For example:
TTCCH(TT(A(xx)A)A)D,TTCCHD
first time is --> TTCCH(TT(AA)A)D,TTCCHD
second time is --> TTCCH(TTA)D,TTCCHD
third time is --> TTCCHD,TTCCHD

About the 1 or -1: it doesn't matter whether you use 1 or -1, as long as 99 is big enough.
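A rough sketch of that last point, assuming a WHILE loop with a pass counter in place of the fixed 99 iterations (the variable names are made up for this illustration):

```sas
data _null_;
  length one_at_a_time all_at_once $20;
  one_at_a_time='AAT(BB)ADAA(CCDC)AXC';
  all_at_once=one_at_a_time;

  /* times=1: each pass removes only the first innermost group */
  passes_1=0;
  do while(prxmatch('/\([^()]*\)/',one_at_a_time));
    one_at_a_time=prxchange('s/\([^()]*\)//',1,strip(one_at_a_time));
    passes_1=passes_1+1;
  end;

  /* times=-1: each pass removes every innermost group it finds */
  passes_all=0;
  do while(prxmatch('/\([^()]*\)/',all_at_once));
    all_at_once=prxchange('s/\([^()]*\)//',-1,strip(all_at_once));
    passes_all=passes_all+1;
  end;

  put one_at_a_time= passes_1= / all_at_once= passes_all=;
run;
```

Both variables should end up as AATADAAAXC; the only difference is the number of passes (two with 1, one with -1), so the choice affects how many iterations are needed, not the result.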

sas504 · Posted 12-31-2022 10:47 AM

Thank you so much for the solution, @Ksharp . This worked very well for every situation I threw at it! I greatly appreciate it!

mkeintz · Posted 12-31-2022 12:34 PM

I see this topic has already been solved, but it may be worth expanding on @SASKiwi's suggestion of the SCAN function (assuming PRX is not required).

Just take the first "word" (using "(" as a word-separator) and the last "word" (using ")" as the separator), and concatenate:

data want;
  set have;
  new =scan(have_str,1,'(') || scan(have_str,-1,')');
run;

The above assumes:

There is at least one "(" and one ")".
The left parenthesis does not appear in the first character of the string.

No loops needed, regardless of how many pairs of parentheses in the string.

SASKiwi · Posted 12-31-2022 02:32 PM

@mkeintz - Nice correction, but then I realised my attempt doesn't cater for multiple sets of parentheses, so I think that a DO UNTIL (or WHILE) loop is still needed for that case.
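To illustrate with the multi-group value from earlier in the thread, here is a small sketch. It uses CATS instead of the || operator purely to sidestep any padding questions in this example; that swap is an assumption of the sketch, not part of the code posted above.

```sas
data _null_;
  length have_str new $20;
  have_str='AAT(BB)ADAA(CCDC)AXC';
  /* first "word" before any "(" plus last "word" after any ")" */
  new=cats(scan(have_str,1,'('),scan(have_str,-1,')'));
  put have_str= new=;
run;
```

This should give AATAXC rather than the wanted AATADAAAXC, because the text between the two groups (ADAA) is neither the first nor the last "word".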
