Hi there!
I've been able to get data with nested parentheses to the point of now only having balanced parentheses, but I'm struggling with removing the text (and ideally parentheses also) when there is nesting. Would ideally like to use prxchange and felt like I kept getting close but not quite there.
Have: | Want: |
TTCCH((AA)A)D | TTCCHD |
TTCCH(TT(AA)A)D | TTCCHD |
ABCDEFGH(TTTT)YYYABT | ABCDEFGHYYYABT |
CCGHT()TTCA | CCGHTTTCA |
CHATTTT(A) | CHATTTT |
TATTTT(A) | TATTTT |
CCGG()TTT | CCGGTTT |
CGGAAAA(AA) | CGGAAAA |
CGGAAAA(AA)T | CGGAAAAT |
CGGAAAA((AA)T) | CGGAAAA |
An example already tried but did not work with nesting... although works great with other situations for me: https://communities.sas.com/t5/SAS-Programming/remove-text-inside-brackets-including-brackets/m-p/47...
Novice here with prxchange, but I was trying something along the lines of new= prxchange('s/[\(+\w+\)+]*//i',-1,old)
Would appreciate suggestions, thanks so much!
It depends on how many levels of nested parentheses you have.
data have;
infile datalines truncover dsd;
input (have_str want_str) (:$20.);
datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC
AAAAA,AAAAA
;
data want;
set have;
length derived_str $20;
derived_str=have_str ;
do i=1 to 99;
derived_str=prxchange('s/\([^()]*\)//',-1,strip(derived_str));
end;
check_flg= derived_str=want_str;
drop i;
run;
proc print data=want;
run;
Is this an exercise to learn PRX? Otherwise it's a whole lot easier just to do a word scan:
data Want;
set Have;
want = scan(have,1,'(');
run;
For your sample data, the RegEx below will do the job. It works because .* is greedy: it matches everything from the first ( through the last ) in the string.
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0s9ilagexmjl8n1u7e1t1jfnzlk.h...
data have;
infile datalines truncover dsd;
input (have_str want_str) (:$20.);
datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(AA)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
;
data want;
set have;
length derived_str $20;
derived_str=prxchange('s/\(.*\)//',1,strip(have_str));
check_flg= derived_str=want_str;
run;
proc print data=want;
run;
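To see what the greedy match does, here is a small sketch that compares it with a pattern that cannot cross a parenthesis. The variable names s, greedy and inner are only for illustration, and the two test strings are taken from the sample data in this thread.

```sas
data _null_;
  length s greedy inner $40;
  do s = 'CGGAAAA((AA)T)', 'AAT(BB)ADAA(CCDC)AXC';
    /* greedy .* runs from the first ( to the last ) in the string */
    greedy = prxchange('s/\(.*\)//', 1, strip(s));
    /* [^()]* cannot cross a parenthesis, so only innermost groups match */
    inner  = prxchange('s/\([^()]*\)//', -1, strip(s));
    put s= greedy= inner=;
  end;
run;
```

For the nested string the greedy version should give CGGAAAA while the negated-class version leaves CGGAAAA(T) after a single call. For the string with two separate groups the greedy version should also swallow the ADAA between them.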
Thank you so much, Patrick. This is so close to a complete fix. Is there a way where this won't remove the letters between closed parentheses also? It is throwing out everything between closed parentheses as well now. I have long strings that may have many sets of closed parentheses.
Have:
AAT(BB)ADAA(CCDC)AXC
Want:
AATADAAAXC
As long as you've got balanced brackets, the code below should work.
I've played around with some variations of the RegEx, and I'm still trying to understand why the RegEx below works even without a repeated search/replace (-1 as the 2nd parameter to prxchange) and why it also works with a greedy search for anything but a closing bracket, [^\)]*
data have;
infile datalines truncover dsd;
input (have_str want_str) (:$20.);
datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC
AAAAA,AAAAA
;
data want;
set have;
length derived_str $20;
derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));
check_flg= derived_str=want_str;
run;
proc print data=want;
run;
I found an interesting discussion around this topic here - but as long as your brackets are balanced and you don't want to pick anything between matching brackets, things should work.
Take another look at your code:
derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));
You are modifying the WANT_STR variable, not the HAVE_STR. Using HAVE_STR gives different results.
@Ksharp The code you're proposing was pretty much how I've done it initially.
But then, "playing" with both the code and additional sample data, I found out that the do loop isn't required, that the RegEx only needs to exclude the closing parenthesis, and that I don't even need to set -1 as the parameter (fully working code in my previous post).
WHY this returns the desired result I still don't fully understand but feel once I do I will have gained deeper insight into the SAS RegEx implementation and how function prxchange() really works.
Thank you so much for the resources and help, Patrick. I wasn't able to get the latest solution you posted to work, but I think it was due to this part as I don't in reality have the 'want' string: derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));
@sas504 🤣 Thanks for the feedback. That explains why the RegEx "worked" even though I couldn't understand why.
Below is some code that actually works.
data have;
infile datalines truncover dsd;
input (have_str want_str) (:$20.);
datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC
AAAAA,AAAAA
;
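data want;
set have;
derived_str=have_str;
/* remove one innermost (...) group per pass; 99 is only a safety cap */
do _i=1 to 99;
_derived_str=derived_str;
derived_str=prxchange('s/\([^()]*\)//',1,strip(derived_str));
/* stop as soon as a pass changes nothing */
if _derived_str=derived_str then leave;
end;
check_flg= derived_str=want_str;
drop _:; /* drop the helper variables _i and _derived_str */
run;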
proc print data=want;
run;
If you let the pattern stop at the first right parenthesis, it can remove the wrong substring from nested groups.
Take for example:
TTCCH(TT(AA)A)D
Going from the first ( to the first ) would remove (TT(AA) and leave A).
Excluding the ( as well means that it finds (AA) first. But that leaves (TTA), which is why the loop is needed.
You can however use another regex to know when to stop looping. (Let me rename the variables so that STRING is the original value and WANT is the output variable.)
So first copy the original string. Then keep removing groups as long as there are any more groups to remove.
want=string;
do while(prxmatch('/\([^()]*\)/',want));
want = prxchange('s/\([^()]*\)//',-1,want);
end;
You can use -1 to remove disjoint groups in a single pass, so stripping something like CG(GA)AAA(AA)T takes only one pass through the loop.
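As a minimal sketch, the loop above can be wrapped in a complete data step that also counts the passes. The dataset name demo and the counter passes are made up for this illustration, and the three test strings come from this thread.

```sas
data demo;
  length string want $40;
  input string :$40.;
  want = string;
  passes = 0;
  /* keep going while an innermost (...) group is still present */
  do while (prxmatch('/\([^()]*\)/', want));
    /* -1 removes every innermost group found in this pass */
    want = prxchange('s/\([^()]*\)//', -1, want);
    passes + 1;
  end;
  put string= want= passes=;
datalines;
TTCCH(TT(A(xx)A)A)D
CG(GA)AAA(AA)T
AAAAA
;
```

The triple-nested string should need three passes, the string with two disjoint groups one pass, and the string without parentheses none. Because the loop ends as soon as no innermost pair is left, it should also terminate on unbalanced input, although stray parentheses would then remain in the result.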
Thank you so much for the solution, @Ksharp . This worked very well for every situation I threw at it! I greatly appreciate it!
I see this topic has already been solved, but it may be worth expanding on @SASKiwi's suggestion of the SCAN function (assuming PRX is not required).
Just take the first "word" (using "(" as a word-separator) and the last "word" (using ")" as the separator), and concatenate:
data want;
set have;
new =scan(have_str,1,'(') || scan(have_str,-1,')');
run;
The above assumes:
No loops needed, regardless of how many pairs of parentheses in the string.
@mkeintz - Nice correction, but then I realised my attempt doesn't cater for multiple sets of parentheses, so I think that a DO UNTIL (or WHILE) loop is still needed for that case.