Hi there!
I've been able to get data with nested parentheses to the point of now only having balanced parentheses, but I'm struggling with removing the text (and ideally parentheses also) when there is nesting. Would ideally like to use prxchange and felt like I kept getting close but not quite there.
Have: | Want: |
TTCCH((AA)A)D | TTCCHD |
TTCCH(TT(AA)A)D | TTCCHD |
ABCDEFGH(TTTT)YYYABT | ABCDEFGHYYYABT |
CCGHT()TTCA | CCGHTTTCA |
CHATTTT(A) | CHATTTT |
TATTTT(A) | TATTTT |
CCGG()TTT | CCGGTTT |
CGGAAAA(AA) | CGGAAAA |
CGGAAAA(AA)T | CGGAAAAT |
CGGAAAA((AA)T) | CGGAAAA |
An example already tried but did not work with nesting... although works great with other situations for me: https://communities.sas.com/t5/SAS-Programming/remove-text-inside-brackets-including-brackets/m-p/47...
Novice here with prxchange, but I was trying something along the lines of new= prxchange('s/[\(+\w+\)+]*//i',-1,old)
Would appreciate suggestions, thanks so much!
Accepted Solutions
It depends on how many levels of nested parentheses you have.
data have;
infile datalines truncover dsd;
input (have_str want_str) (:$20.);
datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC
AAAAA,AAAAA
;
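data want;
set have;
length derived_str $20;
derived_str=have_str ;
/* Each pass removes the innermost (...) groups; 99 is an arbitrary upper bound on the number of passes. */
do i=1 to 99;
derived_str=prxchange('s/\([^()]*\)//',-1,strip(derived_str));
end;
/* 1 if the result equals the expected value, 0 otherwise. */
check_flg= derived_str=want_str;
drop i;
run;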
proc print data=want;
run;
Is this an exercise to learn PRX? Otherwise it's a whole lot easier just to do a word scan:
data Want;
set Have;
want = scan(have,1,'(');
run;
For your sample data, the RegEx below will do the job. It works because .* is greedy.
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p0s9ilagexmjl8n1u7e1t1jfnzlk.h...
data have;
infile datalines truncover dsd;
input (have_str want_str) (:$20.);
datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(AA)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
;
data want;
set have;
length derived_str $20;
derived_str=prxchange('s/\(.*\)//',1,strip(have_str));
check_flg= derived_str=want_str;
run;
proc print data=want;
run;
Thank you so much, Patrick. This is so close to a complete fix. Is there a way to keep the letters that sit between separate sets of parentheses? Right now it throws out everything from the first opening parenthesis to the last closing one. I have long strings that may contain many sets of parentheses.
Have:
AAT(BB)ADAA(CCDC)AXC
Want:
AATADAAAXC
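A minimal sketch of why the greedy pattern removes the text between the groups, using only the string above. The variable names str, greedy and innermost are hypothetical and chosen for illustration. The greedy .* stretches from the first opening parenthesis to the last closing one, whereas a negated character class stops each match at the nearest closing parenthesis.

```sas
data _null_;
  length str greedy innermost $20;
  str = 'AAT(BB)ADAA(CCDC)AXC';

  /* Greedy: .* runs from the first opening parenthesis to the LAST closing one. */
  greedy = prxchange('s/\(.*\)//', 1, strip(str));

  /* Negated class: each match is one group with no parentheses inside it. */
  /* The -1 repeats the replacement for every such match in the string.    */
  innermost = prxchange('s/\([^()]*\)//', -1, strip(str));

  put str= / greedy= / innermost=;
run;
```

On this string the greedy version should leave AATAXC, while the negated-class version should leave AATADAAAXC. A single pass of the second pattern does not handle nesting on its own.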
As long as you've got balanced brackets, the code below should work.
I've played around with some variations of the RegEx, and I'm still trying to understand why the RegEx below works even without a repeated search/replace (-1 as the 2nd parameter to prxchange) and why it also works with a greedy search for anything but a closing bracket, [^\)]*
data have;
infile datalines truncover dsd;
input (have_str want_str) (:$20.);
datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC
AAAAA,AAAAA
;
data want;
set have;
length derived_str $20;
derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));
check_flg= derived_str=want_str;
run;
proc print data=want;
run;
I found an interesting discussion around this topic here - but as long as your brackets are balanced and you don't want to pick anything between matching brackets, things should work.
Take another look at your code:
derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));
You are modifying the WANT_STR variable, not the HAVE_STR. Using HAVE_STR gives different results.
@Ksharp The code you're proposing is pretty much how I did it initially.
But then, "playing" with both the code and additional sample data, I found that the DO loop isn't required, that the RegEx only needs to exclude the closing parenthesis, and that I don't even need to set -1 as the parameter (fully working code in my previous post).
WHY this returns the desired result I still don't fully understand, but I feel that once I do I will have gained deeper insight into the SAS RegEx implementation and how the function prxchange() really works.
Thank you so much for the resources and help, Patrick. I wasn't able to get the latest solution you posted to work, but I think it was due to this part as I don't in reality have the 'want' string: derived_str=prxchange('s/\([^\)]*\)//',1,strip(want_str));
@sas504 🤣 Thanks for the feedback. That explains why the RegEx "worked" even though I couldn't understand why.
Below is some code that actually works.
data have;
infile datalines truncover dsd;
input (have_str want_str) (:$20.);
datalines;
TTCCH((AA)A)D,TTCCHD
TTCCH(TT(A(xx)A)A)D,TTCCHD
ABCDEFGH(TTTT)YYYABT,ABCDEFGHYYYABT
CCGHT()TTCA,CCGHTTTCA
CHATTTT(A),CHATTTT
TATTTT(A),TATTTT
CCGG()TTT,CCGGTTT
CGGAAAA(AA),CGGAAAA
CGGAAAA(AA)T,CGGAAAAT
CGGAAAA((AA)T),CGGAAAA
AAT(BB)ADAA(CCDC)AXC,AATADAAAXC
AAAAA,AAAAA
;
data want;
set have;
derived_str=have_str;
do _i=1 to 99;
_derived_str=derived_str;
derived_str=prxchange('s/\([^()]*\)//',1,strip(derived_str));
if _derived_str=derived_str then leave;
end;
check_flg= derived_str=want_str;
run;
proc print data=want;
run;
If you allow the pattern to stop at the first right parenthesis then you can take the wrong string from nested groups.
Take for example:
TTCCH(TT(AA)A)D
Going from the first ( to the first ) would remove (TT(AA) and leave A).
Excluding the ( from the character class means that it finds (AA) instead. But that leaves (TTA), which is why the loop is needed.
You can, however, use another regex to know when to stop looping. (Let me rename the variables so that STRING is the original value and WANT is the output variable.)
So first copy the original string. Then keep removing groups as long as there are any more groups to remove.
want=string;
do while(prxmatch('/\([^()]*\)/',want));
want = prxchange('s/\([^()]*\)//',-1,want);
end;
You can use the -1 to make it find disjoint groups in a single pass. So stripping something like CG(GA)AAA(AA)T takes only one pass through the loop.
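The loop above can be sketched as a complete step that also counts how many passes each string needs. This is an illustration under assumptions rather than a definitive implementation: the data set name demo, the counter passes and the three test strings are hypothetical.

```sas
data demo;
  infile datalines truncover;
  input string :$20.;
  length want $20;

  want   = string;
  passes = 0;

  /* Keep going while an innermost group (no parentheses inside it) remains. */
  do while (prxmatch('/\([^()]*\)/', want));
    /* -1 removes every innermost group found in this pass. */
    want   = prxchange('s/\([^()]*\)//', -1, want);
    passes = passes + 1;
  end;
  datalines;
CG(GA)AAA(AA)T
TTCCH(TT(AA)A)D
AAAAA
;
run;

proc print data=demo;
run;
```

With -1, the number of passes should equal the deepest nesting level in the string: one for the first row, two for the second, and none for a string without parentheses.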
Yeah, I stole most of your code.
My code removes nested parentheses via a DO loop, one nesting level per pass. For example:
TTCCH(TT(A(xx)A)A)D,TTCCHD
first time is --> TTCCH(TT(AA)A)D,TTCCHD
second time is --> TTCCH(TTA)D,TTCCHD
third time is --> TTCCHD,TTCCHD
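As for 1 versus -1: it doesn't matter which you use, as long as 99 is big enough. With -1 the loop needs one pass per nesting level; with 1 it needs one pass per pair of parentheses.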
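A small sketch comparing the two settings on one string, assuming the same innermost-group pattern. The test string and the names s_all, s_one, n_all and n_one are hypothetical. Both loops stop as soon as no group is left, so the counters show how many passes each setting really needs.

```sas
data _null_;
  length s_all s_one $30;
  s_all = 'A(B)C(D(E))F';
  s_one = s_all;
  n_all = 0;
  n_one = 0;

  /* -1: every innermost group is removed in each pass. */
  do while (prxmatch('/\([^()]*\)/', s_all));
    s_all = prxchange('s/\([^()]*\)//', -1, s_all);
    n_all = n_all + 1;
  end;

  /* 1: only the first innermost group is removed in each pass. */
  do while (prxmatch('/\([^()]*\)/', s_one));
    s_one = prxchange('s/\([^()]*\)//', 1, s_one);
    n_one = n_one + 1;
  end;

  put s_all= n_all= / s_one= n_one=;
run;
```

Both variables should end up as ACF. The -1 loop should take two passes (the nesting depth) and the 1 loop three (the number of pairs), so a fixed upper bound such as 99 has to cover whichever count applies.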
Thank you so much for the solution, @Ksharp . This worked very well for every situation I threw at it! I greatly appreciate it!
I see this topic has already been solved, but it may be worth expanding on @SASKiwi's suggestion of the SCAN function (assuming PRX is not required).
Just take the first "word" (using "(" as a word-separator) and the last "word" (using ")" as the separator), and concatenate:
data want;
set have;
new =scan(have_str,1,'(') || scan(have_str,-1,')');
run;
The above assumes:
- There is at least one "(" and one ")".
- The left parenthesis does not appear in the first character of the string.
No loops needed, regardless of how many pairs of parentheses in the string.
@mkeintz - Nice correction, but then I realised my attempt doesn't cater for multiple sets of parentheses, so I think that a DO UNTIL (or WHILE) loop is still needed for that case.