Hi All,
Is there a quick way to remove unbalanced parentheses/brackets from a string?
For example, in the table below, can I get the column "WANT" as the result, with just the matched parentheses retained?
HAVE | WANT |
(This is a test) | (This is a test) |
(This is a test | This is a test |
This is a test) | This is a test |
(This (is) a test | This (is) a test |
(This is) a test) | (This is) a test |
)This (is a test | This is a test |
Accepted Solutions
Hi @NN,
Quick way? Yes, if (like in your examples) there are no nested pairs of parentheses and you can specify two characters which are not present in the strings to work with.
data want;
set have;
string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
run;
The PRXCHANGE function replaces the parentheses in substrings of the form "(text without parentheses)" by replacement characters (I chose 'þ'='FE'x and 'ÿ'='FF'x, but you may want to use others). The remaining, hence unmatched parentheses are removed by the COMPRESS function. Finally, the TRANSLATE function restores the replaced parentheses.
If nested pairs of parentheses may occur in string, apply PRXCHANGE repetitively:
data want;
set have;
do until(string=lag(string));
string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
end;
string=translate(compress(string,"()"),"()","þÿ");
run;
Of course, this implies rules to decide, e.g., which of the three parentheses in your fourth and fifth example is regarded as unmatched.
@NN wrote:
Is there a quick way to remove unbalanced parentheses /brackets from a string ?
Quick way? No.
You'd have to define a very clear and comprehensive set of rules and then program them up.
Paige Miller
It is relatively easy to determine when a mismatch occurs. Consider:
data junk;
x = "(This is) a test)";
left = countc(x,"(");
right= countc(x,")");
run;
If left is not equal to right, there is a mismatch. (Equal counts do not prove the string is balanced, though: ")This (is a test" has one of each.)
But determining which parentheses match at which nesting level is a good part of what a programming-language syntax parser does. For just one example of what is involved:
https://www.cs.iusb.edu/~dvrajito/teach/c311/c311_11_parser.html
The "trivial" mismatch case of exactly one of the parentheses appearing is easy enough.
data junk;
x = "(This is a test";
left = countc(x,"(");
right= countc(x,")");
if left ne right and max(left, right) = 1 then do;
if left then y=compress(x,'(');
else y=compress(x,')');
end;
run;
But as soon as you get into your other examples, you have to parse the value character by character, keeping track of the positions and order of the ( and ) characters. Only after examining the whole value can you start applying rules, likely working from the "middle" of something outwards.
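One way to do that character-by-character bookkeeping is a simple stack of open-parenthesis positions. The step below is only a sketch under assumptions that are not part of the posts above: the input is a dataset HAVE with a character variable STRING, the byte '01'x never occurs in the data (it is used as a "delete me" marker), and no value has more than 200 open parentheses. The names _open, _top, _j and _ch are made up for illustration.

```sas
data want;
  set have;
  array _open {200} _temporary_;  /* stack of '(' positions; 200 is an assumed upper bound */
  length _ch $1;
  _top = 0;                       /* current stack depth, reset for every observation */
  do _j = 1 to length(string);
    _ch = substr(string, _j, 1);
    if _ch = '(' then do;
      _top = _top + 1;
      _open{_top} = _j;           /* push the position of this '(' */
    end;
    else if _ch = ')' then do;
      if _top > 0 then _top = _top - 1;        /* pop: this ')' closes the latest '(' */
      else substr(string, _j, 1) = '01'x;      /* no '(' is open: mark this ')' for removal */
    end;
  end;
  do _j = 1 to _top;              /* any '(' still on the stack was never closed */
    substr(string, _open{_j}, 1) = '01'x;
  end;
  string = compress(string, '01'x);            /* drop the marked characters */
  drop _ch _top _j;
run;
```

The rule this encodes is "a ) closes the most recent still-open (". Traced by hand on the six HAVE values in the question, it gives the WANT column, but other rules for choosing the unmatched parenthesis are equally defensible.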
Here is tested code:
/***
https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
input - a string
target - remove unbalanced ()
algorithm:
scan the string
keep positions of '(' followed by ')' and remove them
repeat until found:
- ( without ) and remove the (
- ) preceding ( and remove the )
- none found
insert pairs of () back into their positions
***/
%let maxl=30; /* max string length */
data have;
length string $&maxl; /* adapt to max length */
input string $&maxl..;
cards;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want;
set have;
/* use i as indicator to insert couple of parenthesis */
array opn {&maxl} op1-op&maxl; /* position of '(' */
array cls {&maxl} cl1-cl&maxl; /* position of ')' */
status = 0; /* 0=initial, 1=open found, 2=match found, 9=exit */
i=1;
do until (status=9);
link scan;
end;
/* insert parenthesis back to string */
if i>1 then do j=1 to i-1;
substr(string,opn(j),1) = '(';
substr(string,cls(j),1) = ')';
end;
keep string;
RETURN;
SCAN:
do j =1 to length(string);
ch = substr(string,j,1);
if status = 0 and ch=')' then substr(string,j,1) = ' '; else
if ch = '(' then do;
status=1; opn(i)=j;
end;
if status = 1 and ch = ')' then do;
status = 2; cls(i) = j;
substr(string,opn(i),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
i+1;
end;
if j < length(string) then do;
if status=1 then substr(string,opn(i),1) = ' ';
if status=2 then status = 0;
end; else status = 9;
end;
RETURN;
RUN;
data have;
input string $40.;
cards;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
;
run;
data want;
set have;
_string=translate(string,' ','()');
pid=prxparse('/\([^()]+\)/');
s=1;
e=length(string);
call prxnext(pid,s,e,string,p,l);
do while(p>0);
substr(_string,p,1)='(';
substr(_string,p+l-1,1)=')';
call prxnext(pid,s,e,string,p,l);
end;
keep string _string;
run;
(This (is)) (a) test)
@MINX, thank you for the challenge.
I have added your example and some more, and fixed the code.
I left in all my debugging statements for those who may be interested;
those lines start with /*DBG*/ putlog.
/***
https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
***/
%let maxl=30; /* max string length */
data have;
length string $&maxl;
input string $&maxl..;
cards;
How (are (you (my friend)))
(This (is)) (a) test)
(This (is)) ((a) test
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want;
set have;
/* level used as indicator to insert couple of parenthesis */
array opn {&maxl} op1-op&maxl; /* position of '(' */
array cls {&maxl} cl1-cl&maxl; /* position of ')' */
_org = string;
/*DBG*/ putlog _N_= string=;
/* status: */
/* 0=initial, */
/* 1=open, */
/* 2="(" follows "(", */
/* 3=")" follows ")", */
/* 9=exit */
status = 0; level=0; delta=0;
max_level = 0;
do until (status=9);
link scan;
end;
/* insert parenthesis back to string */
do j=1 to length(string);
if opn(j) ^= . and cls(j) ^= . then do;;
substr(string,opn(j),1) = '(';
substr(string,cls(j),1) = ')';
end; end;
/*DBG*/ putlog _N_= string=;
keep _org string;
RETURN;
SCAN:
lstr = length(string);
/*DBG*/ putlog _N_= 'length(string)=' lstr;
do j =1 to lstr;
if status=9 then leave;
ch = substr(string,j,1);
if ch = '(' then link opn; else
if ch = ')' then link cls;
if j ge length(string) then link end_scan;
end;
RETURN;
OPN:
level = max_level +1;
max_level = level;
opn(level)=j;
if status = 0 then status=1; else
if status = 1 then status=2;
delta=delta+1;
/*DBG*/ putlog j= ch= level= status= delta= op1= op2= op3= op4=;
ch = ' ';
RETURN;
CLS:
/*DBG*/ putlog 'CLS: ' ch= j= level= status= delta= max_level=;
if status = 0 then substr(string,j,1) = ' '; else
if delta=0 then do;
/*DBG*/ putlog 'CLS: 0';
level = max_level+1;
opn(level) = .;
cls(level) = j;
substr(string,j,1) = ' ';
return;
end;
if status=1 then do; * ) follows ( ;
cls(level) = j;
substr(string,opn(level),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
status=0;
end; else
if status=2 then do; * ) follows ( ( ;
cls(level) = j;
substr(string,opn(level),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
if delta > 0 then do; delta=delta-1; level=level-1; end;
if delta=0 then status=1;
end; else
if status=3 then do; * ) follows ) ;
level = level - delta;
if level=1 then status=2;
end;
/*DBG*/ putlog 'CLS: ' ch= j= level= status= delta= /
'>>>>> ' op1= op2= op3= op4= cl1= cl2= cl3= cl4= ;
RETURN;
END_SCAN:
do j=1 to lstr;
if opn(j) ne . and cls(j) = . then substr(string,opn(j),1) = ' ';
if cls(j) ne . and opn(j) = . then substr(string,cls(j),1) = ' ';
end;
status = 9;
RETURN;
RUN;
I have added some test cases in order to deal with nested parentheses.
The following code is tested and seems to do the job as requested:
/***
https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
***/
%let maxl=30; /* max string length */
data have;
length string $&maxl;
input string $&maxl..;
cards;
How (are (you (my friend)))
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want;
set have;
*IF _N_ in (1 5 8);
/* level used as indicator to insert couple of parenthesis */
array opn {&maxl} op1-op&maxl; /* position of '(' */
array cls {&maxl} cl1-cl&maxl; /* position of ')' */
_org = string;
/* status: */
/* 0=initial, */
/* 1=open, */
/* 2="(" follows "(", */
/* 3=")" follows ")", */
/* 9=exit */
status = 0; level=0; delta=0;
do until (status=9);
link scan;
end;
/* insert parenthesis back to string */
do j=1 to length(string);
if opn(j) ^= . and cls(j) ^= . then do;;
substr(string,opn(j),1) = '(';
substr(string,cls(j),1) = ')';
end; end;
keep _org string;
RETURN;
SCAN:
lstr = length(string);
do j =1 to lstr;
if status=9 then leave;
ch = substr(string,j,1);
if ch = '(' then link opn; else
if ch = ')' then link cls;
if j ge length(string) then link end_scan;
end;
RETURN;
OPN:
level+1;
opn(level)=j;
if status = 0 then status=1; else
if status = 1 then status=2;
delta=delta+1;
ch = ' ';
RETURN;
CLS:
putlog 'CLS: ' ch= j= level= status= delta=;
if status = 0 then substr(string,j,1) = ' '; else
if status=1 then do; * ) follows ( ;
cls(level) = j;
substr(string,opn(level),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
status=0;
end; else
if status=2 then do; * ) follows ( ( ;
cls(level) = j;
substr(string,opn(level),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
if delta > 0 then do; delta=delta-1; level=level-1; end;
if delta=0 then status=1;
end; else
if status=3 then do; * ) follows ) ;
level = level - delta;
if level=1 then status=2;
end;
RETURN;
END_SCAN:
do j=1 to lstr;
if opn(j) ne . and cls(j) = . then substr(string,opn(j),1) = ' ';
if cls(j) ne . and opn(j) = . then substr(string,cls(j),1) = ' ';
end;
status = 9;
RETURN;
RUN;
Sorry for the delay in my reply.
Most of these solutions work perfectly for my requirement.
But I shall go ahead with FreelanceReinhard's solution, as I was looking for a compact, regex-based solution.
Thanks again to all the gurus.
@NN, although @FreelanceReinh's logic is brilliant,
the regex did not produce the results I expected.
Here is the log from running the code, with a minor modification to display
both the original value and the resulting value:
73 data want_F;
74 set have(firstobs=5);
75 _org = string;
76 string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
77 putlog _N_= _org= / ' ' string=;
78 run;
_N_=1 _org=(This is (a) test)
  string=This is ()a( test
_N_=2 _org=(This is a test
  string=This is a test
_N_=3 _org=This is a test)
  string=This is a test
_N_=4 _org=(This (is) a test
  string=This ()is( a test
_N_=5 _org=(This is) a test)
  string=()This is( a test
_N_=6 _org=)This (is a test
  string=This is a test
_N_=7 _org=This (is) (a) test)
  string=This ()is( ()a( test
NOTE: There were 7 observations read from the data set WORK.HAVE.
I have coded @FreelanceReinh's logic as a SAS data step, and it is really much shorter than
my previous solution. Here is my new code based on this logic:
%let maxl=30; /* max string length */
data have;
length string $&maxl;
input string $&maxl..;
cards;
How (are (you (my friend)))
(This (is)) (a) test)
(This (is)) ((a) test
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want;
set have;
_org = string;
do until (status=9);
link scan;
end;
string = translate(string,'  ()','()'||"FAFB"x); /* '(' and ')' become blanks, 'FA'x becomes '(', 'FB'x becomes ')' */
put _N_= _org= / ' ' string=;
RETURN;
SCAN:
do i=1 to length(string);
status = 9;
if substr(string,i,1) = '(' then p = i;
if substr(string,i,1) = ')' and p>0
then do;
substr(string,p,1) = 'FA'x;
substr(string,i,1) = 'FB'x;
p=0;
status=0;
leave;
end;
end;
RETURN;
RUN;
The outcome copied from the log is:
_N_=1 _org=How (are (you (my friend)))
  string=How (are (you (my friend)))
_N_=2 _org=(This (is)) (a) test)
  string=(This (is)) (a) test
_N_=3 _org=(This (is)) ((a) test
  string=(This (is)) (a) test
_N_=4 _org=(This (is)) ((a) test).)
  string=(This (is)) ((a) test).
_N_=5 _org=(This is (a) test)
  string=(This is (a) test)
_N_=6 _org=(This is a test
  string=This is a test
_N_=7 _org=This is a test)
  string=This is a test
_N_=8 _org=(This (is) a test
  string=This (is) a test
_N_=9 _org=(This is) a test)
  string=(This is) a test
_N_=10 _org=)This (is a test
  string=This is a test
_N_=11 _org=This (is) (a) test)
  string=This (is) (a) test
NOTE: There were 11 observations read from the data set WORK.HAVE.
Hi @Shmuel,
I had tested the second code he shared (the one with the loop), and that seemed to work fine.
I ran the same code with the records you shared, and the result is below. Is there something that I am missing here?
23 data want;
24 set have;
25 _org=string;
26 do until(string=lag(string));
27 string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
28 end;
29 string=translate(compress(string,"()"),"()","þÿ");
30 put _N_= _org= / ' ' string=;
31 run;
_N_=1 _org=How (are (you (my friend)))
  string=How (are (you (my friend)))
_N_=2 _org=(This (is)) (a) test)
  string=(This (is)) (a) test
_N_=3 _org=(This (is)) ((a) test
  string=(This (is)) (a) test
_N_=4 _org=(This (is)) ((a) test).)
  string=(This (is)) ((a) test).
_N_=5 _org=(This is (a) test)
  string=(This is (a) test)
_N_=6 _org=(This is a test
  string=This is a test
_N_=7 _org=This is a test)
  string=This is a test
_N_=8 _org=(This (is) a test
  string=This (is) a test
_N_=9 _org=(This is) a test)
  string=(This is) a test
_N_=10 _org=)This (is a test
  string=This is a test
_N_=11 _org=This (is) (a) test)
  string=This (is) (a) test
Hi @Shmuel,
Thanks for translating my regex solution into a data step using more elementary functions. I haven't checked the "translation," but it might be helpful for later readers who find regular expressions too cryptic.
I have checked the log you posted and I cannot replicate the results you obtained. Here is my log using the code copied from yours and also using your HAVE dataset (code):
105 data want_F;
106 set have(firstobs=5);
107 _org = string;
108 string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
109 putlog _N_= _org= / ' ' string=;
110 run;
_N_=1 _org=(This is (a) test)
  string=This is (a) test
_N_=2 _org=(This is a test
  string=This is a test
_N_=3 _org=This is a test)
  string=This is a test
_N_=4 _org=(This (is) a test
  string=This (is) a test
_N_=5 _org=(This is) a test)
  string=(This is) a test
_N_=6 _org=)This (is a test
  string=This is a test
_N_=7 _org=This (is) (a) test)
  string=This (is) (a) test
NOTE: There were 7 observations read from the data set WORK.HAVE.
Not sure how the strange results on your computer emerged. (Perhaps some sort of encoding issue?) To investigate it, I would store the intermediate results (i.e., of PRXCHANGE alone, then of COMPRESS) in separate variables. As mentioned earlier, the code version which also handles nested pairs of parentheses is that with the DO-UNTIL loop, but this is not the point here.
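A diagnostic step along those lines might look like the sketch below. It assumes the same HAVE dataset with strings of at most 30 bytes; the variable names step1 to step3 and the length of 60 are chosen only for illustration. The $HEX format prints the raw bytes, so it becomes visible whether each placeholder character occupies one byte or more.

```sas
data debug;
  set have;
  length step1 step2 step3 $60;   /* assumed to be wide enough for the test strings */
  /* intermediate result 1: matched pairs replaced by the placeholder characters */
  step1 = prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/", -1, string);
  /* intermediate result 2: remaining (unmatched) parentheses removed */
  step2 = compress(step1, "()");
  /* final result: placeholders translated back to parentheses */
  step3 = translate(step2, "()", "þÿ");
  putlog _n_= / step1= $hex120. / step2= $hex120. / step3=;
run;
```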
@NN, it is really strange; maybe it is a result of differences between our environments.
I ran the second code and got the same issue as before.
The code and sample data:
%let maxl=30; /* max string length */
data have;
length string $&maxl;
input string $&maxl..;
cards;
How (are (you (my friend)))
(This (is)) (a) test)
(This (is)) ((a) test
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want_F;
set have;
_org = string;
do until(string=lag(string));
string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
end;
string=translate(compress(string,"()"),"()","þÿ");
putlog _N_= _org= / ' ' string=;
run;
The log with results:
122 data want_F;
123 set have;
124 _org = string;
125 do until(string=lag(string));
126 string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
127 end;
128 string=translate(compress(string,"()"),"()","þÿ");
129 putlog _N_= _org= / ' ' string=;
130 run;
_N_=1 _org=How (are (you (my friend)))
  string=How are ()you ()my friend( (
_N_=2 _org=(This (is)) (a) test)
  string=()This ()is( ( ()a( test
_N_=3 _org=(This (is)) ((a) test
  string=()This ()is( ( ()a( test
_N_=4 _org=(This (is)) ((a) test).)
  string=()This ()is( ( ()()a( test(
_N_=5 _org=(This is (a) test)
  string=()This is ()a( test(
_N_=6 _org=(This is a test
  string=This is a test
_N_=7 _org=This is a test)
  string=This is a test
_N_=8 _org=(This (is) a test
  string=This ()is( a test
_N_=9 _org=(This is) a test)
  string=()This is( a test
_N_=10 _org=)This (is a test
  string=This is a test
_N_=11 _org=This (is) (a) test)
  string=This ()is( ()a( test
NOTE: There were 11 observations read from the data set WORK.HAVE.
So, if it works fine in your environment, ignore the issue I have.
I'm using SAS UE with Oracle VM VirtualBox on Windows 10.
I have no explanation for the issue.
While @FreelanceReinh chose 'þ'='FE'x and 'ÿ'='FF'x,
those characters behave differently in my environment, which uses UTF-8. There, each of them occupies two bytes rather than one:
73 data a;
74 x="þÿ";
75 putlog x= $hex.;
76 run;
x=C3BEC3BF
This probably explains my issue.
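For readers on a UTF-8 session, the garbled output is consistent with TRANSLATE working byte by byte: with "þÿ" stored as C3 BE C3 BF, the byte C3 maps to "(", BE maps to ")" and BF maps to a blank, which would turn each restored pair into "()...( ". One possible workaround, sketched here and not tested in this thread, is to use placeholders that are a single byte in every common encoding. The step below assumes that the control characters '01'x and '02'x never occur in the data; the names _find and _swap are made up for illustration.

```sas
data want;
  set have;
  retain _find _swap;
  if _n_ = 1 then do;
    /* innermost pair: "(" + text without parentheses + ")" */
    _find = prxparse('/\([^()]*\)/');
    /* same pattern as a substitution, with '01'x and '02'x as the placeholders */
    _swap = prxparse(cats('s/\(([^()]*)\)/', '01'x, '$1', '02'x, '/'));
  end;
  /* repeat so that nested pairs are handled from the inside out */
  do while (prxmatch(_find, string) > 0);
    string = prxchange(_swap, -1, string);
  end;
  /* remove what is still a parenthesis, then restore the matched ones */
  string = translate(compress(string, '()'), '()', '0102'x);
  drop _find _swap;
run;
```

Because every replacement swaps one byte for one byte, the string cannot grow, and the loop ends once no "(...)" pair without inner parentheses is left.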