Hi All,
Is there a quick way to remove unbalanced parentheses/brackets from a string?
For example, in the table below, can I get the column "WANT" as the result, with just the matched parentheses retained?
HAVE | WANT |
(This is a test) | (This is a test) |
(This is a test | This is a test |
This is a test) | This is a test |
(This (is) a test | This (is) a test |
(This is) a test) | (This is) a test |
)This (is a test | This is a test |
Hi @NN,
Quick way? Yes, if (like in your examples) there are no nested pairs of parentheses and you can specify two characters which are not present in the strings to work with.
data want;
set have;
string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
run;
The PRXCHANGE function replaces the parentheses in substrings of the form "(text without parentheses)" by replacement characters (I chose 'þ'='FE'x and 'ÿ'='FF'x, but you may want to use others). The remaining, hence unmatched parentheses are removed by the COMPRESS function. Finally, the TRANSLATE function restores the replaced parentheses.
If nested pairs of parentheses may occur in string, apply PRXCHANGE repetitively:
data want;
set have;
do until(string=lag(string));
string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
end;
string=translate(compress(string,"()"),"()","þÿ");
run;
Of course, this implies rules to decide, e.g., which of the three parentheses in your fourth and fifth example is regarded as unmatched.
@NN wrote:
Is there a quick way to remove unbalanced parentheses /brackets from a string ?
Quick way? No.
You'd have to define a very clear and comprehensive set of rules and then program them up.
It is relatively easy to determine when a mismatch occurs. Consider:
data junk;
x = "(This is) a test)";
left = countc(x,"(");
right= countc(x,")");
run;
If Left is not equal to right then there is a mismatch.
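Note that equal counts do not guarantee balance: in ")a(" both counts are 1, yet neither parenthesis has a partner. Below is a minimal sketch of a stricter check that keeps a running nesting depth. The dataset name check and the variables depth and balanced are made up for illustration and are not part of the original post.

```sas
data check;
  length x $20 ch $1;
  /* three illustrative values: balanced, equal counts but unbalanced, unequal counts */
  do x = "(a)(b)", ")a(", "((a)";
    left  = countc(x, "(");
    right = countc(x, ")");
    depth = 0;      /* number of "(" currently open */
    balanced = 1;   /* set to 0 as soon as a violation is seen */
    do j = 1 to length(x);
      ch = substr(x, j, 1);
      if ch = "(" then depth = depth + 1;
      else if ch = ")" then do;
        depth = depth - 1;
        /* a ")" with no "(" open before it */
        if depth < 0 then balanced = 0;
      end;
    end;
    /* some "(" was never closed (or a ")" was never opened) */
    if depth ne 0 then balanced = 0;
    output;
  end;
  drop j ch;
run;
```

For ")a(" this gives left=right=1 but balanced=0, which the count comparison alone would miss.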
But determining which parenthesis is the unmatched one means parsing the value level by level, and that is a good part of what writing a syntax parser for a programming language involves. For just one example of what goes on:
https://www.cs.iusb.edu/~dvrajito/teach/c311/c311_11_parser.html
The "trivial" mismatch case of exactly one of the parentheses appearing is easy enough.
data junk;
x = "(This is a test";
left = countc(x,"(");
right= countc(x,")");
if left ne right and max(left, right) = 1 then do;
if left then y=compress(x,'(');
else y=compress(x,')');
end;
run;
But as soon as you get into your other examples, you have to parse a value character by character, keeping track of the positions and order of the ( and ) characters. Only after you have examined the whole value can you start applying rules, likely from the "middle" of something outwards.
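One common way to do that character-by-character bookkeeping is a stack of open-parenthesis positions. The following is only a sketch under assumptions, not a tested production solution: strings are at most 40 bytes long, and the names want_stack, stack, skip and result are invented here. A ")" that arrives while the stack is empty is marked for removal, and any "(" still on the stack at the end is marked as well.

```sas
data want_stack;
  length string result $40 ch $1;
  array stack {40} _temporary_;  /* positions of "(" still waiting for a ")" */
  array skip  {40} _temporary_;  /* 1 = drop the character at this position  */
  input string $40.;
  top = 0;
  do j = 1 to dim(skip);
    skip{j} = 0;
  end;
  /* pass 1: pair up parentheses and mark the unmatched ones */
  do j = 1 to length(string);
    ch = substr(string, j, 1);
    if ch = '(' then do;
      top = top + 1;
      stack{top} = j;
    end;
    else if ch = ')' then do;
      if top > 0 then top = top - 1;  /* closes the most recent "(" */
      else skip{j} = 1;               /* nothing open: unmatched ")" */
    end;
  end;
  /* whatever is left on the stack never got closed */
  do k = 1 to top;
    skip{stack{k}} = 1;
  end;
  /* pass 2: copy every character that is not marked */
  result = ' ';
  n = 0;
  do j = 1 to length(string);
    if not skip{j} then do;
      n = n + 1;
      substr(result, n, 1) = substr(string, j, 1);
    end;
  end;
  keep string result;
  datalines;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
;
run;
```

Like the COMPRESS-based solution, this deletes the unmatched characters rather than blanking them, and it needs no placeholder characters. Which parenthesis counts as "unmatched" follows the usual innermost-pair rule, which is one possible rule set, not the only one.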
Here is tested code:
/***
https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
input - a string
target - remove unbalanced ()
algorithm:
scan the string
keep positions of '(' followed by ')' and remove them
repeat until found:
- ( without ) and remove the (
- ) preceding ( and remove the )
- none found
insert couples of () back into their positions
***/
%let maxl=30; /* max string length */
data have;
length string $&maxl; /* adapt to max length */
input string $&maxl..;
cards;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want;
set have;
/* use i as indicator to insert couple of parenthesis */
array opn {&maxl} op1-op&maxl; /* position of '(' */
array cls {&maxl} cl1-cl&maxl; /* position of ')' */
status = 0; /* 0=initial, 1=open found, 2=match found, 9=exit */
i=1;
do until (status=9);
link scan;
end;
/* insert parenthesis back to string */
if i>1 then do j=1 to i-1;
substr(string,opn(j),1) = '(';
substr(string,cls(j),1) = ')';
end;
keep string;
RETURN;
SCAN:
do j =1 to length(string);
ch = substr(string,j,1);
if status = 0 and ch=')' then substr(string,j,1) = ' '; else
if ch = '(' then do;
status=1; opn(i)=j;
end;
if status = 1 and ch = ')' then do;
status = 2; cls(i) = j;
substr(string,opn(i),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
i+1;
end;
if j < length(string) then do;
if status=1 then substr(string,opn(i),1) = ' ';
if status=2 then status = 0;
end; else status = 9;
end;
RETURN;
RUN;
data have;
input string $40.;
cards;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
;
run;
data want;
set have;
_string=translate(string,' ','()');
pid=prxparse('/\([^()]+\)/');
s=1;e=length(string);
call prxnext(pid,s,e,string,p,l);
do while(p>0);
substr(_string,p,1)='(';
substr(_string,p+l-1,1)=')';
call prxnext(pid,s,e,string,p,l);
end;
keep string _string;
run;
@MINX thank you for the challenge.
I have added your example and some more, and fixed the code.
I left in all my debugging statements for those who may be interested;
those lines start with /*DBG*/ putlog
/***
https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
***/
%let maxl=30; /* max string length */
data have;
length string $&maxl;
input string $&maxl..;
cards;
How (are (you (my friend)))
(This (is)) (a) test)
(This (is)) ((a) test
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want;
set have;
/* level used as indicator to insert couple of parenthesis */
array opn {&maxl} op1-op&maxl; /* position of '(' */
array cls {&maxl} cl1-cl&maxl; /* position of ')' */
_org = string;
/*DBG*/ putlog _N_= string=;
/* status: */
/* 0=initial, */
/* 1=open, */
/* 2="(" follows "(", */
/* 3=")" follows "(", */
/* 9=exit */
status = 0; level=0; delta=0;
max_level = 0;
do until (status=9);
link scan;
end;
/* insert parenthesis back to string */
do j=1 to length(string);
if opn(j) ^= . and cls(j) ^= . then do;;
substr(string,opn(j),1) = '(';
substr(string,cls(j),1) = ')';
end; end;
/*DBG*/ putlog _N_= string=;
keep _org string;
RETURN;
SCAN:
lstr = length(string);
/*DBG*/ putlog _N_= 'length(string)=' lstr;
do j =1 to lstr;
if status=9 then leave;
ch = substr(string,j,1);
if ch = '(' then link opn; else
if ch = ')' then link cls;
if j ge length(string) then link end_scan;
end;
RETURN;
OPN:
level = max_level +1;
max_level = level;
opn(level)=j;
if status = 0 then status=1; else
if status = 1 then status=2;
delta=delta+1;
/*DBG*/ putlog j= ch= level= status= delta= op1= op2= op3= op4=;
ch = ' ';
RETURN;
CLS:
/*DBG*/ putlog 'CLS: ' ch= j= level= status= delta= max_level=;
if status = 0 then substr(string,j,1) = ' '; else
if delta=0 then do;
/*DBG*/ putlog 'CLS: 0';
level = max_level+1;
opn(level) = .;
cls(level) = j;
substr(string,j,1) = ' ';
return;
end;
if status=1 then do; * ) follows ( ;
cls(level) = j;
substr(string,opn(level),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
status=0;
end; else
if status=2 then do; * ) follows ( ( ;
cls(level) = j;
substr(string,opn(level),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
if delta > 0 then do; delta=delta-1; level=level-1; end;
if delta=0 then status=1;
end; else
if status=3 then do; * ) follows ) ;
level = level - delta;
if level=1 then status=2;
end;
/*DBG*/ putlog 'CLS: ' ch= j= level= status= delta= /
'>>>>> ' op1= op2= op3= op4= cl1= cl2= cl3= cl4= ;
RETURN;
END_SCAN:
do j=1 to lstr;
if opn(j) ne . and cls(j) = . then substr(string,opn(j),1) = ' ';
if cls(j) ne . and opn(j) = . then substr(string,cls(j),1) = ' ';
end;
status = 9;
RETURN;
RUN;
I have added some test cases to deal with nested parentheses.
The following code is tested and seems to do the job as requested:
/***
https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
***/
%let maxl=30; /* max string length */
data have;
length string $&maxl;
input string $&maxl..;
cards;
How (are (you (my friend)))
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want;
set have;
*IF _N_ in (1 5 8);
/* level used as indicator to insert couple of parenthesis */
array opn {&maxl} op1-op&maxl; /* position of '(' */
array cls {&maxl} cl1-cl&maxl; /* position of ')' */
_org = string;
/* status: */
/* 0=initial, */
/* 1=open, */
/* 2="(" follows "(", */
/* 3=")" follows "(", */
/* 9=exit */
status = 0; level=0; delta=0;
do until (status=9);
link scan;
end;
/* insert parenthesis back to string */
do j=1 to length(string);
if opn(j) ^= . and cls(j) ^= . then do;;
substr(string,opn(j),1) = '(';
substr(string,cls(j),1) = ')';
end; end;
keep _org string;
RETURN;
SCAN:
lstr = length(string);
do j =1 to lstr;
if status=9 then leave;
ch = substr(string,j,1);
if ch = '(' then link opn; else
if ch = ')' then link cls;
if j ge length(string) then link end_scan;
end;
RETURN;
OPN:
level+1;
opn(level)=j;
if status = 0 then status=1; else
if status = 1 then status=2;
delta=delta+1;
ch = ' ';
RETURN;
CLS:
putlog 'CLS: ' ch= j= level= status= delta=;
if status = 0 then substr(string,j,1) = ' '; else
if status=1 then do; * ) follows ( ;
cls(level) = j;
substr(string,opn(level),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
status=0;
end; else
if status=2 then do; * ) follows ( ( ;
cls(level) = j;
substr(string,opn(level),1) = ' '; /* remove ( */
substr(string,j,1) = ' '; /* remove ) */
if delta > 0 then do; delta=delta-1; level=level-1; end;
if delta=0 then status=1;
end; else
if status=3 then do; * ) follows ) ;
level = level - delta;
if level=1 then status=2;
end;
RETURN;
END_SCAN:
do j=1 to lstr;
if opn(j) ne . and cls(j) = . then substr(string,opn(j),1) = ' ';
if cls(j) ne . and opn(j) = . then substr(string,cls(j),1) = ' ';
end;
status = 9;
RETURN;
RUN;
@NN, though @FreelanceReinh's logic is brilliant,
the regex did not, as far as I can tell, produce the expected result.
Here is the log from running the code with a minor modification to display
both the original value and the resulting value:
73 data want_F;
74 set have(firstobs=5);
75 _org = string;
76 string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
77 putlog _N_= _org= / ' ' string=;
78 run;
_N_=1 _org=(This is (a) test)
  string=This is ()a( test
_N_=2 _org=(This is a test
  string=This is a test
_N_=3 _org=This is a test)
  string=This is a test
_N_=4 _org=(This (is) a test
  string=This ()is( a test
_N_=5 _org=(This is) a test)
  string=()This is( a test
_N_=6 _org=)This (is a test
  string=This is a test
_N_=7 _org=This (is) (a) test)
  string=This ()is( ()a( test
NOTE: There were 7 observations read from the data set WORK.HAVE.
I have coded @FreelanceReinh's logic as a SAS data step, and it is really much shorter than
my previous solution. Here is my new code based on this logic:
%let maxl=30; /* max string length */
data have;
length string $&maxl;
input string $&maxl..;
cards;
How (are (you (my friend)))
(This (is)) (a) test)
(This (is)) ((a) test
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want;
set have;
_org = string;
do until (status=9);
link scan;
end;
string = translate(string,' ()','()'||"FAFB"x);
put _N_= _org= / ' ' string=;
RETURN;
SCAN:
do i=1 to length(string);
status = 9;
if substr(string,i,1) = '(' then p = i;
if substr(string,i,1) = ')' and p>0
then do;
substr(string,p,1) = 'FA'x;
substr(string,i,1) = 'FB'x;
p=0;
status=0;
leave;
end;
end;
RETURN;
RUN;
the outcome copied from the log is:
_N_=1 _org=How (are (you (my friend)))
  string=How (are (you (my friend)))
_N_=2 _org=(This (is)) (a) test)
  string=(This (is)) (a) test
_N_=3 _org=(This (is)) ((a) test
  string=(This (is)) (a) test
_N_=4 _org=(This (is)) ((a) test).)
  string=(This (is)) ((a) test).
_N_=5 _org=(This is (a) test)
  string=(This is (a) test)
_N_=6 _org=(This is a test
  string=This is a test
_N_=7 _org=This is a test)
  string=This is a test
_N_=8 _org=(This (is) a test
  string=This (is) a test
_N_=9 _org=(This is) a test)
  string=(This is) a test
_N_=10 _org=)This (is a test
  string=This is a test
_N_=11 _org=This (is) (a) test)
  string=This (is) (a) test
NOTE: There were 11 observations read from the data set WORK.HAVE.
Hi @Shmuel ,
I had tested the second code he shared (the one with the loop), and that seemed to work fine.
I ran it with the records you shared, and the result is below. Is there something that I am missing here?
23 data want;
24 set have;
25 _org=string;
26 do until(string=lag(string));
27 string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
28 end;
29 string=translate(compress(string,"()"),"()","þÿ");
30 put _N_= _org= / ' ' string=;
31 run;
_N_=1 _org=How (are (you (my friend)))
  string=How (are (you (my friend)))
_N_=2 _org=(This (is)) (a) test)
  string=(This (is)) (a) test
_N_=3 _org=(This (is)) ((a) test
  string=(This (is)) (a) test
_N_=4 _org=(This (is)) ((a) test).)
  string=(This (is)) ((a) test).
_N_=5 _org=(This is (a) test)
  string=(This is (a) test)
_N_=6 _org=(This is a test
  string=This is a test
_N_=7 _org=This is a test)
  string=This is a test
_N_=8 _org=(This (is) a test
  string=This (is) a test
_N_=9 _org=(This is) a test)
  string=(This is) a test
_N_=10 _org=)This (is a test
  string=This is a test
_N_=11 _org=This (is) (a) test)
  string=This (is) (a) test
Hi @Shmuel,
Thanks for translating my regex solution into a data step using more elementary functions. I haven't checked the "translation," but it might be helpful for later readers who find regular expressions too cryptic.
I have checked the log you posted and I cannot replicate the results you obtained. Here is my log using the code copied from yours and also using your HAVE dataset (code):
105 data want_F;
106 set have(firstobs=5);
107 _org = string;
108 string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
109 putlog _N_= _org= / ' ' string=;
110 run;
_N_=1 _org=(This is (a) test)
  string=This is (a) test
_N_=2 _org=(This is a test
  string=This is a test
_N_=3 _org=This is a test)
  string=This is a test
_N_=4 _org=(This (is) a test
  string=This (is) a test
_N_=5 _org=(This is) a test)
  string=(This is) a test
_N_=6 _org=)This (is a test
  string=This is a test
_N_=7 _org=This (is) (a) test)
  string=This (is) (a) test
NOTE: There were 7 observations read from the data set WORK.HAVE.
Not sure how the strange results on your computer emerged. (Perhaps some sort of encoding issue?) To investigate it, I would store the intermediate results (i.e., of PRXCHANGE alone, then of COMPRESS) in separate variables. As mentioned earlier, the code version which also handles nested pairs of parentheses is that with the DO-UNTIL loop, but this is not the point here.
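A minimal sketch of that investigation, assuming a single test value. The dataset name debug_steps and the variables step1 to step3 are only illustrative. The $HEX output shows whether each placeholder occupies one byte (FE, FF) or more than one, and &SYSENCODING reports the session encoding.

```sas
/* show the encoding of the current SAS session in the log */
%put &=sysencoding;

data debug_steps;
  length string step1 step2 step3 $60;
  string = "(This is) a test)";
  /* step 1: PRXCHANGE alone (matched pairs become placeholders) */
  step1 = prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/", -1, string);
  /* step 2: COMPRESS removes the parentheses that are still present */
  step2 = compress(step1, "()");
  /* step 3: TRANSLATE maps the placeholders back to parentheses */
  step3 = translate(step2, "()", "þÿ");
  putlog string= / step1= / step2= / step3=;
  /* byte-level view of the intermediate result */
  putlog step1= $hex60.;
run;
```

If the hex dump shows two bytes per placeholder, the byte-wise replacement in TRANSLATE can no longer map one placeholder to one parenthesis, which would be consistent with scrambled output.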
@NN, it is really strange; maybe it is a result of environment differences.
I have run the second code and got the same issue as before.
The code and sample data:
%let maxl=30; /* max string length */
data have;
length string $&maxl;
input string $&maxl..;
cards;
How (are (you (my friend)))
(This (is)) (a) test)
(This (is)) ((a) test
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want_F;
set have;
_org = string;
do until(string=lag(string));
string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
end;
string=translate(compress(string,"()"),"()","þÿ");
putlog _N_= _org= / ' ' string=;
run;
the log with results -
122 data want_F;
123 set have;
124 _org = string;
125 do until(string=lag(string));
126 string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
127 end;
128 string=translate(compress(string,"()"),"()","þÿ");
129 putlog _N_= _org= / ' ' string=;
130 run;
_N_=1 _org=How (are (you (my friend)))
  string=How are ()you ()my friend( (
_N_=2 _org=(This (is)) (a) test)
  string=()This ()is( ( ()a( test
_N_=3 _org=(This (is)) ((a) test
  string=()This ()is( ( ()a( test
_N_=4 _org=(This (is)) ((a) test).)
  string=()This ()is( ( ()()a( test(
_N_=5 _org=(This is (a) test)
  string=()This is ()a( test(
_N_=6 _org=(This is a test
  string=This is a test
_N_=7 _org=This is a test)
  string=This is a test
_N_=8 _org=(This (is) a test
  string=This ()is( a test
_N_=9 _org=(This is) a test)
  string=()This is( a test
_N_=10 _org=)This (is a test
  string=This is a test
_N_=11 _org=This (is) (a) test)
  string=This ()is( ()a( test
NOTE: There were 11 observations read from the data set WORK.HAVE.
So, if it works fine in your environment, ignore the issue I have.
I'm using SAS UE with Oracle VM VirtualBox on Windows 10.
I have no explanation for the issue.
While @FreelanceReinh chose 'þ'='FE'x and 'ÿ'='FF'x,
those characters behave differently in my environment, which uses UTF-8:
73 data a;
74 x="þÿ";
75 putlog x= $hex.;
76 run;
x=C3BEC3BF
and this probably explains my issue.
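In a UTF-8 session each of 'þ' and 'ÿ' takes two bytes, so a byte-wise TRANSLATE cannot map them one-to-one onto "(" and ")". A possible workaround, sketched here under the assumption that the control characters '01'x and '02'x never occur in the data, is to use those single-byte characters as placeholders. The dataset name want_safe and the variable _prev are made up, and the explicit "repeat until unchanged" loop is my substitution for the LAG-based condition, not the original author's code.

```sas
data have;
  input string $40.;
  datalines;
How (are (you (my friend)))
(This (is)) ((a) test).)
)This (is a test
;
run;

data want_safe;
  set have;
  length _prev $40;
  /* repeat until a pass changes nothing, so nested pairs are marked too */
  do until (string = _prev);
    _prev = string;
    /* the regex is assembled with CATS so that the placeholders   */
    /* are the literal bytes '01'x and '02'x, one byte each, which */
    /* is the same in single-byte encodings and in UTF-8           */
    string = prxchange(cats('s/\(([^()]*)\)/', '01'x, '$1', '02'x, '/'),
                       -1, string);
  end;
  /* drop the unmatched parentheses, then restore the matched ones */
  string = translate(compress(string, '()'), '()', '0102'x);
  drop _prev;
run;
```

For the three sample rows this should give the same values as the results posted earlier in the thread for those records, but I would still verify it in the affected environment.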