NN
Quartz | Level 8

Hi All,

Is there a quick way to remove unbalanced parentheses/brackets from a string?

For example, in the table below, can I get the column "want" as the result, with just the matched parentheses retained?

HAVE               | WANT
(This is a test)   | (This is a test)
(This is a test    | This is a test
This is a test)    | This is a test
(This (is) a test  | This (is) a test
(This is) a test)  | (This is) a test
)This (is a test   | This is a test
ACCEPTED SOLUTION
FreelanceReinh
Jade | Level 19

Hi @NN,

 

Quick way? Yes, if (like in your examples) there are no nested pairs of parentheses and you can specify two characters which are not present in the strings to work with.

data want;
set have;
string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
run;

The PRXCHANGE function replaces the parentheses in substrings of the form "(text without parentheses)" by replacement characters (I chose 'þ'='FE'x and 'ÿ'='FF'x, but you may want to use others). The remaining, hence unmatched parentheses are removed by the COMPRESS function. Finally, the TRANSLATE function restores the replaced parentheses.

 

If nested pairs of parentheses may occur in string, apply PRXCHANGE repeatedly until the string no longer changes:

data want;
set have;
do until(string=lag(string));
  string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
end;
string=translate(compress(string,"()"),"()","þÿ");
run;

Of course, this implies rules to decide, e.g., which of the three parentheses in your fourth and fifth example is regarded as unmatched.


19 REPLIES
PaigeMiller
Diamond | Level 26

@NN wrote:

Is there a quick way to remove unbalanced parentheses/brackets from a string?


Quick way? No.

 

You'd have to define a very clear and comprehensive set of rules and then program them up.

--
Paige Miller
ballardw
Super User

It is relatively easy to determine when a mismatch occurs. Consider:

data junk;
   x = "(This is) a test)";
   left = countc(x,"(");
   right= countc(x,")");
run;

If Left is not equal to right then there is a mismatch.
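
Note that equal counts alone do not guarantee balance: ")This (is a test" has one of each, yet neither is matched. A minimal sketch of a stricter check that tracks a running depth from left to right; the dataset name CHECK and the variables DEPTH and BALANCED are made up for illustration:

```sas
data check;
   length x $ 30;
   do x = "(This is) a test)", ")This (is a test", "(This (is) a test)";
      depth = 0;       /* number of "(" currently open */
      balanced = 1;    /* assume balanced until shown otherwise */
      do j = 1 to length(x);
         ch = substr(x, j, 1);
         if ch = '(' then depth = depth + 1;
         else if ch = ')' then do;
            depth = depth - 1;
            if depth < 0 then balanced = 0;   /* ")" with no open "(" before it */
         end;
      end;
      if depth ne 0 then balanced = 0;        /* at least one "(" never closed */
      output;
   end;
   drop j ch depth;
run;
```

With this check only the third value should be flagged as balanced, even though the second one has equal counts.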

But working out which parentheses actually match amounts to a good part of writing a syntax parser for a programming language. For just one example of what is involved:

https://www.cs.iusb.edu/~dvrajito/teach/c311/c311_11_parser.html

 

The "trivial" mismatch case of exactly one of the parentheses appearing is easy enough.

data junk;
   x = "(This is a test";
   left = countc(x,"(");
   right= countc(x,")");
   if left ne right and max(left, right) = 1 then do;
      if left then y=compress(x,'(');
      else y=compress(x,')');
   end;
run;

But as soon as you get into your other examples, you have to parse the value character by character, keeping track of the positions and order of the ( and ) characters. Only after you have examined the whole value can you start applying rules, likely working from the "middle" of something outwards.
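
One possible rule set, sketched under assumptions: every ")" closes the nearest "(" that is still open, and anything left without a partner is removed. The sketch assumes a dataset HAVE with a character variable STRING of at most &maxl characters, and that the byte '01'x never occurs in the data; the array STACK and the variables TOP, J, K and CH are illustrative names.

```sas
%let maxl = 30;   /* assumed maximum string length */

data want;
   set have;
   array stack {&maxl} _temporary_;   /* positions of "(" still waiting for a ")" */
   _org = string;                     /* keep the original value for comparison */
   top = 0;
   do j = 1 to length(string);
      ch = substr(string, j, 1);
      if ch = '(' then do;
         top = top + 1;
         stack{top} = j;
      end;
      else if ch = ')' then do;
         if top > 0 then top = top - 1;        /* closes the nearest open "(" */
         else substr(string, j, 1) = '01'x;    /* nothing to close: mark for removal */
      end;
   end;
   do k = 1 to top;                            /* "(" that were never closed */
      substr(string, stack{k}, 1) = '01'x;
   end;
   string = compress(string, '01'x);           /* drop all marked characters */
   drop j k ch top;
run;
```

Traced by hand on the six examples in the question, this rule gives the WANT column. Other rules are conceivable, for instance keeping the outermost "(" in "(This (is) a test" instead of the inner one.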

Shmuel
Garnet | Level 18

Here is tested code:

/***
  https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
  
  input - a string 
  target - remove unbalanced ()
  algorithm:
     scan the string 
	 keep positions of '(' followed by ')' and remove them
	 repeat until found:
	   - ( without ) and remove the (
	   - ) preceding (  and remove the )
	   - none found
	 insert pairs of () back into their positions
***/

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  /* adapt to max length */
   input string $&maxl..;	 
cards;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;

data want;
 set have;
     /* use i as indicator to insert couple of parenthesis */
     array opn {&maxl} op1-op&maxl;  /* position of '(' */
	 array cls {&maxl} cl1-cl&maxl;  /* position of ')' */
	 
	 status = 0; /* 0=initial, 1=open found, 2=match found, 9=exit */
	 i=1;
	 do until (status=9);
	    link scan;
	 end;
	 /* insert parenthesis back to string */
	 if i>1 then do j=1 to i-1;
	    substr(string,opn(j),1) = '(';   
	    substr(string,cls(j),1) = ')';   
	 end;
	 keep string;
RETURN;
SCAN:
	 do j =1 to length(string);
	    ch = substr(string,j,1);
		if status = 0 and ch=')' then substr(string,j,1) = ' '; else
		if ch = '(' then do;
		   status=1; opn(i)=j;
		end;
		if status = 1 and ch = ')' then do;
		   status = 2; cls(i) = j;
		   substr(string,opn(i),1) = ' '; /* remove ( */
		   substr(string,j,1) = ' ';      /* remove ) */
		   i+1;
		 end;
		if j < length(string) then do;
		   if status=1 then substr(string,opn(i),1) = ' '; 
		   if status=2 then status = 0; 
		end; else status = 9;
	end;
RETURN;
RUN;

 

Ksharp
Super User
data have;
   input string $40.;	 
cards;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;

data want;
 set have;
 _string=translate(string,'  ','()');
 pid=prxparse('/\([^()]+\)/');
 s=1;e=length(string);
 call prxnext(pid,s,e,string,p,l);
 do while(p>0);
   substr(_string,p,1)='(';
   substr(_string,p+l-1,1)=')';
   call prxnext(pid,s,e,string,p,l);
 end;
keep string _string;
run;
MINX
Obsidian | Level 7
Nice code. However, it does not seem to work for outer parentheses that contain nested parentheses. For example:
(This (is)) (a) test)
Shmuel
Garnet | Level 18

@MINX thank you for the challenge.

I have added your example and some more, and fixed the code.

I left all my debugging statements in for those who may be interested; those lines start with /*DBG*/ putlog.

/***
  https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
***/

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  
   input string $&maxl..;	 
cards;
How (are (you (my friend)))
(This (is)) (a) test) 
(This (is)) ((a) test 
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;


data want;
 set have;
     /* level used as indicator to insert couple of parenthesis */
     array opn {&maxl} op1-op&maxl;  /* position of '(' */
	 array cls {&maxl} cl1-cl&maxl;  /* position of ')' */
	 _org = string; 
/*DBG*/ putlog _N_= string=;
 
/*   status:                */
/*      0=initial,          */
/*      1=open,             */
/*      2="(" follows "(",  */
/*      3=")" follows "(",  */
/*      9=exit              */

	 status = 0; level=0; delta=0;
	 max_level = 0;
	 do until (status=9);
	    link scan;
	 end;
	 /* insert parenthesis back to string */
	 do j=1 to length(string);
	    if opn(j) ^= . and cls(j) ^= . then do;;
	    substr(string,opn(j),1) = '(';   
	    substr(string,cls(j),1) = ')';   
	 end; end;
/*DBG*/ putlog _N_= string=;
	 keep _org string;
RETURN;
SCAN:
     lstr = length(string); 
/*DBG*/ putlog _N_= 'length(string)=' lstr;
	 do j =1 to lstr;
	    if status=9 then leave;
	    ch = substr(string,j,1);
	    if ch = '(' then link opn; else
	    if ch = ')' then link cls; 
    	if j ge length(string) then link end_scan;
	 end;
	 
RETURN;
OPN:
     level = max_level +1;
	 max_level = level;
     opn(level)=j;
     if status = 0 then status=1; else
     if status = 1 then status=2; 
     delta=delta+1; 
/*DBG*/ putlog j= ch= level= status= delta= op1= op2= op3= op4=;
     ch = ' ';
RETURN;
CLS:
/*DBG*/ putlog 'CLS: ' ch= j= level= status= delta= max_level=;
	if status = 0 then substr(string,j,1) = ' '; else
    if delta=0 then do; 
/*DBG*/ putlog 'CLS: 0';
       level = max_level+1;
       opn(level) = .; 
       cls(level) = j;  
       substr(string,j,1) = ' ';
       return; 
    end;
	if status=1 then do; * ) follows ( ;
	   cls(level) = j;
	   substr(string,opn(level),1) = ' '; /* remove ( */
	   substr(string,j,1) = ' ';          /* remove ) */
	   status=0;
	end; else
	if status=2 then do; * ) follows ( ( ;
	   cls(level) = j; 
	   substr(string,opn(level),1) = ' '; /* remove ( */
	   substr(string,j,1) = ' ';          /* remove ) */
	   if delta > 0 then do; delta=delta-1; level=level-1; end;
	   if delta=0 then status=1;
	end; else
	if status=3 then do;  * ) follows ) ;
	   level = level - delta;
	   if level=1 then status=2;
	end;
/*DBG*/ putlog 'CLS: ' ch= j= level= status= delta= /
         '>>>>> ' op1= op2= op3= op4= cl1= cl2= cl3= cl4= ;
RETURN;
END_SCAN:
    do j=1 to lstr;
       if opn(j) ne . and cls(j) = . then substr(string,opn(j),1) = ' ';
       if cls(j) ne . and opn(j) = . then substr(string,cls(j),1) = ' ';
    end;
    status = 9;
RETURN;
RUN;
	 

 

 

Shmuel
Garnet | Level 18

I have added some test cases in order to deal with nested parentheses.

The following code is tested and seems to do the job as requested:

/***
  https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
***/

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  
   input string $&maxl..;	 
cards;
How (are (you (my friend)))
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;


data want;
 set have;
  *IF _N_ in (1 5 8);
     /* level used as indicator to insert couple of parenthesis */
     array opn {&maxl} op1-op&maxl;  /* position of '(' */
	 array cls {&maxl} cl1-cl&maxl;  /* position of ')' */
	 _org = string;
 
/*   status:                */
/*      0=initial,          */
/*      1=open,             */
/*      2="(" follows "(",  */
/*      3=")" follows "(",  */
/*      9=exit              */

	 status = 0; level=0; delta=0;
	 do until (status=9);
	    link scan;
	 end;
	 /* insert parenthesis back to string */
	 do j=1 to length(string);
	    if opn(j) ^= . and cls(j) ^= . then do;;
	    substr(string,opn(j),1) = '(';   
	    substr(string,cls(j),1) = ')';   
	 end; end;
	 keep _org string;
RETURN;
SCAN:
     lstr = length(string); 
	 do j =1 to lstr;
	    if status=9 then leave;
	    ch = substr(string,j,1);
	    if ch = '(' then link opn; else
	    if ch = ')' then link cls; 
    	if j ge length(string) then link end_scan;
	 end;
	 
RETURN;
OPN:
     level+1;
     opn(level)=j;
     if status = 0 then status=1; else
     if status = 1 then status=2; 
     delta=delta+1; 
     ch = ' ';
RETURN;
CLS:
  putlog 'CLS: ' ch= j= level= status= delta=;
	if status = 0 then substr(string,j,1) = ' '; else
	if status=1 then do; * ) follows ( ;
	   cls(level) = j;
	   substr(string,opn(level),1) = ' '; /* remove ( */
	   substr(string,j,1) = ' ';          /* remove ) */
	   status=0;
	end; else
	if status=2 then do; * ) follows ( ( ;
	   cls(level) = j; 
	   substr(string,opn(level),1) = ' '; /* remove ( */
	   substr(string,j,1) = ' ';          /* remove ) */
	   if delta > 0 then do; delta=delta-1; level=level-1; end;
	   if delta=0 then status=1;
	end; else
	if status=3 then do;  * ) follows ) ;
	   level = level - delta;
	   if level=1 then status=2;
	end;
RETURN;
END_SCAN:
    do j=1 to lstr;
       if opn(j) ne . and cls(j) = . then substr(string,opn(j),1) = ' ';
       if cls(j) ne . and opn(j) = . then substr(string,cls(j),1) = ' ';
    end;
    status = 9;
RETURN;
RUN;
NN
Quartz | Level 8

Thank you Everyone for your amazing solutions.
Sorry for the delay in my reply.

Most of these solutions work perfectly for my requirement.
But I shall be going ahead with FreelanceReinhard's solution, as I was looking for a compact regex-based solution.

Thanks Again to all the Gurus.
Shmuel
Garnet | Level 18

 

@NN, though @FreelanceReinh's logic is brilliant, the regex did not produce the expected result, as far as I understand.

Here is the log from running the code with a minor modification, in order to display both the original value and the resulting value:

 

73         data want_F;
 74          set have(firstobs=5);
 75              _org = string;
 76              string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
 77              putlog _N_= _org= / '        ' string=;
 78         run;
 
 _N_=1 _org=(This is (a) test)
         string=This is ()a(  test
 _N_=2 _org=(This is a test
         string=This is a test
 _N_=3 _org=This is a test)
         string=This is a test
 _N_=4 _org=(This (is) a test
         string=This ()is(  a test
 _N_=5 _org=(This is) a test)
         string=()This is(  a test
 _N_=6 _org=)This (is a test
         string=This is a test
 _N_=7 _org=This (is) (a) test)
         string=This ()is(  ()a(  test
 NOTE: There were 7 observations read from the data set WORK.HAVE.

 

 

 

I have coded @FreelanceReinh's logic as a SAS data step, and it is really much shorter than my previous solution. Here is my new code based on this logic:

 

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  
   input string $&maxl..;	 
cards;
How (are (you (my friend)))
(This (is)) (a) test) 
(This (is)) ((a) test 
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;


data want;
 set have;
     _org = string;
	 do until (status=9);
	    link scan;
	 end;
	 string = translate(string,'  ()','()'||"FAFB"x);
     put _N_= _org= / '     ' string=;
RETURN;
SCAN:
    do i=1 to length(string);
       status = 9;
       if substr(string,i,1) = '(' then p = i;
       if substr(string,i,1) = ')' and p>0
          then do;
          substr(string,p,1) = 'FA'x;
          substr(string,i,1) = 'FB'x;
          p=0;
          status=0;
          leave;
       end;
    end;        
RETURN;
RUN;

The outcome copied from the log is:

 

 

 _N_=1 _org=How (are (you (my friend)))
     string=How (are (you (my friend)))
 _N_=2 _org=(This (is)) (a) test)
     string=(This (is)) (a) test
 _N_=3 _org=(This (is)) ((a) test
     string=(This (is))  (a) test
 _N_=4 _org=(This (is)) ((a) test).)
     string=(This (is)) ((a) test).
 _N_=5 _org=(This is (a) test)
     string=(This is (a) test)
 _N_=6 _org=(This is a test
     string=This is a test
 _N_=7 _org=This is a test)
     string=This is a test
 _N_=8 _org=(This (is) a test
     string=This (is) a test
 _N_=9 _org=(This is) a test)
     string=(This is) a test
 _N_=10 _org=)This (is a test
     string=This  is a test
 _N_=11 _org=This (is) (a) test)
     string=This (is) (a) test
 NOTE: There were 11 observations read from the data set WORK.HAVE.

 

 

NN
Quartz | Level 8

Hi @Shmuel ,

I had tested the second code he shared, the one with the loop, and that seemed to work fine.

I ran the same code with the records you shared; the result is below. Is there something I am missing here?

23         data want;
24         set have;
25         _org=string;
26         do until(string=lag(string));
27           string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
28         end;
29         string=translate(compress(string,"()"),"()","þÿ");
30         put _N_= _org= / '     ' string=;
31         run;

_N_=1 _org=How (are (you (my friend)))
     string=How (are (you (my friend)))
_N_=2 _org=(This (is)) (a) test)
     string=(This (is)) (a) test
_N_=3 _org=(This (is)) ((a) test
     string=(This (is)) (a) test
_N_=4 _org=(This (is)) ((a) test).)
     string=(This (is)) ((a) test).
_N_=5 _org=(This is (a) test)
     string=(This is (a) test)
_N_=6 _org=(This is a test
     string=This is a test
_N_=7 _org=This is a test)
     string=This is a test
_N_=8 _org=(This (is) a test
     string=This (is) a test
_N_=9 _org=(This is) a test)
     string=(This is) a test
_N_=10 _org=)This (is a test
     string=This is a test
_N_=11 _org=This (is) (a) test)
     string=This (is) (a) test
FreelanceReinh
Jade | Level 19

Hi @Shmuel,

 

Thanks for translating my regex solution into a data step using more elementary functions. I haven't checked the "translation," but it might be helpful for later readers who find regular expressions too cryptic.

 

I have checked the log you posted and I cannot replicate the results you obtained. Here is my log using the code copied from yours and also using your HAVE dataset (code):

105  data want_F;
106  set have(firstobs=5);
107  _org = string;
108  string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
109  putlog _N_= _org= / '        ' string=;
110  run;

_N_=1 _org=(This is (a) test)
        string=This is (a) test
_N_=2 _org=(This is a test
        string=This is a test
_N_=3 _org=This is a test)
        string=This is a test
_N_=4 _org=(This (is) a test
        string=This (is) a test
_N_=5 _org=(This is) a test)
        string=(This is) a test
_N_=6 _org=)This (is a test
        string=This is a test
_N_=7 _org=This (is) (a) test)
        string=This (is) (a) test
NOTE: There were 7 observations read from the data set WORK.HAVE.

Not sure how the strange results on your computer emerged. (Perhaps some sort of encoding issue?) To investigate it, I would store the intermediate results (i.e., of PRXCHANGE alone, then of COMPRESS) in separate variables. As mentioned earlier, the code version which also handles nested pairs of parentheses is that with the DO-UNTIL loop, but this is not the point here.
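
The diagnostic suggested above could look roughly like this. It is only a sketch: the dataset name DEBUG and the variables STEP1 to STEP3 and PH are made up for illustration, and HAVE is assumed to contain the character variable STRING.

```sas
data debug;
   set have;
   length step1 step2 step3 $ 60;   /* extra room in case a placeholder takes more than one byte */
   step1 = prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/", -1, string);   /* matched pairs -> placeholders */
   step2 = compress(step1, "()");                              /* drop unmatched parentheses   */
   step3 = translate(step2, "()", "þÿ");                       /* placeholders -> parentheses  */
   ph = "þÿ";                                                  /* the two placeholder characters */
   putlog _n_= string= / step1= / step2= / step3= / ph= $hex.;
run;
```

Comparing STEP1, STEP2 and STEP3 between the two environments should show at which step the results diverge, and the hex dump of PH shows how the placeholder characters are actually stored in the session.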

Shmuel
Garnet | Level 18

@NN, it is really strange; maybe it is a result of differences between our environments.

I have run the second code and got the same issue as before:

The code and sample data -

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  
   input string $&maxl..;	 
cards;
How (are (you (my friend)))
(This (is)) (a) test) 
(This (is)) ((a) test 
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want_F;
 set have;
     _org = string;
     do until(string=lag(string));
        string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
     end;
     string=translate(compress(string,"()"),"()","þÿ");
     putlog _N_= _org= / '        ' string=;
run;	 

the log with results -

122        data want_F;
 123         set have;
 124             _org = string;
 125             do until(string=lag(string));
 126                string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
 127             end;
 128             string=translate(compress(string,"()"),"()","þÿ");
 129             putlog _N_= _org= / '        ' string=;
 130        run;
 
 _N_=1 _org=How (are (you (my friend)))
         string=How are ()you ()my friend( (
 _N_=2 _org=(This (is)) (a) test)
         string=()This ()is( (  ()a(  test
 _N_=3 _org=(This (is)) ((a) test
         string=()This ()is( (  ()a(  test
 _N_=4 _org=(This (is)) ((a) test).)
         string=()This ()is( (  ()()a(  test(
 _N_=5 _org=(This is (a) test)
         string=()This is ()a(  test(
 _N_=6 _org=(This is a test
         string=This is a test
 _N_=7 _org=This is a test)
         string=This is a test
 _N_=8 _org=(This (is) a test
         string=This ()is(  a test
 _N_=9 _org=(This is) a test)
         string=()This is(  a test
 _N_=10 _org=)This (is a test
         string=This is a test
 _N_=11 _org=This (is) (a) test)
         string=This ()is(  ()a(  test
 NOTE: There were 11 observations read from the data set WORK.HAVE.

So, if it works fine in your environment, ignore the issue I have.

I'm using SAS UE with Oracle VM VirtualBox on Windows 10.

I have no explanation for the issue.

Shmuel
Garnet | Level 18

While @FreelanceReinh chose 'þ'='FE'x and 'ÿ'='FF'x, those characters behave differently in my environment, which is UTF-8:

 73         data a;
 74           x="þÿ";
 75           putlog x= $hex.;
 76         run;
 
 x=C3BEC3BF

This probably explains my issue.
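
In a UTF-8 session each of the two placeholder characters occupies two bytes (C3BE and C3BF in the hex dump), whereas TRANSLATE and COMPRESS work byte by byte. As far as I can tell, that is consistent with the "()" and "( " fragments in the logs above. A possible workaround, sketched under the assumption that the single-byte control characters '01'x and '02'x never occur in STRING; the variable RX is an illustrative name:

```sas
data want;
   set have;
   retain rx;
   /* compile the pattern once; '01'x and '02'x stand in for matched "(" and ")" */
   if _n_ = 1 then
      rx = prxparse('s/\(([^()]*)\)/' || '01'x || '$1' || '02'x || '/');
   /* repeat until nothing changes, so that nested pairs are handled as well */
   do until (string = lag(string));
      string = prxchange(rx, -1, string);
   end;
   /* remove the unmatched parentheses, then restore the matched ones */
   string = translate(compress(string, '()'), '()', '0102'x);
   drop rx;
run;
```

Because both placeholders are plain single-byte characters, the result should not depend on whether the session encoding is Latin1 or UTF-8.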
