SAS Programming

NN · Posted 08-28-2020 02:27 PM

Hi All,

Is there a quick way to remove unbalanced parentheses/brackets from a string?

For example, in the table below, can I get the column "want" as the result, with just the matched parentheses retained?

HAVE	WANT
(This is a test)	(This is a test)
(This is a test	This is a test
This is a test)	This is a test
(This (is) a test	This (is) a test
(This is) a test)	(This is) a test
)This (is a test	This is a test

PaigeMiller · Posted 08-28-2020 04:15 PM

@NN wrote:

Is there a quick way to remove unbalanced parentheses/brackets from a string?

Quick way? No.

You'd have to define a very clear and comprehensive set of rules and then program them up.

--
Paige Miller

ballardw · Posted 08-28-2020 04:39 PM

It is relatively easy to determine when a mismatch occurs. Consider:

data junk;
   x = "(This is) a test)";
   left = countc(x,"(");
   right= countc(x,")");
run;

If Left is not equal to right then there is a mismatch.

But deciding which parentheses actually pair up means parsing the value, and that is a good part of what writing a syntax parser for a programming language involves. For just one example of what goes on:

https://www.cs.iusb.edu/~dvrajito/teach/c311/c311_11_parser.html

The "trivial" mismatch case of exactly one of the parentheses appearing is easy enough.

data junk;
   x = "(This is a test";
   left = countc(x,"(");
   right= countc(x,")");
   if left ne right and max(left, right) = 1 then do;
      if left then y=compress(x,'(');
      else y=compress(x,')');
   end;
run;

But as soon as you get into your other examples, you have to parse the value character by character, keeping track of the positions and order of the ( and ) characters. Only after you have examined the whole value can you start applying rules, likely working from the "middle" of something outwards.
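A minimal sketch of that character-by-character bookkeeping, under one assumed rule: every ")" pairs with the nearest still-open "(" to its left, and whatever finds no partner is dropped. The names WANT2, RESULT, STACK and MATCHED are made up for illustration, and the 200-byte limit is an assumption about the data, not something from the question.

```sas
/* Sketch only: WANT2, RESULT, STACK and MATCHED are hypothetical names. */
/* Assumes HAVE has a character variable STRING of at most 200 bytes.   */
data want2;
   set have;
   length result $200 ch $1;
   array stack   {200} _temporary_;  /* positions of '(' still waiting for a ')' */
   array matched {200} _temporary_;  /* 1 = parenthesis at this position has a partner */

   /* temporary arrays are retained, so reset the bookkeeping for each row */
   top = 0;
   do j = 1 to dim(matched);
      matched{j} = 0;
   end;

   /* Pass 1: pair every ')' with the nearest unpaired '(' to its left */
   do j = 1 to length(string);
      ch = char(string, j);
      if ch = '(' then do;
         top = top + 1;
         stack{top} = j;
      end;
      else if ch = ')' and top > 0 then do;
         matched{stack{top}} = 1;
         matched{j} = 1;
         top = top - 1;
      end;
   end;

   /* Pass 2: copy everything except parentheses that found no partner */
   k = 0;
   do j = 1 to length(string);
      ch = char(string, j);
      if ch in ('(', ')') and not matched{j} then continue;
      k = k + 1;
      substr(result, k, 1) = ch;
   end;

   drop top j k ch;
run;
```

With this rule, "(This (is) a test" loses its first parenthesis and "(This is) a test)" loses its last one, which agrees with the WANT column in the question. A different pairing rule would need a different pass 1.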

Shmuel · Posted 08-28-2020 10:13 PM

Here is tested code:

/***
  https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
  
  input - a string 
  target - remove unbalanced ()
  algorithm:
     scan the string 
	 keep positions of '(' followed by ')' and remove them
	 repeat until found:
	   - ( without ) and remove the (
	   - ) preceding (  and remove the )
	   - none found
	 insert pairs of () back into their positions
***/

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  /* adapt to max length */
   input string $&maxl..;	 
cards;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;

data want;
 set have;
     /* use i as indicator to insert couple of parenthesis */
     array opn {&maxl} op1-op&maxl;  /* position of '(' */
	 array cls {&maxl} cl1-cl&maxl;  /* position of ')' */
	 
	 status = 0; /* 0=initial, 1=open found, 2=match found, 9=exit */
	 i=1;
	 do until (status=9);
	    link scan;
	 end;
	 /* insert parenthesis back to string */
	 if i>1 then do j=1 to i-1;
	    substr(string,opn(j),1) = '(';   
	    substr(string,cls(j),1) = ')';   
	 end;
	 keep string;
RETURN;
SCAN:
	 do j =1 to length(string);
	    ch = substr(string,j,1);
		if status = 0 and ch=')' then substr(string,j,1) = ' '; else
		if ch = '(' then do;
		   status=1; opn(i)=j;
		end;
		if status = 1 and ch = ')' then do;
		   status = 2; cls(i) = j;
		   substr(string,opn(i),1) = ' '; /* remove ( */
		   substr(string,j,1) = ' ';      /* remove ) */
		   i+1;
		 end;
		if j < length(string) then do;
		   if status=1 then substr(string,opn(i),1) = ' '; 
		   if status=2 then status = 0; 
		end; else status = 9;
	end;
RETURN;
RUN;

FreelanceReinh · Posted 08-29-2020 06:04 AM

Hi @NN,

Quick way? Yes, if (like in your examples) there are no nested pairs of parentheses and you can specify two characters which are not present in the strings to work with.

data want;
set have;
string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
run;

The PRXCHANGE function replaces the parentheses in substrings of the form "(text without parentheses)" by replacement characters (I chose 'þ'='FE'x and 'ÿ'='FF'x, but you may want to use others). The remaining, hence unmatched parentheses are removed by the COMPRESS function. Finally, the TRANSLATE function restores the replaced parentheses.

If nested pairs of parentheses may occur in string, apply PRXCHANGE repetitively:

data want;
set have;
do until(string=lag(string));
  string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
end;
string=translate(compress(string,"()"),"()","þÿ");
run;

Of course, this implies rules to decide, e.g., which of the three parentheses in your fourth and fifth example is regarded as unmatched.

Ksharp · Posted 08-29-2020 09:43 AM

data have;
   input string $40.;	 
cards;
(This is a test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;

data want;
 set have;
 _string=translate(string,'  ','()');
 pid=prxparse('/\([^()]+\)/');
 s=1;e=length(string);
 call prxnext(pid,s,e,string,p,l);
 do while(p>0);
   substr(_string,p,1)='(';
   substr(_string,p+l-1,1)=')';
   call prxnext(pid,s,e,string,p,l);
 end;
keep string _string;
run;

MINX · Posted 08-31-2020 11:34 AM

Nice code. However, it does not seem to work when an outer pair of parentheses contains a nested pair. For example:
(This (is)) (a) test)

Shmuel · Posted 09-01-2020 01:26 AM

@MINX thank you for the challenge.

I have added your example and some more, and fixed the code.

I left all my debugging statements in for those who may be interested; those lines start with /*DBG*/ putlog.

/***
  https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
***/

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  
   input string $&maxl..;	 
cards;
How (are (you (my friend)))
(This (is)) (a) test) 
(This (is)) ((a) test 
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;


data want;
 set have;
     /* level used as indicator to insert couple of parenthesis */
     array opn {&maxl} op1-op&maxl;  /* position of '(' */
	 array cls {&maxl} cl1-cl&maxl;  /* position of ')' */
	 _org = string; 
/*DBG*/ putlog _N_= string=;
 
/*   status:                */
/*      0=initial,          */
/*      1=open,             */
/*      2="(" follows "(",  */
/*      3=")" follows "(",  */
/*      9=exit              */

	 status = 0; level=0; delta=0;
	 max_level = 0;
	 do until (status=9);
	    link scan;
	 end;
	 /* insert parenthesis back to string */
	 do j=1 to length(string);
	    if opn(j) ^= . and cls(j) ^= . then do;;
	    substr(string,opn(j),1) = '(';   
	    substr(string,cls(j),1) = ')';   
	 end; end;
/*DBG*/ putlog _N_= string=;
	 keep _org string;
RETURN;
SCAN:
     lstr = length(string); 
/*DBG*/ putlog _N_= 'length(string)=' lstr;
	 do j =1 to lstr;
	    if status=9 then leave;
	    ch = substr(string,j,1);
	    if ch = '(' then link opn; else
	    if ch = ')' then link cls; 
    	if j ge length(string) then link end_scan;
	 end;
	 
RETURN;
OPN:
     level = max_level +1;
	 max_level = level;
     opn(level)=j;
     if status = 0 then status=1; else
     if status = 1 then status=2; 
     delta=delta+1; 
/*DBG*/ putlog j= ch= level= status= delta= op1= op2= op3= op4=;
     ch = ' ';
RETURN;
CLS:
/*DBG*/ putlog 'CLS: ' ch= j= level= status= delta= max_level=;
	if status = 0 then substr(string,j,1) = ' '; else
    if delta=0 then do; 
/*DBG*/ putlog 'CLS: 0';
       level = max_level+1;
       opn(level) = .; 
       cls(level) = j;  
       substr(string,j,1) = ' ';
       return; 
    end;
	if status=1 then do; * ) follows ( ;
	   cls(level) = j;
	   substr(string,opn(level),1) = ' '; /* remove ( */
	   substr(string,j,1) = ' ';          /* remove ) */
	   status=0;
	end; else
	if status=2 then do; * ) follows ( ( ;
	   cls(level) = j; 
	   substr(string,opn(level),1) = ' '; /* remove ( */
	   substr(string,j,1) = ' ';          /* remove ) */
	   if delta > 0 then do; delta=delta-1; level=level-1; end;
	   if delta=0 then status=1;
	end; else
	if status=3 then do;  * ) follows ) ;
	   level = level - delta;
	   if level=1 then status=2;
	end;
/*DBG*/ putlog 'CLS: ' ch= j= level= status= delta= /
         '>>>>> ' op1= op2= op3= op4= cl1= cl2= cl3= cl4= ;
RETURN;
END_SCAN:
    do j=1 to lstr;
       if opn(j) ne . and cls(j) = . then substr(string,opn(j),1) = ' ';
       if cls(j) ne . and opn(j) = . then substr(string,cls(j),1) = ' ';
    end;
    status = 9;
RETURN;
RUN;

Shmuel · Posted 08-31-2020 10:31 AM

I have added some test cases to deal with nested parentheses.

The following code is tested and seems to do the job as requested:

/***
  https://communities.sas.com/t5/SAS-Programming/Remove-unbalanced-parentheses-in-a-string/m-p/680071
***/

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  
   input string $&maxl..;	 
cards;
How (are (you (my friend)))
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;


data want;
 set have;
  *IF _N_ in (1 5 8);
     /* level used as indicator to insert couple of parenthesis */
     array opn {&maxl} op1-op&maxl;  /* position of '(' */
	 array cls {&maxl} cl1-cl&maxl;  /* position of ')' */
	 _org = string;
 
/*   status:                */
/*      0=initial,          */
/*      1=open,             */
/*      2="(" follows "(",  */
/*      3=")" follows "(",  */
/*      9=exit              */

	 status = 0; level=0; delta=0;
	 do until (status=9);
	    link scan;
	 end;
	 /* insert parenthesis back to string */
	 do j=1 to length(string);
	    if opn(j) ^= . and cls(j) ^= . then do;;
	    substr(string,opn(j),1) = '(';   
	    substr(string,cls(j),1) = ')';   
	 end; end;
	 keep _org string;
RETURN;
SCAN:
     lstr = length(string); 
	 do j =1 to lstr;
	    if status=9 then leave;
	    ch = substr(string,j,1);
	    if ch = '(' then link opn; else
	    if ch = ')' then link cls; 
    	if j ge length(string) then link end_scan;
	 end;
	 
RETURN;
OPN:
     level+1;
     opn(level)=j;
     if status = 0 then status=1; else
     if status = 1 then status=2; 
     delta=delta+1; 
     ch = ' ';
RETURN;
CLS:
  putlog 'CLS: ' ch= j= level= status= delta=;
	if status = 0 then substr(string,j,1) = ' '; else
	if status=1 then do; * ) follows ( ;
	   cls(level) = j;
	   substr(string,opn(level),1) = ' '; /* remove ( */
	   substr(string,j,1) = ' ';          /* remove ) */
	   status=0;
	end; else
	if status=2 then do; * ) follows ( ( ;
	   cls(level) = j; 
	   substr(string,opn(level),1) = ' '; /* remove ( */
	   substr(string,j,1) = ' ';          /* remove ) */
	   if delta > 0 then do; delta=delta-1; level=level-1; end;
	   if delta=0 then status=1;
	end; else
	if status=3 then do;  * ) follows ) ;
	   level = level - delta;
	   if level=1 then status=2;
	end;
RETURN;
END_SCAN:
    do j=1 to lstr;
       if opn(j) ne . and cls(j) = . then substr(string,opn(j),1) = ' ';
       if cls(j) ne . and opn(j) = . then substr(string,cls(j),1) = ' ';
    end;
    status = 9;
RETURN;
RUN;

NN · Posted 09-01-2020 06:08 AM

Thank you, everyone, for your amazing solutions.
Sorry for the delay in my reply.

Most of these solutions work perfectly for my requirement.
But I shall be going ahead with FreelanceReinh's solution, as I was looking for a compact, regex-based solution.

Thanks again to all the gurus.

Shmuel · Posted 09-01-2020 11:23 AM

@NN, although @FreelanceReinh's logic is brilliant, the regex did not, as far as I can tell, produce the expected result.

Here is the log from running the code with a minor modification, so that both the original value and the resulting value are displayed in the log:

73         data want_F;
 74          set have(firstobs=5);
 75              _org = string;
 76              string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
 77              putlog _N_= _org= / '        ' string=;
 78         run;
 
 _N_=1 _org=(This is (a) test)
         string=This is ()a(  test
 _N_=2 _org=(This is a test
         string=This is a test
 _N_=3 _org=This is a test)
         string=This is a test
 _N_=4 _org=(This (is) a test
         string=This ()is(  a test
 _N_=5 _org=(This is) a test)
         string=()This is(  a test
 _N_=6 _org=)This (is a test
         string=This is a test
 _N_=7 _org=This (is) (a) test)
         string=This ()is(  ()a(  test
 NOTE: There were 7 observations read from the data set WORK.HAVE.

I have coded @FreelanceReinh's logic as a SAS data step, and it is really much shorter than my previous solution. Here is my new code based on this logic:

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  
   input string $&maxl..;	 
cards;
How (are (you (my friend)))
(This (is)) (a) test) 
(This (is)) ((a) test 
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;


data want;
 set have;
     _org = string;
	 do until (status=9);
	    link scan;
	 end;
	 string = translate(string,'  ()','()'||"FAFB"x);
     put _N_= _org= / '     ' string=;
RETURN;
SCAN:
    do i=1 to length(string);
       status = 9;
       if substr(string,i,1) = '(' then p = i;
       if substr(string,i,1) = ')' and p>0
          then do;
          substr(string,p,1) = 'FA'x;
          substr(string,i,1) = 'FB'x;
          p=0;
          status=0;
          leave;
       end;
    end;        
RETURN;
RUN;

The outcome, copied from the log, is:

 _N_=1 _org=How (are (you (my friend)))
     string=How (are (you (my friend)))
 _N_=2 _org=(This (is)) (a) test)
     string=(This (is)) (a) test
 _N_=3 _org=(This (is)) ((a) test
     string=(This (is))  (a) test
 _N_=4 _org=(This (is)) ((a) test).)
     string=(This (is)) ((a) test).
 _N_=5 _org=(This is (a) test)
     string=(This is (a) test)
 _N_=6 _org=(This is a test
     string=This is a test
 _N_=7 _org=This is a test)
     string=This is a test
 _N_=8 _org=(This (is) a test
     string=This (is) a test
 _N_=9 _org=(This is) a test)
     string=(This is) a test
 _N_=10 _org=)This (is a test
     string=This  is a test
 _N_=11 _org=This (is) (a) test)
     string=This (is) (a) test
 NOTE: There were 11 observations read from the data set WORK.HAVE.

NN · Posted 09-01-2020 03:17 PM

Hi @Shmuel ,

I had tested the second code he shared, the one with the loop, and that seemed to work fine.

I ran the same code with the records you shared, and the result is below. Is there something that I am missing here?

23         data want;
24         set have;
25         _org=string;
26         do until(string=lag(string));
27           string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
28         end;
29         string=translate(compress(string,"()"),"()","þÿ");
30         put _N_= _org= / '     ' string=;
31         run;

_N_=1 _org=How (are (you (my friend)))
     string=How (are (you (my friend)))
_N_=2 _org=(This (is)) (a) test)
     string=(This (is)) (a) test
_N_=3 _org=(This (is)) ((a) test
     string=(This (is)) (a) test
_N_=4 _org=(This (is)) ((a) test).)
     string=(This (is)) ((a) test).
_N_=5 _org=(This is (a) test)
     string=(This is (a) test)
_N_=6 _org=(This is a test
     string=This is a test
_N_=7 _org=This is a test)
     string=This is a test
_N_=8 _org=(This (is) a test
     string=This (is) a test
_N_=9 _org=(This is) a test)
     string=(This is) a test
_N_=10 _org=)This (is a test
     string=This is a test
_N_=11 _org=This (is) (a) test)
     string=This (is) (a) test

FreelanceReinh · Posted 09-01-2020 04:00 PM

Hi @Shmuel,

Thanks for translating my regex solution into a data step using more elementary functions. I haven't checked the "translation," but it might be helpful for later readers who find regular expressions too cryptic.

I have checked the log you posted and I cannot replicate the results you obtained. Here is my log using the code copied from yours and also using your HAVE dataset (code):

105  data want_F;
106  set have(firstobs=5);
107  _org = string;
108  string=translate(compress(prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string),"()"),"()","þÿ");
109  putlog _N_= _org= / '        ' string=;
110  run;

_N_=1 _org=(This is (a) test)
        string=This is (a) test
_N_=2 _org=(This is a test
        string=This is a test
_N_=3 _org=This is a test)
        string=This is a test
_N_=4 _org=(This (is) a test
        string=This (is) a test
_N_=5 _org=(This is) a test)
        string=(This is) a test
_N_=6 _org=)This (is a test
        string=This is a test
_N_=7 _org=This (is) (a) test)
        string=This (is) (a) test
NOTE: There were 7 observations read from the data set WORK.HAVE.

Not sure how the strange results on your computer emerged. (Perhaps some sort of encoding issue?) To investigate it, I would store the intermediate results (i.e., of PRXCHANGE alone, then of COMPRESS) in separate variables. As mentioned earlier, the code version which also handles nested pairs of parentheses is that with the DO-UNTIL loop, but this is not the point here.
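A rough sketch of that step-by-step check, splitting the one-liner into its three function calls. DEBUG_F and STEP1 to STEP3 are names made up here; the generous $120 length is only a precaution in case the placeholder characters take more than one byte each in the session encoding.

```sas
/* Sketch only: DEBUG_F and STEP1-STEP3 are hypothetical names */
data debug_f;
   set have;
   length step1 step2 step3 $120;
   /* matched pairs -> placeholder characters */
   step1 = prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/", -1, string);
   /* drop the parentheses that are left over */
   step2 = compress(step1, "()");
   /* placeholder characters -> parentheses */
   step3 = translate(step2, "()", "þÿ");
   putlog _n_= string= / step1= / step2= / step3=;
run;
```

If STEP1 and STEP2 look right but STEP3 does not, the TRANSLATE call is where the environments differ.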

Shmuel · Posted 09-01-2020 04:18 PM

@NN, it is really strange, and maybe it is a result of differences between our environments.

I have run the second code and got the same issue as before.

The code and sample data:

%let maxl=30; /* max string length */
data have;
   length string $&maxl;  
   input string $&maxl..;	 
cards;
How (are (you (my friend)))
(This (is)) (a) test) 
(This (is)) ((a) test 
(This (is)) ((a) test).)
(This is (a) test)
(This is a test
This is a test)
(This (is) a test
(This is) a test)
)This (is a test
This (is) (a) test)
; run;
data want_F;
 set have;
     _org = string;
     do until(string=lag(string));
        string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
     end;
     string=translate(compress(string,"()"),"()","þÿ");
     putlog _N_= _org= / '        ' string=;
run;

The log with results:

122        data want_F;
 123         set have;
 124             _org = string;
 125             do until(string=lag(string));
 126                string=prxchange("s/\(([^\(\)]*)\)/þ$1ÿ/",-1,string);
 127             end;
 128             string=translate(compress(string,"()"),"()","þÿ");
 129             putlog _N_= _org= / '        ' string=;
 130        run;
 
 _N_=1 _org=How (are (you (my friend)))
         string=How are ()you ()my friend( (
 _N_=2 _org=(This (is)) (a) test)
         string=()This ()is( (  ()a(  test
 _N_=3 _org=(This (is)) ((a) test
         string=()This ()is( (  ()a(  test
 _N_=4 _org=(This (is)) ((a) test).)
         string=()This ()is( (  ()()a(  test(
 _N_=5 _org=(This is (a) test)
         string=()This is ()a(  test(
 _N_=6 _org=(This is a test
         string=This is a test
 _N_=7 _org=This is a test)
         string=This is a test
 _N_=8 _org=(This (is) a test
         string=This ()is(  a test
 _N_=9 _org=(This is) a test)
         string=()This is(  a test
 _N_=10 _org=)This (is a test
         string=This is a test
 _N_=11 _org=This (is) (a) test)
         string=This ()is(  ()a(  test
 NOTE: There were 11 observations read from the data set WORK.HAVE.

So, if it works fine in your environment, ignore the issue I have.

I'm using SAS UE in an Oracle VM VirtualBox on Windows 10.

I have no explanation for the issue.

Shmuel · Posted 09-01-2020 06:29 PM

@FreelanceReinh chose 'þ'='FE'x and 'ÿ'='FF'x, but those characters behave differently in my environment, which is UTF-8:

 73         data a;
 74           x="þÿ";
 75           putlog x= $hex.;
 76         run;
 
 x=C3BEC3BF

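This probably explains my issue: in UTF-8 each of the two characters occupies two bytes, and TRANSLATE works byte by byte, so 'þ' (C3BE) comes back as "()" and 'ÿ' (C3BF) as "(" plus a blank, which is the pattern in my log.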
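One possible workaround, sketched here and not tested on SAS UE: use placeholder bytes that are single-byte in every common session encoding, such as the control characters '01'x and '02'x. This assumes the data never contains those two bytes; WANT_ASCII, RID, BEFORE and CHANGED are names invented for the example.

```sas
/* Sketch only: assumes STRING never contains the bytes '01'x or '02'x. */
/* WANT_ASCII, RID, BEFORE and CHANGED are hypothetical names.          */
data want_ascii;
   set have;
   retain rid;
   /* compile the substitution once; the placeholders are spliced in as raw bytes */
   if _n_ = 1 then
      rid = prxparse(cats('s/\(([^()]*)\)/', '01'x, '$1', '02'x, '/'));

   /* repeat until a pass changes nothing, so nested pairs are handled too */
   do until (not changed);
      before  = string;
      string  = prxchange(rid, -1, string);
      changed = (string ne before);
   end;

   /* remove leftover parentheses, then turn the placeholders back into () */
   string = translate(compress(string, '()'), '()', '0102'x);
   drop rid before changed;
run;
```

Because each placeholder is one byte, the byte-wise TRANSLATE and COMPRESS calls see exactly one byte per marked parenthesis, and the string keeps its length.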
