Back reference in regex

PROC Star
Posts: 1,760

2 questions here. Since they are likely to be replied to by the same knowledgeable person, I ask them together.

Thank you for your insights.

1- Characters from the original string get into the changed string without being requested

data _null_;
  STR   = 'abcc'||'03'x||'d';

  * We match abcc, \3 means repeat group # 3 ;
  REGEX = '/(a)(\w*)(.+)\3/ ';   * grp1=a grp2=b grp3=c ;
  link parse;                    * Changed: 1=a 2=b 3=c|_d ;

  * We match abcc'03'x, \03 means octal 3 ;
  REGEX = '/(a)(\w*)(.+)\03/';   * grp1=a grp2=bc grp3=c ;
  link parse;                    * Changed: 1=a 2=bc 3=c|d ;
  stop;

  parse:
  PRX1 = prxparse(REGEX);
  call prxsubstr(PRX1, STR, POS, LEN);
  put POS= LEN=;
  PRX1 = prxparse(cats('s',REGEX,'1=\1 2=\2 3=\3|/'));
  A = prxchange(PRX1, -1, STR);
  put 'Changed: ' A /;
  return;
run;

Why does the character d get into the changed string (after the pipe character)? I never asked for it.

2- Group number 10 is created but not reused

data _null_;
  STR   = 'abcdefghijj'||'08'x||'b';                 * 8 hex = 10 octal ;

  * We match abcdefghijj, \10 means group # 10 ;
  REGEX = '/(a)(b)(c)(d)(e)(f)(g)(h)(\w*)(.+)\10/ '; * grp8=h grp9=i grp10=j   POS=1 LEN=11 ;
  link parse;                                        * Changed: 8=h 9=i 10=a0|_b ;

  * We match abcdefghijj'08'x, \010 means octal 10 ;
  REGEX = '/(a)(b)(c)(d)(e)(f)(g)(h)(\w*)(.+)\010/'; * grp8=h grp9=ij grp10=j  POS=1 LEN=12 ;
  link parse;                                        * Changed: 8=h 9=ij 10=a0|b ;
  stop;

  parse:
  PRX1 = prxparse(REGEX);
  call prxsubstr(PRX1, STR, POS, LEN);
  put POS= LEN=;
  PRX1 = prxparse(cats('s',REGEX,'8=\8 9=\9 10=\10|/'));
  A = prxchange(PRX1, -1, STR);
  put 'Changed: ' A /;
  return;
run;

Group number 10 is created, as shown by the length of the matched string (LEN=), but when I try to reuse it (after 10=), \10 is interpreted as group 1 followed by a zero rather than as group 10. Is this a SAS limitation or am I doing something I shouldn't?


Accepted Solutions
Solution
‎08-18-2015 03:03 AM
Frequent Contributor
Posts: 85

Re: Back reference in regex

PRXCHANGE changes only the matched sub-string within the full source string. The letter d at the end of STR is not part of the match, so it is not replaced and is retained in the result. Try adding a Z at the front of the STR value, for example, and you'll see this more clearly.

When specifying the replacements use a $ instead of the \ to specify the groups:

PRX1 = prxparse(cats('s',REGEX,'8=$8 9=$9 10=$10|/'));   
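Both points can be checked together with a minimal sketch. The test string below is hypothetical: it has a Z in front and a plain b at the end, and leaves out the '08'x character so that the output is fully printable. It assumes $10 resolves to group 10 in the replacement, as described above.

```sas
data _null_;
  /* Hypothetical test string: Z in front, b at the end, no control characters */
  STR  = 'Zabcdefghijjb';
  /* In the search part \10 is a back reference to group 10;            */
  /* in the replacement part the groups are written as $8, $9 and $10   */
  PRX1 = prxparse('s/(a)(b)(c)(d)(e)(f)(g)(h)(\w*)(.+)\10/8=$8 9=$9 10=$10|/');
  A    = prxchange(PRX1, -1, STR);
  /* Only the match abcdefghijj is replaced; Z and the trailing b lie outside it */
  put 'Changed: ' A;   /* expected: Changed: Z8=h 9=i 10=j|b */
run;
```

The Z and the final b survive because they are outside the matched sub-string, not because of anything in the replacement text.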

PROC Star
Posts: 1,760

Re: Back reference in regex

Posted in reply to JerryLeBreton

Thank you!

1- Why do \n substitution groups work for single digit groups? Is it a tolerance?

2- abcdefg are not carried over to the changed string. So to avoid characters being copied over, do they have to be in groups, or between groups as in the name swap in the SAS documentation, where the comma is lost?

data ReversedNames;
  NAME='Jones, Fred';
  NAME2= prxchange('s/(\w+), (\w+)/$2 $1/', -1, NAME);
  put NAME2=;
run;

NAME2=Fred Jones

Frequent Contributor
Posts: 85

Re: Back reference in regex

The \10 worked as expected to FIND a match; it was just the wrong syntax for the substitution.

And the abcdefg wasn't 'carried over' because it was part of the matching sub-string, which was replaced. Put a Z at the start of STR and you'll see.
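The same rule explains the name swap, and a small variation makes it visible. This is only a sketch; the input value is made up, with extra text on both sides of the name:

```sas
data _null_;
  /* Hypothetical input with text before and after the part that matches */
  NAME  = 'Dr. Jones, Fred (MD)';
  /* The match is "Jones, Fred": the comma is inside the match but not   */
  /* in the replacement, so it disappears; "Dr. " and " (MD)" are        */
  /* outside the match, so they are copied through unchanged             */
  NAME2 = prxchange('s/(\w+), (\w+)/$2 $1/', -1, NAME);
  put NAME2=;   /* expected: NAME2=Dr. Fred Jones (MD) */
run;
```

So being in a group or between groups is not what decides it. Anything inside the match is dropped unless the replacement puts it back, and anything outside the match is kept.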

I love regular expressions but they really do my head in.

PROC Star
Posts: 1,760

Re: Back reference in regex

Posted in reply to JerryLeBreton

Yes, I did add the Z and I understand now. I wasn't paying enough attention to how the "matching sub-string" is used when substituting.[1]

All is clear, thank you.

The last question is why \8 did work, not in the find part of the regex, but in 8=\8 in the substitution part of the regex.

[1] This is now also clear when regexes are used as a format, as shown in thread 281883, where the whole string has to be matched for the format to be applied.

It seems obvious now as I say it, but it puzzled me at first that we had to start and end with .* 

Frequent Contributor
Posts: 85

Re: Back reference in regex

Good question! 

Looks like \num is equivalent to $num in a substitution - as long as num is a single digit.  A syntax/context anomaly that a real expert might be able to comment on.
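A quick way to compare the two notations side by side is sketched below, reusing the name swap from earlier in the thread. Going by the results reported above, both calls should give the same value as long as the group number is a single digit. Perl itself accepts \1 to \9 on the replacement side of s/// as a legacy form of $1 to $9; that SAS PRX inherits this behaviour for the same reason is an assumption, not something confirmed here.

```sas
data _null_;
  NAME = 'Jones, Fred';
  /* Documented form: $n in the replacement part */
  A = prxchange('s/(\w+), (\w+)/$2 $1/', -1, NAME);
  /* Back-slash form: seems to be tolerated for single-digit groups only */
  B = prxchange('s/(\w+), (\w+)/\2 \1/', -1, NAME);
  put A= / B=;
run;
```

For group 10 and above, the $ form is the one that worked in this thread.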
