sunilpusarla
Fluorite | Level 6

Hi There,
Good afternoon. Although I know how to extract the string using functions like index, substr, scan, and find, I want to do it with the Perl regular expression functions.
I have started learning the Perl functions and got stuck on the scenario below. I need to extract the text inside the last pair of parentheses (when an observation has more than one pair) into a new variable. Before using the Perl substitution function, I want to check whether I am getting the correct string. I tried the following but was not able to get any further. Would anyone please help me see where I am going wrong, or how to get the last pair of parentheses? (My syntax works for observations 1, 2 and the last one, but not for the 3rd, where I get the string starting from the first parenthesis.)

/* Test/Example data */
data a;
  length a $30;
  a = 'MCHC (g/dL)';
  output;
  a = 'RDW (%)';
  output;
  a = 'Granulocyte (immature) (%)';
  output;
  a = 'ptt';
  output;
run;

/* My tried code */
data b;
  set a;
  if _N_ eq 1 then do;
    rc = 1;
    retain rc;
    rc = prxparse('/\([\w|\W]|[\%]\)s+$/');
  end;
  new = prxmatch(rc,a);
run;


Any help will be appreciated.
Thanks,
Sunil

ACCEPTED SOLUTION
Patrick
Opal | Level 21

Hi

Getting regular expressions right always takes a bit of trial and error. I believe the RegEx below should work.

And just for terminology: the functions used (e.g. prxmatch) are SAS functions that allow you to use Perl Regular Expression syntax. There are other variations of Regular Expression syntax (SAS initially implemented its own as well, used by functions like rxmatch). The Perl syntax is today a quasi-standard and the right thing to use, as it is implemented in a lot of programming environments (so it's a transferable skill set for you).

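data b;
  set a;

  if _N_ eq 1 then
    do;
      retain rc;
      /* a "(...)" block without nested parentheses that is not followed by any later ")" */
      rc = prxparse('/\([^\(\)]*\)(?!.*\))/');
      /* never executed; at compile time it gives new the same type and length as a */
      if 0 then new=a;
    end;
  /* pos and len receive the start and length of the match; both are 0 when nothing matches */
  call prxsubstr(rc,a,pos,len);
  /* drop the enclosing parentheses; substrn returns an empty string when there was no match */
  new = substrn(a,pos+1,len-2);
run;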
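
As a possible variation (a sketch only, not tested against anything beyond the sample data above): wrapping the inner part of the same pattern in a capture group lets prxposn return the unit directly, so the pos+1 / len-2 arithmetic is not needed. The dataset name b2 and the variable name unit are illustrative assumptions, not part of the original question.

```sas
/* Sketch: same pattern as above, with a capture group around the unit text. */
/* b2 and unit are hypothetical names chosen for this example.              */
data b2;
  set a;
  length unit $30;
  retain rc;
  if _N_ eq 1 then
    rc = prxparse('/\(([^\(\)]*)\)(?!.*\))/');
  /* prxmatch returns the start position of the match, or 0 if there is none */
  if prxmatch(rc, a) then
    unit = prxposn(rc, 1, a);   /* text of capture group 1 */
run;
```

With the four sample observations this should give g/dL, %, % and a blank unit for 'ptt'.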


4 REPLIES 4
sunilpusarla
Fluorite | Level 6

Thank you so much, Patrick. It worked perfectly. The "(?!.*\)" part is new to me; I will look it up in the documentation. SAS is my first programming language (and maybe my last, too). I come from a medical background and did not get a chance (and thought there was no need) to learn any other programming languages. But within SAS, I have learned many components of other languages (SQL etc.). Through the SAS communities I am learning a lot of new techniques and new ways of thinking. Thanks to the many SAS communities, too.

Best,

Sunil

Vince28_Statcan
Quartz | Level 8

Hi Sunil,

"(?!.*\))"

(notice I added the closing parenthesis that was missing from what you wrote above)

is a zero-width negative lookahead assertion. That is, without consuming any characters, it looks at whatever has not been matched yet, from the current position up to the end of the string, and succeeds only if the pattern inside it cannot be matched there. If the assertion fails, the engine backtracks and checks whether the rest of the regex can match some other way.

In essence it does the following:

after

\([^\(\)]*\)

is matched,

attempt to match any number of arbitrary characters, denoted by .*, followed by a closing parenthesis, denoted by \). If such a match is found after the first portion has matched, the lookahead fails, and the engine backtracks and tries another way to satisfy the first part before doing another negative lookahead.

So in essence, Patrick's pattern doesn't try to describe everything from the start of the string. It merely attempts to read a parenthesis block; if it finds one, it performs the negative lookahead to make sure that no closing parenthesis follows it, and if the lookahead succeeds (no later closing parenthesis was found), it signals a match. If a later closing parenthesis was found by the ?!, the regex discards that candidate and resumes the search further along the string, attempting to find another block later on.
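
To see the effect of the lookahead in isolation, here is a small sketch that runs the same pattern with and without it on the third test string. The variable names plain, lookahead, pos_plain and pos_lookahead are made up for this illustration.

```sas
/* Sketch: compare the pattern with and without the negative lookahead. */
/* All variable names here are illustrative only.                       */
data _null_;
  a = 'Granulocyte (immature) (%)';
  plain     = prxparse('/\([^\(\)]*\)/');
  lookahead = prxparse('/\([^\(\)]*\)(?!.*\))/');
  pos_plain     = prxmatch(plain, a);      /* 13: the "(" of "(immature)" */
  pos_lookahead = prxmatch(lookahead, a);  /* 24: the "(" of "(%)"        */
  put pos_plain= pos_lookahead=;
run;
```

Without the lookahead the first block wins; with it, "(immature)" is rejected because a ")" still follows, and the match moves on to "(%)".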

For the sake of Perl fun, the strings

'RDW (%))'

'RDW ((%))'

would not produce a match with Patrick's approach, because the very last closing ) cannot be paired with \([^()]*

Don't get me wrong: it is likely that you won't have such cases in your data. I'm merely pointing out the caveats of one Perl approach versus another.
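
If you want to check this yourself, something along these lines should do (a quick sketch; the well-formed string 'RDW (%)' is included only for comparison):

```sas
/* Sketch: Patrick's pattern against one well-formed and two malformed strings. */
data _null_;
  length a $30;
  rc = prxparse('/\([^\(\)]*\)(?!.*\))/');
  do a = 'RDW (%)', 'RDW (%))', 'RDW ((%))';
    pos = prxmatch(rc, a);   /* 5 for the first string, 0 for the other two */
    put a= pos=;
  end;
run;
```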

Vincent

sunilpusarla
Fluorite | Level 6

Vince,

I appreciate your opinion and your sharing of knowledge. These look like common data entry mistakes that one should be aware of. It is definitely helpful for me. I would never take a healthy discussion the wrong way. In fact, these kinds of discussions are very helpful for gaining knowledge quickly and foreseeing problems, so that one can write robust code.

By the way, my approach, previous to this posting was:

if index(a,'(') then
  unit = strip(compress(substr(a, find(a,'(',-(length(a)))), '()'));
run;

I got 2 other responses on SAS-L which are a lot better.

Simple non-perl solution from 'Data _Null_':

if findc(a,')','b') then unit2 = scan(a,-1,')(','t');

and, a perl solution from Søren Lassen:

data b;
  set a;
  length unit $40.;
  prxid=prxparse('/.*\(([^\)]*)/');
  if prxmatch(prxid,a) then
    unit=prxposn(prxid,1,a);
run;
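
Out of curiosity, here is a rough sketch for comparing these two SAS-L approaches on the malformed strings Vince mentioned. The variable names unit_scan and unit_prx are my own, chosen just for this test. Note the 't' modifier in scan: it trims the trailing blanks of a, so the padding after the last ")" is not returned as the last word.

```sas
/* Sketch: the scan and prxposn approaches side by side on edge cases. */
/* unit_scan and unit_prx are hypothetical names for this comparison.  */
data _null_;
  length a $30 unit_scan unit_prx $30;
  prxid = prxparse('/.*\(([^\)]*)/');
  do a = 'RDW (%)', 'RDW (%))', 'RDW ((%))', 'ptt';
    call missing(unit_scan, unit_prx);
    if findc(a, ')', 'b') then unit_scan = scan(a, -1, ')(', 't');
    if prxmatch(prxid, a) then unit_prx = prxposn(prxid, 1, a);
    put a= unit_scan= unit_prx=;
  end;
run;
```

Both approaches should return % for the first three strings and leave the result blank for 'ptt', since neither one requires the parentheses to be balanced.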

Thank you,

Sunil
