Solved: Re: Perl help

sunilpusarla · Posted 09-24-2013 02:51 PM

Hi There,
Good afternoon. Although I know how to extract the string using functions like index, substr, scan, and find, I want to do it with the Perl functions.
I have started learning the Perl functions and I am stuck on the scenario below. I need to extract the text inside the last pair of parentheses (when an observation has more than one pair) into a new
variable. Before using the Perl substitution function, I want to check whether I am getting the correct string. I tried the following but am not
able to go further. Would anyone please show me where I am going wrong, or how to get the last parenthesis? (My syntax works for observations 1, 2 and
the last one, but not for the 3rd, where I get the string from the first parenthesis.)

/* Test/Example data */
data a;
length a $30;
a = 'MCHC (g/dL)';
output;
a = 'RDW (%)';
output;
a = 'Granulocyte (immature) (%)';
output;
a = 'ptt';
output;
run;

/* My tried code */
data b;
set a;
if _N_ eq 1 then do;
    rc = 1;
    retain rc;
    rc = prxparse('/\([\w|\W]|[\%]\)s+$/');
end;
new = prxmatch(rc,a);
run;

Any help will be appreciated.
Thanks,
Sunil

Patrick · Posted 09-28-2013 09:09 PM

Hi

Getting regular expressions right always takes a bit of trial and error. I believe the RegEx below should work.

And just for terminology: the functions used (e.g. prxmatch) are SAS functions that let you use Perl regular expression syntax. There are other variations of regular expression syntax (SAS initially implemented its own as well, used by functions like rxmatch). The Perl syntax is today a de facto standard and the right thing to use, as it is implemented in a lot of programming environments (so it's a transferable skill set for you).

data b;
set a;

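if _N_ eq 1 then
    do;
      retain rc;
      /* a parenthesis block that is not followed by any later ")" */
      rc = prxparse('/\([^\(\)]*\)(?!.*\))/');
      /* never executed; gives NEW the same type and length as A */
      if 0 then new=a;
    end;
/* position and length of the match (both 0 when there is no match) */
call prxsubstr(rc,a,pos,len);
/* drop the surrounding parentheses */
new = substrn(a,pos+1,len-2);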
run;

sunilpusarla · Posted 09-30-2013 02:08 PM

Thank you so much, Patrick. It worked perfectly. The "(?!.*\)" part is new to me; I will look it up in the documentation. SAS is my first programming language (and maybe my last). I come from a medical background and did not get the chance (and saw no need) to learn any other programming languages. But within SAS, I have learned many components of other languages (SQL etc.). Through the SAS communities I am learning a lot of new techniques and new ways of thinking. Thanks to the many SAS communities too.

Best,

Sunil

Vince28_Statcan · Posted 09-30-2013 02:40 PM

Hi Sunil,

"(?!.*\))"

(notice I added the closing parenthesis that was missing from what you wrote above)

is a zero-width negative lookahead assertion. That is, without consuming any characters, it scans the part of the string that has not been matched yet, up to the end of the string, and succeeds only if its sub-pattern cannot be matched there. If the sub-pattern can be matched, the assertion fails and the engine backtracks to see whether the earlier part of the regex can match in a different way.

In essence it does the following:

after

\([^\(\)]*\)

is matched,

the engine tries to match any number of characters (the .*) followed by a closing parenthesis (the \)). If that succeeds after the first portion has matched, the negative lookahead fails; the engine backtracks and looks for another way to satisfy the first part before trying the lookahead again. So Patrick's pattern does not have to consume everything from the start of the string. It simply tries to read a parenthesis block; when it finds one, the negative lookahead checks that no closing parenthesis occurs later in the string, and if none does, the regex signals a match. If a later closing parenthesis is found, the regex rejects that candidate, moves its starting position forward, and tries to find another block further along.
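
To see the effect of the lookahead, here is a minimal sketch that runs the block pattern with and without it on the third test value. The variable names (rc_first, rc_last, first_block, last_block) are made up for this illustration, and the expected log line in the comment assumes the pattern is written as \([^()]*\)(?!.*\)).

```sas
data _null_;
  length a $30;
  a = 'Granulocyte (immature) (%)';

  /* without the lookahead: the first parenthesis block wins */
  rc_first = prxparse('/\([^()]*\)/');
  /* with the lookahead: a block matches only if no ")" follows it */
  rc_last  = prxparse('/\([^()]*\)(?!.*\))/');

  call prxsubstr(rc_first, a, pos1, len1);
  call prxsubstr(rc_last,  a, pos2, len2);

  first_block = substrn(a, pos1, len1);
  last_block  = substrn(a, pos2, len2);

  /* expected in the log: first_block=(immature) last_block=(%) */
  put first_block= last_block=;
run;
```

Without the lookahead the first block, (immature), is returned; with it, only (%) qualifies, because no closing parenthesis follows it.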

For the sake of PERL fun, the strings

'RDW (%))'

'RDW ((%))'

would not produce a match with Patrick's approach, because the very last closing ) cannot be paired with \([^()]*

Don't get me wrong: it is likely that your data has no cases where it does not work. I'm merely pointing out the caveats of one Perl approach versus another.
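
A quick way to check these edge cases is a sketch like the one below, which feeds one well-formed and two malformed strings to the same pattern. It is only an illustration: the loop values and the variable names are chosen here, and the expected positions in the comments assume the pattern \([^()]*\)(?!.*\)).

```sas
data _null_;
  length a $30;
  rc = prxparse('/\([^()]*\)(?!.*\))/');

  do a = 'RDW (%)', 'RDW (%))', 'RDW ((%))';
    /* prxmatch returns the start position of the match, or 0 */
    pos = prxmatch(rc, a);
    /* expected: pos=5 for the first value, pos=0 for the other two */
    put a= pos=;
  end;
run;
```

A zero position for the last two values means the stray parenthesis silently produces an empty result rather than an error, so it may be worth flagging such rows separately.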

Vincent

sunilpusarla · Posted 10-01-2013 07:45 PM

Vince,

I appreciate your opinion and your sharing of knowledge. These look like common data entry mistakes that one should be aware of. It is definitely helpful for me. I would never take a healthy discussion the wrong way. In fact, these kinds of discussions are very helpful for gaining knowledge quickly and foreseeing problems, so that one can write robust code.

By the way, my approach, previous to this posting was:

if index(a,'(') then unit = strip(compress(substr(a, find(a,'(',-(length(a)))),'()'));
run;

I got 2 other responses on SAS-L which are a lot better.

Simple non-perl solution from 'Data _Null_':

if findc(a,')','b') then unit2 = scan(a,-1,')(','t');

and a Perl solution from Søren Lassen:

data b;
set a;
length unit $40.;
prxid=prxparse('/.*\(([^\)]*)/');
if prxmatch(prxid,a) then
unit=prxposn(prxid,1,a);
run;
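
For comparison, here is a rough sketch that runs the scan approach and the greedy pattern /.*\(([^\)]*)/ side by side, including the two malformed strings mentioned earlier in the thread. The data set name compare and the variables unit_scan and unit_prx are made up for this illustration.

```sas
data compare;
  length a $30 unit_scan unit_prx $30;
  /* greedy .* runs to the last "(", then everything up to ")" is captured */
  prxid = prxparse('/.*\(([^\)]*)/');

  do a = 'Granulocyte (immature) (%)', 'RDW (%))', 'RDW ((%))', 'ptt';
    call missing(unit_scan, unit_prx);
    /* last word when both "(" and ")" act as delimiters */
    if findc(a, ')', 'b') then unit_scan = scan(a, -1, ')(', 't');
    /* first capture group of the greedy pattern */
    if prxmatch(prxid, a) then unit_prx = prxposn(prxid, 1, a);
    output;
  end;

  drop prxid;
run;

proc print data=compare;
run;
```

By my reading, both approaches return % for all three parenthesised strings, including the malformed ones, and leave the value blank for ptt, but that is worth confirming in your own SAS session.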

Thank you,

Sunil
