Re: PRXPARSE, it should be possible to combine selections??

Wouter · Posted 07-24-2008 04:41 PM

All,

I've got a data step in which I extract a number from a string. However, the strings change from time to time, though only in the part before the word "Incident" (sometimes some numbers appear there, but I don't need those). So what I can do is write two data steps with the statements:

PATTERN = PRXPARSE("/^Incident/"); --> start from this word

PATTERN = PRXPARSE("/\d\d\d\d\d?/"); --> collect the desired number

But it must be possible to combine these statements, I think? That would save a lot of time and space! 🙂 Thanks in advance!

Olivier · Posted 07-25-2008 10:23 AM

Hi Wouter.
Are you looking for something like this?
[pre]
DATA work.test (DROP = regExp) ;
INFILE CARDS DLM = ";" ;
INPUT text :$40. ;
RETAIN regExp ;
IF _N_=1 THEN regExp = PRXPARSE("/(I|i)ncident(\d+).?/") ;
IF PRXMATCH(regExp, text) THEN number = PRXPOSN(regExp, 2, text)+0 ;
CARDS ;
Incident124
No incident at all
Incident2 @ 12:00
IncidentABC
Incident3ABC
Incident 3ABC
;
RUN ;
[/pre]
Regards,
Olivier

Wouter · Posted 07-25-2008 05:11 PM

Well, right now I'm using:

data test2;
  set test1;
  * Compile the pattern once, on the first observation;
  if _n_ = 1 then do;
    PATTERN = PRXPARSE("/\d\d\d\d\d?/");

    IF MISSING(PATTERN) THEN DO;
      PUT "ERROR IN COMPILING REGULAR EXPRESSION";
      STOP;
    end;
  end;
  RETAIN PATTERN;
  * Locate the first run of 4 or 5 digits in the string;
  CALL PRXSUBSTR(PATTERN,test1,START,LENGTH);
  IF START GT 0 THEN DO;
    NUMBER = SUBSTR(test1,START,LENGTH);
    NUMBER = COMPRESS(NUMBER," ");
    OUTPUT;
  END;
run;

Test1 contains data like:
Incident 43244
Incident 894232
43243 Incident 44322
Incident 23
988 Incident 4322

So what I need is only the number after the word "Incident". Unfortunately, I can't check your code right now. I assume the (I|i) part is there to allow for both a capitalized and a lowercase "Incident"? And you use the PRXMATCH function; I think that's the only way if it isn't possible to write one PRXPARSE statement that covers two different cases (start from the word "Incident", which should be possible with the "^" option, and from there the "\d"s). I'll let you know, thanks!
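
A note on the "^" option: in a Perl regular expression, "^" anchors the match to the very start of the string, so "/^Incident/" cannot match a value such as "43243 Incident 44322". Here is a minimal sketch to check this. The variable names anchored, unanchored, pos_anchored and pos_unanchored are made up for illustration, and the two test strings are taken from the sample data above.
[pre]
data _null_;
  length text $40;
  * Compile both patterns - the step runs only once so no _N_ guard is needed;
  anchored   = prxparse("/^Incident/");
  unanchored = prxparse("/Incident/");
  * PRXMATCH returns the start position of the match, or 0 if there is none;
  do text = "Incident 43244", "43243 Incident 44322";
    pos_anchored   = prxmatch(anchored, text);
    pos_unanchored = prxmatch(unanchored, text);
    put text= pos_anchored= pos_unanchored=;
  end;
run;
[/pre]
For the second string, PRXMATCH should return 0 for the anchored pattern and 7 for the unanchored one, which is why the combined patterns in this thread leave the "^" out.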

Cynthia_sas · Posted 07-25-2008 06:35 PM

Hi,
This is a totally old-school example using INDEX and COMPRESS, but I threw in PRXMATCH (see below) to compare it with INDEX: both return the same results (compare FOUNDIT and FOUNDIT2). The COMPRESS/SUBSTR approach is not as elegant as the other solution, but it does the job.
cynthia

[pre]
data prxtest;
length grp $1 string $100;
infile datalines dsd dlm=',';
input grp $ string $;
return;
datalines;
a,"The 1st Incident was when 12345 (Mr. Dumpty) fell off the wall."
b,"The 2nd Incident was when 34567 (Ms. Muffet) fell off a stool."
c,"Has the 123 word Incident, but there are no numbers after 'Incident'."
;
run;

proc print data=prxtest;
title 'What does the data look like';
run;

data checkdata;
length gotnum 8.;
set prxtest;
retain lookfor ;

if _n_=1 then do;
** Create pattern with prxparse.;
lookfor = prxparse('/Incident/');
end;

** Prxmatch returns the location in Arg2,;
** where ARG1 begins.;
** Note how prxmatch and index return the same number;
** Do you really need prxparse/prxmatch?;
** Will Index function work for your data?;
foundit = prxmatch(lookfor,string);
foundit2 = index(string,'Incident');

** If the pattern has been found;
** substring out everything AFTER;
** the word "Incident". Then, compress;
** out the punctuation and upper and lower case letters.;
** What should be left are the numbers after the string Incident.;
if foundit gt 0 then do;
gotnum = input((compress(substr(string,foundit+8),'.,;:()','al')),8.0);
end;
else do;
gotnum = .;
end;
run;

proc print data=checkdata;
title 'Found "Incident" Got number';
run;
[/pre]

Wouter · Posted 07-26-2008 06:41 AM

Hi Cynthia,

The strange thing is that, with this code, the result is always 1 when there's no number before the word "Incident"; otherwise I get two numbers that I can't relate to the numbers before "Incident".

Wouter · Posted 07-26-2008 06:30 AM

Yes Olivier, thanks!!

I've changed the statement a little bit (because there's a space between "Incident" and the actual number), but it works perfectly!!

Right now, it is:

DATA work.test2 ;
set test;
RETAIN regExp ;
IF _N_=1 THEN regExp = PRXPARSE("/(I|i)ncident\s(\d+).?/") ;
IF PRXMATCH(regExp, text) THEN number = PRXPOSN(regExp, 2,text) ;
run;
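
For reference, here is a minimal sketch of a single combined pattern that covers both variants seen in this thread (with and without a space after "Incident"). It assumes the number always directly follows the word, with optional white space in between. The data set name work.demo and the inline sample lines (borrowed from earlier posts) are only for illustration. The /i modifier stands in for (I|i), \s* allows zero or more white-space characters, and INPUT converts the captured text to a numeric value.
[pre]
data work.demo (drop = regExp);
  infile datalines truncover;
  input text $char40.;
  retain regExp;
  * Compile the pattern once, on the first observation;
  if _n_ = 1 then regExp = prxparse("/incident\s*(\d+)/i");
  * Capture buffer 1 holds the digits that follow the word;
  if prxmatch(regExp, text) then number = input(prxposn(regExp, 1, text), 12.);
  datalines;
Incident 43244
43243 Incident 44322
988 Incident 4322
Incident124
No incident at all
;
run;
[/pre]
With these sample lines, the first four rows should give 43244, 44322, 4322 and 124, and the last row should leave number missing because no digits follow "incident".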
