DATA Step, Macro, Functions and more

regular expression

10MG/2ML
5MG/ML
12MG KIT
0.4MG
24MG KIT
5MG
0.6MG
10MG/2ML
5MGFLEX
10MG

For the above data I used the following regular expression:

if _N_ = 1 then RE = PRXPARSE ("/ \d{1,5}\.?\d{0,4}\?mg/i");
retain RE;
call PRXSUBSTR(RE,Drug_Strength_Name,START,LENGTH);
if START GT 0 then do;
str = SUBSTRN(Drug_Strength_Name,START+1 ,LENGTH-1 );
output;
end;
run;


It doesn't return any values at all.

Re: regular expression

Suggest you add some SAS diagnostic statements ("nn" increments for easier correlation in code to log):

PUTLOG '>DIAGnn ' _all_;

Hopefully this additional info will help you diagnose the problem systematically.
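
As a minimal sketch of that suggestion, the step below drops two PUTLOG statements into the code from the question. The DIAG01/DIAG02 labels and the three sample rows are illustrative choices, not part of the original program. With these rows, START should come back as 0 every time, which points at the pattern rather than at the SUBSTRN call.

```sas
data _null_;
   retain re;
   infile datalines;
   input Drug_Strength_Name $ 1-9;

   if _n_ = 1 then do;
      /* pattern exactly as posted in the question */
      re = prxparse("/ \d{1,5}\.?\d{0,4}\?mg/i");
      /* a missing value here would mean the pattern failed to compile */
      putlog '>DIAG01 ' re=;
   end;

   call prxsubstr(re, Drug_Strength_Name, start, length);
   /* dump every variable so START and LENGTH can be checked per row */
   putlog '>DIAG02 ' _all_;
datalines;
10MG/2ML
0.4MG
12MG KIT
;
```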

Scott Barry
SBBWorks, Inc.

Re: regular expression

This program works for the data set you posted:
[pre]
data _null_;
retain re;
infile datalines;
input Drug_Strength_Name $ 1-9;

if _N_ = 1 then
re = prxparse("/\d+(\.\d+)?MG((\/\dML)| KIT|FLEX)?/i");
call prxsubstr(re,Drug_Strength_Name, start, length);
if start gt 0 then do;
str = substrn(Drug_Strength_Name, start, length);
put str;
end;
datalines;
10MG/2ML
5MG/ML
12MG KIT
0.4MG
24MG KIT
5MG
0.6MG
10MG/2ML
5MGFLEX
10MG
;;;;
[/pre]
Log:
[pre]
704 data _null_;
705 retain re;
706 infile datalines;
707 input Drug_Strength_Name $ 1-9;
708
709 if _N_ = 1 then
710 re = prxparse("/\d+(\.\d+)?MG((\/\dML)| KIT|FLEX)?/i");
711 call prxsubstr(re,Drug_Strength_Name, start, length);
712 if start gt 0 then do;
713 str = substrn(Drug_Strength_Name, start, length);
714 put str;
715 end;
716 datalines;

10MG/2ML
5MG
12MG KIT
0.4MG
24MG KIT
5MG
0.6MG
10MG/2ML
5MGFLEX
10MG
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds


727 ;;;;
[/pre]

Re: regular expression

Hi SASPhile,
Tim@SAS gave the perfect regular expression.
I came up with re = PRXPARSE ("/[0-9]+(\.[0-9]*)?MG(\/[0-9]*ML| KIT|FLEX)?/i");
I realized it's similar to what Tim@SAS posted.

To add a bit more on the logic: whenever you are building a regular expression, try to break the pattern down into parts. For the data you provided:
1) Integer part, hence [0-9]+ (I come from a Perl background and am used to writing [0-9] instead of \d; the two are equivalent here).
2) Decimal part, which may or may not be present and is therefore optional, hence (\.[0-9]*)?
3) MG always follows directly after the integer or decimal part, hence MG
4) Suffix: a slash and an optional integer followed by ML, or KIT with a leading space, or FLEX. Any one of the three may appear, or none at all (e.g. 10MG), so the whole group is optional, hence (\/[0-9]*ML| KIT|FLEX)?
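
The breakdown above can be sketched as a DATA step that assembles the pattern from its four parts before compiling it. The variable names (int_part, dec_part, unit, suffix, pattern, pos, len) are hypothetical and chosen only for illustration; TRIM is used so that trailing blanks in the pattern variable are not passed to PRXPARSE.

```sas
data _null_;
   length pattern $ 80;
   retain re;
   infile datalines;
   input Drug_Strength_Name $ 1-9;

   if _n_ = 1 then do;
      int_part = '[0-9]+';                    /* 1) integer part          */
      dec_part = '(\.[0-9]*)?';               /* 2) optional decimal part */
      unit     = 'MG';                        /* 3) unit, always present  */
      suffix   = '(\/[0-9]*ML| KIT|FLEX)?';   /* 4) optional suffix       */
      /* CATS strips only leading/trailing blanks, so the blank in ' KIT' survives */
      pattern  = cats('/', int_part, dec_part, unit, suffix, '/i');
      re = prxparse(trim(pattern));
   end;

   call prxsubstr(re, Drug_Strength_Name, pos, len);
   if pos > 0 then do;
      str = substrn(Drug_Strength_Name, pos, len);
      put str;
   end;
datalines;
10MG/2ML
5MG/ML
12MG KIT
0.4MG
5MGFLEX
;
```

Keeping the parts separate makes it easier to change one piece (for example, adding another suffix) without re-reading the whole expression.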

Now, going back to the original regular expression that you coded:
RE = PRXPARSE ("/ \d{1,5}\.?\d{0,4}\?mg/i"). For the datalines written by Tim@SAS, your regular expression would not work. The reasons are:
1) PRXPARSE ("/ \d ... if you look closely, the regular expression requires a space before the integer. Maybe your data has that space; I'm not sure, which is why I'm pointing it out.
2) \d{0,4}\? requires a literal question mark, so it would match 0.4?MG but not 0.4MG. The \? should be removed, leaving \d{0,4}.
Even after fixing these two things, the match would cover only integer + decimal + MG (e.g. 10.4MG, 10MG and 0.4MG); the KIT, /2ML, /ML, etc. suffixes would still not be included. For that, check out the regular expression posted by me or by Tim@SAS.
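
To see the difference side by side, here is a small sketch that runs three patterns against a few of the sample values: the original one, the original with the leading space and the \? removed, and the full pattern with the optional suffix. The variable names (re_orig, re_fixed, re_full, hit_orig, fixed, full) are made up for this illustration.

```sas
data _null_;
   retain re_orig re_fixed re_full;
   infile datalines;
   input Drug_Strength_Name $ 1-9;

   if _n_ = 1 then do;
      re_orig  = prxparse('/ \d{1,5}\.?\d{0,4}\?mg/i');   /* as posted in the question  */
      re_fixed = prxparse('/\d{1,5}\.?\d{0,4}mg/i');      /* leading space and \? removed */
      re_full  = prxparse('/[0-9]+(\.[0-9]*)?MG(\/[0-9]*ML| KIT|FLEX)?/i');
   end;

   /* PRXMATCH returns the match position, or 0 when nothing matches */
   hit_orig = prxmatch(re_orig, Drug_Strength_Name);

   call prxsubstr(re_fixed, Drug_Strength_Name, pos, len);
   fixed = substrn(Drug_Strength_Name, pos, len);

   call prxsubstr(re_full, Drug_Strength_Name, pos, len);
   full = substrn(Drug_Strength_Name, pos, len);

   put Drug_Strength_Name= hit_orig= fixed= full=;
datalines;
10MG/2ML
12MG KIT
0.4MG
5MGFLEX
;
```

For a value such as 10MG/2ML, the repaired pattern should stop at 10MG, while the full pattern should pick up the /2ML suffix as well.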

Enjoy understanding the patterns!

Re: regular expression

Posted in reply to SushilNayak
Thanks Guys!

Re: regular expression

Posted in reply to SushilNayak
Actually I missed one: 5MG/ML. The RE needs to accept a slash followed by 0 or more digits followed by ML: "\/\d*ML". Here's the corrected version:

[pre]
data _null_;
retain re;
infile datalines;
input Drug_Strength_Name $ 1-9;

if _N_ = 1 then
re = prxparse("/\d+(\.\d+)?MG((\/\d*ML)| KIT|FLEX)?/i");
call prxsubstr(re,Drug_Strength_Name, start, length);
if start gt 0 then do;
str = substrn(Drug_Strength_Name, start, length);
put str;
end;
datalines;
10MG/2ML
5MG/ML
12MG KIT
0.4MG
24MG KIT
5MG
0.6MG
10MG/2ML
5MGFLEX
10MG
;;;;
[/pre]