I want to extract text from a string using multiple patterns. I am getting the error "The PRXPARSE function call does not have enough arguments.".
Accepted Solutions
Something like below should do.
I've modified your regular expressions by adding the word-boundary metacharacter \b, so that your 2nd regex no longer matches the "pl" in "Maple Street".
data have;
  input street $80.;
  datalines;
Bldg A 153 First Street
6789 64th Ave
4 Moritz Road
7493 Wilkes Place
711 Maple Street
;
run;

data patterns;
  input regex :$100.;
  datalines;
m/\d+\s[a-z]+\s[a-z]+/i
m/\b(Pl|place)\b/i
m/\b(rd|road)\b/i
m/\b(ave|avenue)\b/i
;
run;

data _null_;
  call symputx('n_patterns',nobs);
  stop;
  set patterns nobs=nobs;
run;

data want;
  set have;
  if _n_=1 then
    do;
      array expr_id {&n_patterns} _temporary_;
      do i=1 by 1 until(last);
        set patterns end=last;
        expr_id[i]=prxparse(strip(regex));
      end;
      /* create variable match with same length as variable street */
      if 0 then match=street;
      length matchtype $8;
    end;
  do i=1 to dim(expr_id);
    call prxsubstr(expr_id[i], street, position, length);
    if position> 0 then
      do;
        match=substr(street, position, length);
        matchtype=cats('pattern', i);
        output;
      end;
  end;
  drop regex i;
run;

proc print data=want;
run;
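To see what the word-boundary change does, here is a minimal sketch that tests the 2nd pattern with and without \b against the Maple Street address. It is only an illustration: it passes the pattern text straight to PRXMATCH instead of going through PRXPARSE, and the variable names loose and strict are made up for this example.

```sas
data _null_;
  street = '711 Maple Street';
  /* Unanchored alternation: "pl" may match anywhere, including inside "Maple" */
  loose  = prxmatch('m/Pl|place/i', street);
  /* With \b the alternatives must stand alone as whole words */
  strict = prxmatch('m/\b(Pl|place)\b/i', street);
  put loose= strict=;
run;
```

PRXMATCH returns the position of the first match, or 0 when there is none, so loose should come back non-zero (the "pl" inside "Maple") while strict should be 0.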
Don't use macro language if not necessary. It only makes debugging harder.
data have;
  input street $80.;
  datalines;
Bldg A 153 First Street
6789 64th Ave
4 Moritz Road
7493 Wilkes Place
;
run;

data want;
  set have;
  if _n_=1 then
    do;
      pattern1="m/\d+\s[a-z]+\s[a-z]+/i";
      pattern2="m/Pl|place/i";
      pattern3="m/rd|road/i";
      pattern4="m/ave|avenue/i";
      array patterns{4} pattern1 - pattern4;
      array expr_id {4} _temporary_;
      do i=1 to dim(patterns);
        expr_id[i]=prxparse(patterns[i]);
      end;
      length matchtype $8;
      /* create variable match with same length as variable street */
      if 0 then match=street;
    end;
  do i=1 to dim(patterns);
    call prxsubstr(expr_id[i], street, position, length);
    if position> 0 then
      do;
        match=substr(street, position, length);
        matchtype=cats('pattern', i);
        output;
      end;
  end;
  drop Pattern: i;
run;

proc print data=want;
run;
Or even shorter:
data want;
  set have;
  if _n_=1 then
    do;
      array expr_id {4} _temporary_;
      expr_id[1]=prxparse("m/\d+\s[a-z]+\s[a-z]+/i");
      expr_id[2]=prxparse("m/Pl|place/i");
      expr_id[3]=prxparse("m/rd|road/i");
      expr_id[4]=prxparse("m/ave|avenue/i");
      /* create variable match with same length as variable street */
      if 0 then match=street;
      length matchtype $8;
    end;
  do i=1 to dim(expr_id);
    call prxsubstr(expr_id[i], street, position, length);
    if position> 0 then
      do;
        match=substr(street, position, length);
        matchtype=cats('pattern', i);
        output;
      end;
  end;
  drop i;
run;
Hi Patrick
Thanks for your prompt reply. Is there any way you can separate the pattern and prxsubstr code into two data steps? I want to use the same patterns for multiple data sets. Thanks a lot.
You can avoid creating the macro variable that holds the number of patterns. Just make the array large enough for the maximum number of patterns you ever expect to handle.
This example is set up to handle 9,999 patterns, but even 99,999 or more should not cause any trouble. Just make sure to adjust both the array size and the length of the MATCHTYPE variable. (Or just keep the numeric loop counter variable instead.)
data want;
  set have;
  * Create variable match with same length as variable street ;
  if 0 then match=street;
  * Set length of MATCHTYPE long enough for up to 9999 patterns ;
  length matchtype $11;
  * Make array large enough for 9999 patterns ;
  array expr_id [9999] _temporary_;
  if _n_=1 then do pattern=1 to nobs;
    * Parse regex patterns into array ;
    set patterns nobs=nobs;
    expr_id[pattern]=prxparse(strip(regex));
  end;
  * Output any matches ;
  do pattern=1 to nobs;
    call prxsubstr(expr_id[pattern], street, position, length);
    if position> 0 then do;
      match=substr(street, position, length);
      matchtype=cats('pattern', pattern);
      output;
    end;
  end;
  drop regex pattern;
run;
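The parenthetical suggestion above, keeping the numeric loop counter instead of building MATCHTYPE, might look roughly like the sketch below. It assumes the HAVE and PATTERNS data sets from earlier in the thread already exist; the variable name pattern_no is made up for this example.

```sas
data want;
  set have;
  * Create variable match with same length as variable street ;
  if 0 then match=street;
  * Make array large enough for 9999 patterns ;
  array expr_id [9999] _temporary_;
  if _n_=1 then do pattern_no=1 to nobs;
    * Parse regex patterns into array ;
    set patterns nobs=nobs;
    expr_id[pattern_no]=prxparse(strip(regex));
  end;
  * Output any matches and keep the numeric pattern index ;
  do pattern_no=1 to nobs;
    call prxsubstr(expr_id[pattern_no], street, position, length);
    if position > 0 then do;
      match=substr(street, position, length);
      output;
    end;
  end;
  * PATTERN_NO is not dropped, so it identifies the matching pattern ;
  drop regex;
run;
```

With this variant only the array size needs to change when the maximum number of patterns grows, because there is no character length to keep in sync.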
@Tom Sure, that will work as well, but I can't see the harm in an additional simple data _null_ step that won't iterate through the data.