I am doing address normalization using a large number of regular expressions of the form
s/pattern/replace/
These regexes are stored as strings in a dataset with one row and several hundred columns, so that I can do
data want;
  set addresses;
  if _n_ = 1 then do;
    set rules;
    array rules [*] rules:;
  end;
  do i = 1 to dim(rules);
    address = prxchange(rules[i], -1, address);
  end;
run;
Each rule is constant, so I added the "o" modifier to compile each regex only once. With the modifier, however, only the first regex is compiled.
Consider the following minimal example to illustrate this.
data _null_;
rules1 = "s/[-]/ /o";
rules2 = "s/(\().*//o";
array rules [*] rules:;
name = "Berlin-Mitte (Germany)";
with_array = upcase(name);
do i = 1 to dim(rules);
with_array = prxchange(rules[i], -1, with_array);
end;
no_array = upcase(name);
no_array = prxchange(rules1, -1, no_array);
no_array = prxchange(rules2, -1, no_array);
put with_array=;
put no_array=;
run;
This outputs
with_array=BERLIN MITTE (GERMANY) <<< should be BERLIN MITTE
no_array=BERLIN MITTE
where the second regex was not applied in the DO loop. If I omit the "o" modifier in the regex, everything works as expected.
What am I missing here?
Accepted Solutions
Hello @msauer,
As always, @data_null__'s solution is correct. Last month a similar issue was discussed in "PRXMATCH not work in nested loop", where it was PGStats who suggested the array of IDs of compiled patterns created with the PRXPARSE function. The issue is not your array itself, but the fact that varying patterns are passed to the same call of the PRXCHANGE or PRXPARSE function in the code (inside a DO loop), which conflicts with the use of the "o" ("compile once") option.
Just for demonstration (not meant as a solution): Replace the DO loop in the "minimal example" of your initial post with two one-iteration loops and with_array will be updated correctly:
do i = 1 to dim(rules)-1;
with_array = prxchange(rules[i], -1, with_array);
end;
do i = 2 to dim(rules);
with_array = prxchange(rules[i], -1, with_array);
end;
As you said, you can safely omit the "o" option (i.e., remove the "o" either in dataset RULES or by adjustments to the code) when the regular expressions are compiled only once anyway because of the "if _n_=1 ..." and the PRXPARSE function. If the "o" is kept, the elements of the rules_id array will all contain the same value (1) rather than 1, 2, ... and hence only represent the first rule. You can insert
put rules_id[i] = ;
after the assignment statement rules_id[i] = ... to see the difference. I would define the rules_id array as _temporary_ (advantage: automatic RETAIN and DROP). The dimension of the array does not need to match dim(rules) exactly as long as it's greater than or equal to that value, e.g. 9999. The DO loops will use dim(rules) as their end value.
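Putting these suggestions together, a minimal sketch could look like the following. It assumes the dataset names RULES, ADDRESSES and WANT from the original question; the sample rows, variable lengths, and the dimension 9999 are made up for illustration, and the rules are stored without the "o" modifier.

```sas
/* Hypothetical one-row rules dataset (rules stored without "o") */
data rules;
  length rules1 rules2 $40;
  rules1 = 's/[-]/ /';
  rules2 = 's/(\().*//';
run;

/* Hypothetical sample addresses */
data addresses;
  length address $60;
  address = 'BERLIN-MITTE (GERMANY)';   output;
  address = 'HAMBURG-ALTONA (GERMANY)'; output;
run;

data want(drop=i rules:);
  set addresses;
  /* _temporary_ elements are retained and never written to WANT;  */
  /* the dimension only has to be at least the number of rules     */
  array rules_id [9999] _temporary_;
  if _n_ = 1 then do;
    set rules;                            /* read the rule strings once */
    array rules [*] rules:;
    do i = 1 to dim(rules);
      rules_id[i] = prxparse(rules[i]);   /* compile each rule once */
    end;
  end;
  do i = 1 to dim(rules);
    address = prxchange(rules_id[i], -1, address);
  end;
run;
```

Because every rule is compiled explicitly in the `if _n_ = 1` block, the "o" modifier is not needed here, and each element of rules_id should hold its own pattern ID.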
I think the documentation means that the "o" option tells the compiler to compile the regular expression only once if it is in fact constant, yet provided as a variable (which in principle could change its value) in the code.
Simple example:
data test(drop=ptn);
retain ptn 's/(C\w+) \w+ (Disease)/$1 $2/o';
set sashelp.heart;
length shortDC $16;
shortDC=prxchange(ptn,1,DeathCause);
run;
Omitting the "o" increases the run time considerably (but it's still <1 second on my computer).
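For comparison, here is a sketch of the same step without the "o" option, compiling the pattern explicitly in an initialization block instead. The variable name rx is an assumption introduced for illustration; the output should match the example above.

```sas
data test(drop=rx);
  retain rx;                       /* keep the pattern ID across iterations */
  if _n_ = 1 then
    rx = prxparse('s/(C\w+) \w+ (Disease)/$1 $2/');  /* compile once, no "o" needed */
  set sashelp.heart;
  length shortDC $16;
  shortDC = prxchange(rx, 1, DeathCause);  /* pass the ID instead of the pattern string */
run;
```

This is the initialization block that the "o" option is meant to make unnecessary when a single constant pattern is used per function call.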
Could you provide some example data? If your data doesn't contain any sensitive information, you can use this macro to convert your data set into a DATALINES statement.
https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/
Use PRXPARSE
data _null_;
if _n_ eq 1 then do;
rules1 = prxparse("s/[-]/ /o");
rules2 = prxparse("s/(\().*//o");
array rules [*] rules:;
retain rules:;
end;
name = "Berlin-Mitte (Germany)";
with_array = upcase(name);
do i = 1 to dim(rules);
with_array = prxchange(rules[i], -1, with_array);
end;
no_array = upcase(name);
no_array = prxchange("s/[-]/ /o", -1, no_array);
no_array = prxchange("s/(\().*//o", -1, no_array);
put with_array=;
put no_array=;
run;
with_array=BERLIN MITTE
no_array=BERLIN MITTE
The same issue occurs with PRXPARSE, too. PRXDEBUG only shows compiling the first regex.
data _null_;
if _n_ eq 1 then do;
rules1 = "s/[-]/ /o";
rules2 = "s/(\().*//o";
array rules [*] rules:;
array rules_id [2];
do i = 1 to dim(rules);
rules_id[i] = prxparse(rules[i]);
end;
retain rules:;
end;
name = "Berlin-Mitte (Germany)";
with_array = upcase(name);
do i = 1 to dim(rules);
with_array = prxchange(rules_id[i], -1, with_array);
end;
no_array = upcase(name);
no_array = prxchange("s/[-]/ /o", -1, no_array);
no_array = prxchange("s/(\().*//o", -1, no_array);
put with_array=;
put no_array=;
run;
Of course, I can omit the "o" modifier with this construct, since I explicitly compile each regex only once. But isn't the whole benefit of the modifier that this should not be required? At least the documentation says so:
This behavior simplifies the code because you do not need to use an initialization block (IF _N_ =1) to initialize Perl regular expressions.
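To reproduce the PRXDEBUG observation mentioned above, a minimal sketch might look like this. It reuses the two rules from the example and adds a PUT statement (an addition for illustration) to print the pattern IDs returned by PRXPARSE.

```sas
data _null_;
  call prxdebug(1);                   /* write regex debug output to the log */
  rules1 = "s/[-]/ /o";
  rules2 = "s/(\().*//o";
  array rules [*] rules:;
  array rules_id [2];
  do i = 1 to dim(rules);
    rules_id[i] = prxparse(rules[i]); /* one call site, varying pattern, "o" set */
    put i= rules_id[i]=;              /* show which pattern ID each rule received */
  end;
  call prxdebug(0);                   /* switch debug output off again */
run;
```

If both elements of rules_id show the same ID, the second pattern was not compiled separately; removing the "o" from the rule strings should yield distinct IDs.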
Thanks @FreelanceReinh for the detailed explanation.