msauer
Obsidian | Level 7

I am doing address normalization using a large number of regular expressions of the form

s/pattern/replace/

These regexes are stored as strings in a dataset with one row and several hundred columns, so that I can do

data want;
  set addresses;
  if _n_ = 1 then do;
    set rules;
    array rules [*] rules:;
  end;
  do i = 1 to dim(rules);
    address = prxchange(rules[i], -1, address);
  end;
run;

Each rule is constant, so I added the "o" modifier to compile each regex only once. With the modifier, however, only the first regex is compiled.

 

Consider the following minimal example to illustrate this.

data _null_;
  rules1 = "s/[-]/ /o";
  rules2 = "s/(\().*//o";
  array rules [*] rules:;
  name = "Berlin-Mitte (Germany)";
  with_array = upcase(name);
  do i = 1 to dim(rules);
    with_array = prxchange(rules[i], -1, with_array);
  end;
  no_array = upcase(name);
  no_array = prxchange(rules1, -1, no_array);
  no_array = prxchange(rules2, -1, no_array);
  put with_array=;
  put no_array=;
run;

This outputs

with_array=BERLIN MITTE (GERMANY) <<< should be BERLIN MITTE
no_array=BERLIN MITTE

so the second regex was not applied in the DO loop. If I omit the "o" modifier from the regexes, everything works as expected.

 

What am I missing here?

1 ACCEPTED SOLUTION

FreelanceReinh
Jade | Level 19

Hallo @msauer,

 

As always, @data_null__'s solution is correct. Last month a similar issue was discussed in the thread "PRXMATCH not work in nested loop", where PGStats suggested using an array of IDs of compiled patterns created with the PRXPARSE function. The issue is not your array itself, but the fact that different patterns pass through the same call of the PRXCHANGE or PRXPARSE function in the code (inside a DO loop), which conflicts with the "o" ("compile once") option.

 

Just for demonstration (not meant as a solution): Replace the DO loop in the "minimal example" of your initial post with two one-iteration loops and with_array will be updated correctly:

  do i = 1 to dim(rules)-1;
    with_array = prxchange(rules[i], -1, with_array);
  end;
  do i = 2 to dim(rules);
    with_array = prxchange(rules[i], -1, with_array);
  end;

 

As you said, you can safely omit the "o" option (i.e., remove the "o" either in dataset RULES or by adjusting the code), because the regular expressions are compiled only once anyway, thanks to the "if _n_=1 ..." block and the PRXPARSE function. If you keep the "o", the elements of the rules_id array will all contain the same value (1) rather than 1, 2, ... and hence represent only the first rule. You can insert

put rules_id[i] = ;

after the assignment statement rules_id[i] = ... to see the difference. I would define the rules_id array as _temporary_ (advantage: automatic RETAIN and DROP). The dimension of the array does not need to match dim(rules) exactly as long as it's greater than or equal to that value, e.g. 9999. The DO loops will use dim(rules) as their end value.
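To make the suggestion above concrete, here is a minimal sketch of the question's example rewritten with a _temporary_ rules_id array and the "o" removed from both rules. The helper variable id exists only so the PUT statement can show the pattern IDs; it is an addition for illustration, not part of the original code.

```sas
data _null_;
  /* Rules stored WITHOUT the "o" modifier (assumption of this sketch) */
  rules1 = "s/[-]/ /";
  rules2 = "s/(\().*//";
  array rules [*] rules:;

  /* Temporary array: retained automatically, never written to output.
     Its dimension only has to be >= dim(rules). */
  array rules_id [9999] _temporary_;

  /* Compile every rule exactly once, on the first iteration */
  if _n_ = 1 then do;
    do i = 1 to dim(rules);
      rules_id[i] = prxparse(rules[i]);
      id = rules_id[i];   /* helper variable, only for the PUT below */
      put i= id=;         /* each rule should get its own pattern ID */
    end;
  end;

  name = "Berlin-Mitte (Germany)";
  with_array = upcase(name);
  do i = 1 to dim(rules);
    with_array = prxchange(rules_id[i], -1, with_array);
  end;
  put with_array=;
run;
```

If you put the "o" back into the two rule strings, the PUT inside the first loop should show the same ID for both rules, which is the behavior described above.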

 

I think the documentation means that the "o" option tells the compiler to compile the regular expression only once if it is in fact constant, yet provided as a variable (which in principle could change its value) in the code.

 

Simple example:

data test(drop=ptn);
retain ptn 's/(C\w+) \w+ (Disease)/$1 $2/o';
set sashelp.heart;
length shortDC $16;
shortDC=prxchange(ptn,1,DeathCause);
run;

Omitting the "o" increases the run time considerably (but it's still <1 second on my computer).
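Applied to the data step from the original question, the same idea might look like the sketch below. It assumes that dataset RULES holds one row with variables rules1, rules2, ... whose values no longer end in the "o" modifier, and that there are at most 9999 rules. The DROP= option is my own addition to keep the rule columns out of WANT.

```sas
data want(drop=rules: i);
  if _n_ = 1 then do;
    /* Read the single row of rules once; SET variables are retained */
    set rules;
    array rules [*] rules:;
    /* Pattern IDs live in a temporary array (retained, not output) */
    array rules_id [9999] _temporary_;
    do i = 1 to dim(rules);
      rules_id[i] = prxparse(rules[i]);   /* compile once, keep the ID */
    end;
  end;

  set addresses;

  /* Apply every precompiled rule to the current address */
  do i = 1 to dim(rules);
    address = prxchange(rules_id[i], -1, address);
  end;
run;
```

Declaring the rules array before the second SET statement means the rules: prefix list can only pick up variables from RULES, even if ADDRESSES happens to contain a variable whose name starts with "rules".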


6 REPLIES
maguiremq
SAS Super FREQ

Could you provide some example data? If your data doesn't contain any sensitive information, you can use this macro to convert your data set into a DATALINES statement.

 

https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/

msauer
Obsidian | Level 7
The question includes a minimal example with two regular expressions and one "address" to illustrate the issue. No need for more data.
data_null__
Jade | Level 19

Use PRXPARSE

 

data _null_;
   if _n_ eq 1 then do;
      rules1 = prxparse("s/[-]/ /o");
      rules2 = prxparse("s/(\().*//o");
      array rules [*] rules:;
      retain rules:;
      end;
   name = "Berlin-Mitte (Germany)";
   with_array = upcase(name);
   do i = 1 to dim(rules);
      with_array = prxchange(rules[i], -1, with_array);
      end;

   no_array = upcase(name);
   no_array = prxchange("s/[-]/ /o", -1, no_array);
   no_array = prxchange("s/(\().*//o", -1, no_array);
   put with_array=;
   put no_array=;
   run;

with_array=BERLIN MITTE
no_array=BERLIN MITTE
msauer
Obsidian | Level 7

The same issue occurs with PRXPARSE. PRXDEBUG shows only the first regex being compiled.

data _null_;
  if _n_ eq 1 then do;
    rules1 = "s/[-]/ /o";
    rules2 = "s/(\().*//o";
    array rules [*] rules:;
    array rules_id [2];
    do i = 1 to dim(rules);
      rules_id[i] = prxparse(rules[i]);
    end;
    retain rules:;
  end;
  name = "Berlin-Mitte (Germany)";
  with_array = upcase(name);
  do i = 1 to dim(rules);
    with_array = prxchange(rules_id[i], -1, with_array);
  end;
  no_array = upcase(name);
  no_array = prxchange("s/[-]/ /o", -1, no_array);
  no_array = prxchange("s/(\().*//o", -1, no_array);
  put with_array=;
  put no_array=;
run;

Of course, I can omit the "o" modifier with this construct, since I explicitly compile each regex only once. But isn't the whole benefit of the modifier that this should not be required? At least the documentation says so:

 This behavior simplifies the code because you do not need to use an initialization block (IF _N_ =1) to initialize Perl regular expressions.

msauer
Obsidian | Level 7

Thanks @FreelanceReinh for the detailed explanation.
