SAS Programming

msauer · Posted 04-12-2022 06:59 AM

I am doing address normalization using a large number of regular expressions of the form

s/pattern/replace/

These regexes are stored as strings in a dataset with one row and several hundred columns, so that I can do

data want;
  set addresses;
  if _n_ = 1 then do;
    set rules;
    array rules [*] rules:;
  end;
  do i = 1 to dim(rules);
    address = prxchange(rules[i], -1, address);
  end;
run;

Each rule is constant, so I added the "o" modifier to compile each regex only once. With the modifier, however, only the first regex is compiled.

Consider the following minimal example to illustrate this.

data _null_;
  rules1 = "s/[-]/ /o";
  rules2 = "s/(\().*//o";
  array rules [*] rules:;
  name = "Berlin-Mitte (Germany)";
  with_array = upcase(name);
  do i = 1 to dim(rules);
    with_array = prxchange(rules[i], -1, with_array);
  end;
  no_array = upcase(name);
  no_array = prxchange(rules1, -1, no_array);
  no_array = prxchange(rules2, -1, no_array);
  put with_array=;
  put no_array=;
run;

This outputs

with_array=BERLIN MITTE (GERMANY) <<< should be BERLIN MITTE
no_array=BERLIN MITTE

so the second regex was not applied in the DO loop. If I omit the "o" modifier in the regex, everything works as expected.

What am I missing here?

maguiremq · Posted 04-12-2022 07:15 AM

Could you provide some example data? If your data doesn't contain any sensitive information, you can use this macro to convert your data set into a DATALINES statement.

https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/

msauer · Posted 04-12-2022 07:27 AM

The question includes a minimal example with two regular expressions and one "address" to illustrate the issue. No need for more data.

data_null__ · Posted 04-12-2022 08:59 AM

Use PRXPARSE

data _null_;
   if _n_ eq 1 then do;
      rules1 = prxparse("s/[-]/ /o");
      rules2 = prxparse("s/(\().*//o");
      array rules [*] rules:;
      retain rules:;
      end;
   name = "Berlin-Mitte (Germany)";
   with_array = upcase(name);
   do i = 1 to dim(rules);
      with_array = prxchange(rules[i], -1, with_array);
      end;

   no_array = upcase(name);
   no_array = prxchange("s/[-]/ /o", -1, no_array);
   no_array = prxchange("s/(\().*//o", -1, no_array);
   put with_array=;
   put no_array=;
   run;

with_array=BERLIN MITTE
no_array=BERLIN MITTE

msauer · Posted 04-12-2022 09:53 AM

The same issue occurs with PRXPARSE, too. PRXDEBUG only shows compiling the first regex.

data _null_;
  if _n_ eq 1 then do;
    rules1 = "s/[-]/ /o";
    rules2 = "s/(\().*//o";
    array rules [*] rules:;
    array rules_id [2];
    do i = 1 to dim(rules);
      rules_id[i] = prxparse(rules[i]);
    end;
    retain rules:;
  end;
  name = "Berlin-Mitte (Germany)";
  with_array = upcase(name);
  do i = 1 to dim(rules);
    with_array = prxchange(rules_id[i], -1, with_array);
  end;
  no_array = upcase(name);
  no_array = prxchange("s/[-]/ /o", -1, no_array);
  no_array = prxchange("s/(\().*//o", -1, no_array);
  put with_array=;
  put no_array=;
run;

Of course, I can omit the "o" modifier with this construct, since I explicitly compile each regex only once. But isn't the whole benefit of the modifier that this should not be required? At least the documentation says so:

This behavior simplifies the code because you do not need to use an initialization block (IF _N_ =1) to initialize Perl regular expressions.

FreelanceReinh · Posted 04-12-2022 12:01 PM

Hello @msauer,

As always, @data_null__'s solution is correct. Last month a similar issue was discussed in the thread "PRXMATCH not work in nested loop", where PGStats suggested an array holding the IDs of compiled patterns created with the PRXPARSE function. The issue is not your array itself, but the fact that different patterns are passed to the same PRXCHANGE or PRXPARSE call in the code (inside a DO loop), which conflicts with the "o" ("compile once") option.

Just for demonstration (not meant as a solution): Replace the DO loop in the "minimal example" of your initial post with two one-iteration loops and with_array will be updated correctly:

  do i = 1 to dim(rules)-1;
    with_array = prxchange(rules[i], -1, with_array);
  end;
  do i = 2 to dim(rules);
    with_array = prxchange(rules[i], -1, with_array);
  end;

As you said, you can safely omit the "o" option (i.e., remove the "o" either in dataset RULES or by adjusting the code) when the regular expressions are compiled only once anyway because of the "if _n_=1 ..." block and the PRXPARSE function. If the "o" stays in, the elements of the rules_id array will all contain the same value (1) rather than 1, 2, ... and hence represent only the first rule. You can insert

put rules_id[i] = ;

after the assignment statement rules_id[i] = ... to see the difference. I would define the rules_id array as _temporary_ (advantage: automatic RETAIN and DROP). The dimension of the array does not need to match dim(rules) exactly; any value greater than or equal to it (e.g., 9999) works, because the DO loops use dim(rules) as their end value.
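The suggestions above can be combined into a variant of the minimal example. This is only a sketch under assumptions: the rule strings are the same as in the first post but with the "o" modifier removed, and the array size 9999 is an arbitrary upper bound, not a required value.

```sas
data _null_;
  /* Same patterns as in the minimal example, but WITHOUT the "o" modifier */
  rules1 = "s/[-]/ /";
  rules2 = "s/(\().*//";
  array rules [*] rules1-rules2;

  /* Temporary array: retained automatically, never written to an output dataset. */
  /* 9999 is an assumed upper bound; it only has to be >= dim(rules).             */
  array rules_id [9999] _temporary_;

  /* Compile every rule exactly once, on the first iteration of the DATA step */
  if _n_ = 1 then do i = 1 to dim(rules);
    rules_id[i] = prxparse(rules[i]);
    put rules_id[i]=;   /* diagnostic: each rule should get its own pattern ID */
  end;

  name = "Berlin-Mitte (Germany)";
  with_array = upcase(name);

  /* Apply the compiled patterns by ID instead of by pattern string */
  do i = 1 to dim(rules);
    with_array = prxchange(rules_id[i], -1, with_array);
  end;
  put with_array=;
run;
```

With the "o" removed, the PUT statement in the initialization loop should report a different ID for each rule, and with_array should end up matching no_array from the earlier posts.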

I think the documentation means that the "o" option tells the compiler to compile the regular expression only once if it is in fact constant, yet provided as a variable (which in principle could change its value) in the code.

Simple example:

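data test(drop=ptn);
  /* Pattern is held in a variable but never changes, so "o" compiles it once */
  retain ptn 's/(C\w+) \w+ (Disease)/$1 $2/o';
  set sashelp.heart;
  length shortDC $16;
  /* Replace only the first match in DeathCause */
  shortDC = prxchange(ptn, 1, DeathCause);
run;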

Omitting the "o" increases the run time considerably (but it's still <1 second on my computer).
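
Applied to the setup from the first post, the same idea might look like the sketch below. It assumes that dataset RULES has one row of character columns named rules1, rules2, ... whose values no longer contain the "o" modifier, that there are at most 9999 of them, and that ADDRESSES contains a character variable address. None of these details are confirmed in the thread beyond what the first post shows.

```sas
data want;
  if _n_ = 1 then do;
    /* Read the single row of rule strings; SET variables are retained automatically */
    set rules;
    array rules [*] rules:;

    /* Pattern IDs live in a temporary array (assumed bound: at most 9999 rules) */
    array rules_id [9999] _temporary_;

    /* Compile each rule once; TRIM removes the trailing blanks that pad the column */
    do i = 1 to dim(rules);
      rules_id[i] = prxparse(trim(rules[i]));
    end;
  end;

  set addresses;

  /* Apply all compiled rules, in order, to every address */
  do i = 1 to dim(rules);
    address = prxchange(rules_id[i], -1, address);
  end;

  /* Keep only the address data in the output */
  drop i rules:;
run;
```

Reading RULES before ADDRESSES keeps the `rules:` variable list limited to the rule columns, since name-prefix lists only pick up variables that already exist at that point in the step.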

msauer · Posted 04-13-2022 01:24 AM

Thanks @FreelanceReinh for the detailed explanation.
