Re: prxmatch() regular expression remove roman numbers

starosto · Posted 07-04-2023 06:41 AM

Hello,

I would like to receive two variables. One with TEXT1 and the other one with TEXT2.

Below code does not work.

Reason = strip(prxchange("s/(.*?)[:,](.*)/$1/", -1 , strip(Value)));
ReasonSp = strip(prxchange("s/(.*?)[:,](.*)/$2/", -1, strip(Value)));

Could you please help?

This is my Value variable:

0. TEXT1.

I. TEXT1, i.e. TEXT2.
II.a. TEXT1: TEXT2.
II.b. TEXT1: TEXT2.
III.a. TEXT1: TEXT2.
III.b. TEXT1: TEXT2.
III.c. TEXT1: TEXT2.
IV. TEXT1.
V.a. TEXT1: TEXT2.
V.b. TEXT1: TEXT2.

Oligolas · Posted 07-04-2023 07:54 AM

do you expect this?

data have;
length Value $200;
Value="0. TEXT1.";output;
Value="I. TEXT1, i.e. TEXT2.";output;
Value="II.a. TEXT1: TEXT2.";output;
Value="II.b. TEXT1: TEXT2.";output;
Value="III.a. TEXT1: TEXT2.";output;
Value="III.b. TEXT1: TEXT2.";output;
Value="III.c. TEXT1: TEXT2.";output;
Value="IV. TEXT1.";output;
Value="V.a. TEXT1: TEXT2.";output;
Value="V.b. TEXT1: TEXT2.";output;
run;
data want;
   set have;
   length Reason ReasonSp $200;
   Reason   = strip(prxchange("s/([^:,]*?)[:,](.*)/$1/", -1 , strip(Value)));
   if prxmatch('/.*[:,].*/',Value) then ReasonSp = strip(prxchange("s/.*[:,](.+)/$1/", -1, strip(Value)));
run;

________________________
- Cheers -

Oligolas · Posted 07-05-2023 04:00 AM

or maybe this:

data want;
   set have;
   length Reason ReasonSp $200;
   Reason   = strip(prxchange("s/^[0MDCLXVI]+[\.\w]+\s+([^:,]*).*/$1/", -1 , strip(Value)));
   if prxmatch('/.*[:,].*/',Value) then ReasonSp = strip(prxchange("s/.*[:,](.+)/$1/", -1, strip(Value)));
run;

You'll find more restrictive matching of roman numerals here

________________________
- Cheers -

starosto · Posted 07-05-2023 12:34 PM

Almost ok, I do not want to have i.e. and also I do not wan to have dots at the end. Thanks! 🙂

s_lassen · Posted 07-05-2023 09:04 AM

I would not use PRX for that, just SCAN:

data want;
   set have;
   reason=scan(scan(value,2,' '),1,':,.');
   if countw(value,' ')>2 then 
     ReasonSp=scan(scan(value,-1,' '),1,':,.');
run;

The reason the ReasonSp calculation is a bit more complicated is to account for the "i.e." in the second observation, otherwise it could be calculated by taking the third word - but instead we have go get the last word, but only if there are more than 2 words.

starosto · Posted 07-05-2023 12:54 PM

Thank you! This is not universal - it depends on number of words. Under TEXT1 or TEXT2 I could have the whole sentence.

Oligolas · Posted 07-06-2023 10:36 AM

It think the best way would be to perform the final cleaning in a second step.

But depending on

how much data cleaning you need to finally get TEXT1 and TEXT2
wether you want to check the conformity of the syntax or not
your facility to re-understand, debug or refine regex you wrote a time ago

you may want to come up with a solution like this:


data want;
   set have;
   length Reason ReasonSp $200;
   *Roughly extract TEXT1 and TEXT2;
   Reason   = strip(substrn(scan(value,1,':,'),find(Value,' ')));
   ReasonSp = scan(value,2,':,');

   *Data cleaning;
   array checkNClean Reason ReasonSp;
   do over checkNClean;
      *Remove trailing dot;
      if substrn(reverse(strip(checkNClean)),1,1) eq '.' then checkNClean=reverse(substrn(reverse(strip(checkNClean)),2));
      *Remove leading i.e.;
      if substrn(strip(checkNClean),1,4) eq 'i.e.' then checkNClean=strip(substrn(strip(checkNClean),5));
   end;
run;
proc print;run;

________________________
- Cheers -

ballardw · Posted 07-11-2023 06:15 PM

@starosto wrote:

Thank you! This is not universal - it depends on number of words. Under TEXT1 or TEXT2 I could have the whole sentence.

GIGO

Incomplete problem description yields incomplete solutions.

Since C, D, L and M are also "Roman Numerals" you may also need much much more description of your data and the rules involved.

Register Today!

SAS Training: Just a Click Away