Re: prxchange with macro - Page 2

Tom · Posted 08-20-2021 04:30 PM

I don't understand how the macro processor comes into this.

Your RegEx code is running in a data step.

ChrisNZ · Posted 08-22-2021 08:30 PM

This is all you need:

NAMES[I] = prxchange("s/&BAD_PART|[^A-Z]+/ /", -1, trim(NAMES[I]));
NAMES[I] = left(compbl(NAMES[I]));

The + and the trim() serve no purpose but to improve performance.

High-Performance SAS Coding - Third Edition

GingerJJ · Posted 08-24-2021 11:05 AM

Thank you all that helped!

I'm really touched by so much support here!

GingerJJ · Posted 08-27-2021 01:44 PM

data test1;
infile datalines delimiter = "," truncover;
input c_ln :$50. c_fn :$50.;
datalines;
DUPLICATE-DO NOT USE, JOHN234
DO NOT USE (DUPLICATE), DUPLICATE RECORD
DO NOT USE DUPLICATE FILE, [DONOT USE]
DUPLICATE FILE DUPLICATE FILE, 'DONT USE'
(DUPLICATE),"DUPLICATE"
DE LA MARIA, MARIA JR.
MARIA DEL ROSA, MARIA DELROSA
MARIA DE LA ROSA, MARIA LAROSA
;
run;

DATA test1a;
SET test1;
ARRAY ORIGINAL [*] $ 40 C_LN C_FN;

ARRAY NAMES[*] $ 40 LN FN ;
DO I = 1 to 2;
NAMES [I]= ORIGINAL [I];

END;
DO I = 1 to 2;
/* turn anything that is NOT a letter into space*/
NAMES[I] = prxchange("s/[^A-Z]/ /",-1,NAMES[I]);
NAMES[I] = prxchange("s/&BAD_PART/ /",-1, left(trim(NAMES[I])));
NAMES[I] = STRIP(NAMES[I]);
NAMES[I] = COMPBL(NAMES[I]);

END;
DROP I;
RUN;

The results for the last two lines are:

M A R I A R O S A M A R I A D E L R O S A
M A R I A R O S A M A R I A L A R O S A

But I still want a space after getting rid of "DE LA" or "DEL" if they are in the middle of the names.

Thanks.

GingerJJ · Posted 08-27-2021 02:46 PM

Actually, there is a space in between each letter for some reason...

@maguiremq @Tom

Tom · Posted 08-27-2021 03:30 PM

@GingerJJ wrote:

Actually, there is a space in between each letter for some reason...

@maguiremq @Tom

Because you told it to insert a space between each character in the first RegEx.

Remove the trailing | from your macro variable.

GingerJJ · Posted 08-27-2021 03:41 PM

You mean "|" in this?

If I remove |, how do I tell SAS these are separate parts with an "or" relationship?

GingerJJ · Posted 08-27-2021 03:49 PM

I changed the macro to :

%LET BAD_PART=DUPLICATE,DO NOT USE,DONOT USE,FILE|RECORD,DO NOT ISSUE,DONT USE,DE LA, DE LA , DEL , JR,;

It didn't work.

@Tom

Tom · Posted 08-27-2021 03:57 PM

Work on fixing the conversion code without the complexity of the ARRAY. Once you have that logic working then add the array back.

Create example data with possible inputs and expected output. Then convert and test if the result is what you expected.

%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE| DE LA | DEL | JR$;
data example;
  infile datalines dlm='|' truncover;
  input have :$50. expect :$50.;
datalines;
DUPLICATE-DO NOT USE|
JOHN234|JOHN
DO NOT USE (DUPLICATE)|
DUPLICATE RECORD|
DO NOT USE DUPLICATE FILE|
[DONOT USE]|
DUPLICATE FILE DUPLICATE FILE|
'DONT USE'|
(DUPLICATE)|
"""DUPLICATE"""|
DE LA MARIA|DE LA MARIA
MARIA JR.|MARIA
MARIA DEL ROSA|MARIA ROSA
MARIA DELROSA|MARIA DELROSA
MARIA DE LA ROSA|MARIA ROSA
MARIA LAROSA|MARIA LAROSA
;

data out ;
  length status 8 try $50 ;
  set example ;
  * turn anything that is not a letter into single space ;
  try =  prxchange("s/[^a-z]+/ /i",-1,have);
  * convert bad strings into space ;
  try = prxchange("s/&bad_part/ /",-1, strip(try));
  * collapse to single space and remove leading spaces ;
  try= left(compbl(try));
  status= try=expect;
run;
proc print;
  title "&bad_part";
run;
title;

Results:

DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE| DE LA | DEL | JR$ 

Obs    status    try              have                             expect

  1       1                       DUPLICATE-DO NOT USE
  2       1      JOHN             JOHN234                          JOHN
  3       1                       DO NOT USE (DUPLICATE)
  4       1                       DUPLICATE RECORD
  5       1                       DO NOT USE DUPLICATE FILE
  6       1                       [DONOT USE]
  7       1                       DUPLICATE FILE DUPLICATE FILE
  8       1                       'DONT USE'
  9       1                       (DUPLICATE)
 10       1                       """DUPLICATE"""
 11       1      DE LA MARIA      DE LA MARIA                      DE LA MARIA
 12       1      MARIA            MARIA JR.                        MARIA
 13       1      MARIA ROSA       MARIA DEL ROSA                   MARIA ROSA
 14       1      MARIA DELROSA    MARIA DELROSA                    MARIA DELROSA
 15       1      MARIA ROSA       MARIA DE LA ROSA                 MARIA ROSA
 16       1      MARIA LAROSA     MARIA LAROSA                     MARIA LAROSA

GingerJJ · Posted 08-27-2021 04:18 PM

It worked and the only changes I need to make are (colored below):

NAMES[I] = prxchange("s/[^A-Z]/ /i",-1,NAMES[I]);

What is the i for? Adding $ is also critical here. So I guess I always need to specify if it is string?

@Tom

Tom · Posted 08-27-2021 04:57 PM

The key thing was removing the | from the END of the macro variable. That caused you to add an empty string to the list of possible values so that you were essentially running this regular expression

s// /

Which says to replace nothing with a space, hence the spaces between every letter.

The i after the closing / means to ignore the case. I couldn't stand reading your code in ALL CAPITALS so when I convert everything to lowercase I added that so it would still match your uppercase strings.

The $ has special meaning in regular expressions. I means match only at the END of the string. Otherwise it would remove the JR from a longer string. JR is probably not a common string to be the beginning of a longer name, but you never know. Try it with and without the $.

Note also that your current list of bad strings will remove the letters FILE from the middle of a longer name like DEFILESE.

The reason it does not match DE LA at the beginning of a name is because you have a leading space in your search pattern.

| DE LA |

You might need to perhaps be consistent and add a space around every search term and add spaces around the string to search.

%LET BAD_PART=DUPLICATE | DO NOT USE | DONOT USE | FILE | RECORD | DO NOT ISSUE | DONT USE | DE LA | DEL | JR $;
NAMES[I] = prxchange("s/ &bad_part/ /",-1,' ' || NAMES[I] || ' ');

But that means DE LA at the beginning of the string will also be removed.

SAS Innovate 2025: Call for Content

Classroom Training Available!