BookmarkSubscribeRSS Feed
Tom
Super User Tom
Super User

I don't understand how the macro processor comes into this.

Your RegEx code is running in a data step.

ChrisNZ
Tourmaline | Level 20

This is all you need:

NAMES[I] = prxchange("s/&BAD_PART|[^A-Z]+/ /", -1, trim(NAMES[I]));
NAMES[I] = left(compbl(NAMES[I])); 

The + and the trim() serve no purpose but to improve performance.

 

 

GingerJJ
Obsidian | Level 7

Thank you all that helped!

I'm really touched by so much support here!

GingerJJ
Obsidian | Level 7


%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE|DE LA | DE LA | DEL | JR|;

data test1;
infile datalines delimiter = "," truncover;
input c_ln :$50. c_fn :$50.;
datalines;
DUPLICATE-DO NOT USE, JOHN234
DO NOT USE (DUPLICATE), DUPLICATE RECORD
DO NOT USE DUPLICATE FILE, [DONOT USE]
DUPLICATE FILE DUPLICATE FILE, 'DONT USE'
(DUPLICATE),"DUPLICATE"
DE LA MARIA, MARIA JR.
MARIA DEL ROSA, MARIA DELROSA
MARIA DE LA ROSA, MARIA LAROSA
;
run;

DATA test1a;
SET test1;
ARRAY ORIGINAL [*] $ 40 C_LN C_FN;

ARRAY NAMES[*] $ 40 LN FN ;
DO I = 1 to 2;
NAMES [I]= ORIGINAL [I];

END;
DO I = 1 to 2;
/* turn anything that is NOT a letter into space*/
NAMES[I] = prxchange("s/[^A-Z]/ /",-1,NAMES[I]);
NAMES[I] = prxchange("s/&BAD_PART/ /",-1, left(trim(NAMES[I])));
NAMES[I] = STRIP(NAMES[I]);
NAMES[I] = COMPBL(NAMES[I]); 

END;
DROP I;
RUN;

 

The results for the last two lines are:

M A R I A R O S A       M A R I A D E L R O S A
M A R I A R O S A       M A R I A L A R O S A

 

But I still want a space after getting rid of "DE LA" or "DEL" if they are in the middle of the names.

 

Thanks.

 

GingerJJ
Obsidian | Level 7

Actually, there is a space in between each letter for some reason...

@maguiremq @Tom 

Tom
Super User Tom
Super User

@GingerJJ wrote:

Actually, there is a space in between each letter for some reason...

@maguiremq @Tom 


Because you told it to insert a space between each character in the first RegEx.

Remove the trailing | from your macro variable.

GingerJJ
Obsidian | Level 7

You mean "|" in this?

%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE|DE LA | DE LA | DEL | JR|;

 

If I remove |, how do I tell SAS these are separate parts with an "or" relationship?

GingerJJ
Obsidian | Level 7

I changed the macro to :

%LET BAD_PART=DUPLICATE,DO NOT USE,DONOT USE,FILE|RECORD,DO NOT ISSUE,DONT USE,DE LA, DE LA , DEL , JR,;

 

It didn't work.

@Tom 

Tom
Super User Tom
Super User

Work on fixing the conversion code without the complexity of the ARRAY.  Once you have that logic working then add the array back.

 

Create example data with possible inputs and expected output.  Then convert and test if the result is what you expected.

%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE| DE LA | DEL | JR$;
data example;
  infile datalines dlm='|' truncover;
  input have :$50. expect :$50.;
datalines;
DUPLICATE-DO NOT USE|
JOHN234|JOHN
DO NOT USE (DUPLICATE)|
DUPLICATE RECORD|
DO NOT USE DUPLICATE FILE|
[DONOT USE]|
DUPLICATE FILE DUPLICATE FILE|
'DONT USE'|
(DUPLICATE)|
"""DUPLICATE"""|
DE LA MARIA|DE LA MARIA
MARIA JR.|MARIA
MARIA DEL ROSA|MARIA ROSA
MARIA DELROSA|MARIA DELROSA
MARIA DE LA ROSA|MARIA ROSA
MARIA LAROSA|MARIA LAROSA
;

data out ;
  length status 8 try $50 ;
  set example ;
  * turn anything that is not a letter into single space ;
  try =  prxchange("s/[^a-z]+/ /i",-1,have);
  * convert bad strings into space ;
  try = prxchange("s/&bad_part/ /",-1, strip(try));
  * collapse to single space and remove leading spaces ;
  try= left(compbl(try));
  status= try=expect;
run;
proc print;
  title "&bad_part";
run;
title;

Results:

DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE| DE LA | DEL | JR$ 

Obs    status    try              have                             expect

  1       1                       DUPLICATE-DO NOT USE
  2       1      JOHN             JOHN234                          JOHN
  3       1                       DO NOT USE (DUPLICATE)
  4       1                       DUPLICATE RECORD
  5       1                       DO NOT USE DUPLICATE FILE
  6       1                       [DONOT USE]
  7       1                       DUPLICATE FILE DUPLICATE FILE
  8       1                       'DONT USE'
  9       1                       (DUPLICATE)
 10       1                       """DUPLICATE"""
 11       1      DE LA MARIA      DE LA MARIA                      DE LA MARIA
 12       1      MARIA            MARIA JR.                        MARIA
 13       1      MARIA ROSA       MARIA DEL ROSA                   MARIA ROSA
 14       1      MARIA DELROSA    MARIA DELROSA                    MARIA DELROSA
 15       1      MARIA ROSA       MARIA DE LA ROSA                 MARIA ROSA
 16       1      MARIA LAROSA     MARIA LAROSA                     MARIA LAROSA
GingerJJ
Obsidian | Level 7

It worked and the only changes I need to make are (colored below):

%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE|DE LA | DE LA | DEL | JR$;

NAMES[I] = prxchange("s/[^A-Z]/ /i",-1,NAMES[I]);

 

What is the i for? Adding $ is also critical here. So I guess I always need to specify if it is string?

@Tom 

Tom
Super User Tom
Super User

The key thing was removing the | from the END of the macro variable.  That caused you to add an empty string to the list of possible values so that you were essentially running this regular expression

s// /

Which says to replace nothing with a space, hence the spaces between every letter.

 

The i after the closing / means to ignore the case.  I couldn't stand reading your code in ALL CAPITALS so when I convert everything to lowercase I added that so it would still match your uppercase strings.

 

The $ has special meaning in regular expressions. I means match only at the END of the string.  Otherwise it would remove the JR from a longer string.   JR is probably not a common string to be the beginning of a longer name, but you never know.  Try it with and without the $. 

 

Note also that your current list of bad strings will remove the letters FILE from the middle of a longer name like DEFILESE.

 

The reason it does not match DE LA at the beginning of a name is because you have a leading space in your search pattern.

| DE LA |

You might need to perhaps be consistent and add a space around every search term and add spaces around the string to search.

%LET BAD_PART=DUPLICATE | DO NOT USE | DONOT USE | FILE | RECORD | DO NOT ISSUE | DONT USE | DE LA | DEL | JR $;
NAMES[I] = prxchange("s/ &bad_part/ /",-1,' ' || NAMES[I] || ' ');

But that means DE LA at the beginning of the string will also be removed.

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 25 replies
  • 1663 views
  • 8 likes
  • 4 in conversation