I don't understand how the macro processor comes into this.
Your RegEx code is running in a data step.
This is all you need:
NAMES[I] = prxchange("s/&BAD_PART|[^A-Z]+/ /", -1, trim(NAMES[I]));
NAMES[I] = left(compbl(NAMES[I]));
The + and the trim() serve no purpose but to improve performance.
Thank you all that helped!
I'm really touched by so much support here!
%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE|DE LA | DE LA | DEL | JR|;
data test1;
infile datalines delimiter = "," truncover;
input c_ln :$50. c_fn :$50.;
datalines;
DUPLICATE-DO NOT USE, JOHN234
DO NOT USE (DUPLICATE), DUPLICATE RECORD
DO NOT USE DUPLICATE FILE, [DONOT USE]
DUPLICATE FILE DUPLICATE FILE, 'DONT USE'
(DUPLICATE),"DUPLICATE"
DE LA MARIA, MARIA JR.
MARIA DEL ROSA, MARIA DELROSA
MARIA DE LA ROSA, MARIA LAROSA
;
run;
DATA test1a;
SET test1;
ARRAY ORIGINAL [*] $ 40 C_LN C_FN;
ARRAY NAMES[*] $ 40 LN FN ;
DO I = 1 to 2;
NAMES [I]= ORIGINAL [I];
END;
DO I = 1 to 2;
/* turn anything that is NOT a letter into space*/
NAMES[I] = prxchange("s/[^A-Z]/ /",-1,NAMES[I]);
NAMES[I] = prxchange("s/&BAD_PART/ /",-1, left(trim(NAMES[I])));
NAMES[I] = STRIP(NAMES[I]);
NAMES[I] = COMPBL(NAMES[I]);
END;
DROP I;
RUN;
The results for the last two lines are:
M A R I A R O S A M A R I A D E L R O S A
M A R I A R O S A M A R I A L A R O S A
But I still want a space after getting rid of "DE LA" or "DEL" if they are in the middle of the names.
Thanks.
Actually, there is a space in between each letter for some reason...
@GingerJJ wrote:
Actually, there is a space in between each letter for some reason...
Because you told it to insert a space between each character in the first RegEx.
Remove the trailing | from your macro variable.
You mean "|" in this?
%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE|DE LA | DE LA | DEL | JR|;
If I remove |, how do I tell SAS these are separate parts with an "or" relationship?
I changed the macro to :
%LET BAD_PART=DUPLICATE,DO NOT USE,DONOT USE,FILE|RECORD,DO NOT ISSUE,DONT USE,DE LA, DE LA , DEL , JR,;
It didn't work.
Work on fixing the conversion code without the complexity of the ARRAY. Once you have that logic working then add the array back.
Create example data with possible inputs and expected output. Then convert and test if the result is what you expected.
%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE| DE LA | DEL | JR$;
data example;
infile datalines dlm='|' truncover;
input have :$50. expect :$50.;
datalines;
DUPLICATE-DO NOT USE|
JOHN234|JOHN
DO NOT USE (DUPLICATE)|
DUPLICATE RECORD|
DO NOT USE DUPLICATE FILE|
[DONOT USE]|
DUPLICATE FILE DUPLICATE FILE|
'DONT USE'|
(DUPLICATE)|
"""DUPLICATE"""|
DE LA MARIA|DE LA MARIA
MARIA JR.|MARIA
MARIA DEL ROSA|MARIA ROSA
MARIA DELROSA|MARIA DELROSA
MARIA DE LA ROSA|MARIA ROSA
MARIA LAROSA|MARIA LAROSA
;
data out ;
length status 8 try $50 ;
set example ;
* turn anything that is not a letter into single space ;
try = prxchange("s/[^a-z]+/ /i",-1,have);
* convert bad strings into space ;
try = prxchange("s/&bad_part/ /",-1, strip(try));
* collapse to single space and remove leading spaces ;
try= left(compbl(try));
status= try=expect;
run;
proc print;
title "&bad_part";
run;
title;
Results:
DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE| DE LA | DEL | JR$ Obs status try have expect 1 1 DUPLICATE-DO NOT USE 2 1 JOHN JOHN234 JOHN 3 1 DO NOT USE (DUPLICATE) 4 1 DUPLICATE RECORD 5 1 DO NOT USE DUPLICATE FILE 6 1 [DONOT USE] 7 1 DUPLICATE FILE DUPLICATE FILE 8 1 'DONT USE' 9 1 (DUPLICATE) 10 1 """DUPLICATE""" 11 1 DE LA MARIA DE LA MARIA DE LA MARIA 12 1 MARIA MARIA JR. MARIA 13 1 MARIA ROSA MARIA DEL ROSA MARIA ROSA 14 1 MARIA DELROSA MARIA DELROSA MARIA DELROSA 15 1 MARIA ROSA MARIA DE LA ROSA MARIA ROSA 16 1 MARIA LAROSA MARIA LAROSA MARIA LAROSA
It worked and the only changes I need to make are (colored below):
%LET BAD_PART=DUPLICATE|DO NOT USE|DONOT USE|FILE|RECORD|DO NOT ISSUE|DONT USE|DE LA | DE LA | DEL | JR$;
NAMES[I] = prxchange("s/[^A-Z]/ /i",-1,NAMES[I]);
What is the i for? Adding $ is also critical here. So I guess I always need to specify if it is string?
The key thing was removing the | from the END of the macro variable. That caused you to add an empty string to the list of possible values so that you were essentially running this regular expression
s// /
Which says to replace nothing with a space, hence the spaces between every letter.
The i after the closing / means to ignore the case. I couldn't stand reading your code in ALL CAPITALS so when I convert everything to lowercase I added that so it would still match your uppercase strings.
The $ has special meaning in regular expressions. I means match only at the END of the string. Otherwise it would remove the JR from a longer string. JR is probably not a common string to be the beginning of a longer name, but you never know. Try it with and without the $.
Note also that your current list of bad strings will remove the letters FILE from the middle of a longer name like DEFILESE.
The reason it does not match DE LA at the beginning of a name is because you have a leading space in your search pattern.
| DE LA |
You might need to perhaps be consistent and add a space around every search term and add spaces around the string to search.
%LET BAD_PART=DUPLICATE | DO NOT USE | DONOT USE | FILE | RECORD | DO NOT ISSUE | DONT USE | DE LA | DEL | JR $;
NAMES[I] = prxchange("s/ &bad_part/ /",-1,' ' || NAMES[I] || ' ');
But that means DE LA at the beginning of the string will also be removed.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.