Hello:
I would like to remove the last '_' and number from the strings below. Please advice how. Thanks.
data test1;
input Names:$100.;
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;
I am looking for the results is like below.
list
dx
dx
state_id
state_id
if_mg_hk
Alternatively with perl regular expression
data test1;
input Names:$100.;
names=prxchange('s/\_\d+//',-1,compress(names));
put names=;
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;
Bit of a long way around, but find out the length of the last component which will be 2 or 3 it seems and then use SUBSTR to extract from 1 to the length of the string - 2/3.
data test1;
input Names:$100.;
last=scan(names, -1, "_");
word=substr(names, 1, length(names)-length(last)-1);
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;
proc print ;
run;
Alternatively with perl regular expression
data test1;
input Names:$100.;
names=prxchange('s/\_\d+//',-1,compress(names));
put names=;
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;
What is the '\_\d+/' refering to?
Regular expression is kind of seperate language in SAS having its own syntax. This paper has nice introduction of it.
\_ and \d are metacharacters
\_ : recognizes the underscore( _ )
\d : recognizes the digit (0-9)
+ : recognize the charcater/digit one or more times
\_\d+ : recognize the underscore followed by digit (one of more time)
// : replace the metacharacters identified and make them blank.
Use the syntax which returns the result you want and which is easiest for you to understand and maintain.
Below 3 coding version (word1 to word3) for you to pick and choose.
The sample code and output hopefully also illustrates that it's really important to provide "nasty" sample data which covers as many cases as you can think of.
data test1;
infile datalines truncover;
input Names:$100.;
length word1 word2 word3 last $100;
last=scan(names, -1, "_");
word1=substr(names, 1, length(names)-length(last)-1);
word2=substr(names, 1, findc(names,'_',-length(names))-1);
word3=prxchange('s/_\d+\s*$//oi',1,names);
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
if_mg_hk_
if_mg_hk
if
if_mg_5_hk_4
if_mg_5_hk_
if_mg_5_hk
_5_tt
_5tt
9_5_tt
_9_5tt
;
run;
Thanks for all of your kind help, and great explaination.
Even though the solution you've accepted returns the desired result for you sample data, I believe the RegEx needs a small amendment to not cause issues for your real data.
I suggest you add a $ to the RegEx to ensure that it only removes _<digits> at the end of a string.
The $ stands for "end of string"
You can also replace the -1 with a 1
-1 means that the pattern gets searched and replaces as many time as it's found. You should have only a single matching pattern at maximum so using a 1 without the minus appears appropriate.
names=prxchange('s/\_\d+$//',1,trim(names));
Without the $ ALL _<digits> patterns would get removed and a source string like ABC_5_XX_10 would become ABC_XX
And instead of compress() use trim()
Compress removes all blanks from a string - also the ones in the middle of the string - which potentially changes your source string more than what you have been asking for. Trim() only removes trailing blanks which ensures that after the last _<digits> pattern there won't be any blanks before the <end of string> $
Alternatively the RegEx could also look like:
names=prxchange('s/\_\d+\s*$//',1,names);
\s* represents 0 to many blanks and though you don't have to remove the trailing blanks anymore as now the RegEx caters for such a case.
Thanks for your detail explaination, Patrick.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.