BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ybz12003
Rhodochrosite | Level 12

Hello:

 

I would like to remove the last '_' and number from the strings below.  Please advice how.  Thanks.

 

data test1;

input Names:$100.;

datalines;

dx_1

dx_10

state_id_5

state_id_9

if_mg_hk_4

;

run;

 

I am looking for the results is like below.

list

dx

dx

state_id

state_id

if_mg_hk

1 ACCEPTED SOLUTION

Accepted Solutions
Jagadishkatam
Amethyst | Level 16

Alternatively with perl regular expression

 

data test1;
input Names:$100.;
names=prxchange('s/\_\d+//',-1,compress(names));
put names=;
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;
Thanks,
Jag

View solution in original post

9 REPLIES 9
Reeza
Super User

Bit of a long way around, but find out the length of the last component which will be 2 or 3 it seems and then use SUBSTR to extract from 1 to the length of the string - 2/3.

 

data test1;
    input Names:$100.;

    last=scan(names, -1, "_");
    word=substr(names, 1, length(names)-length(last)-1);

    datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;

proc print ;
run;
Jagadishkatam
Amethyst | Level 16

Alternatively with perl regular expression

 

data test1;
input Names:$100.;
names=prxchange('s/\_\d+//',-1,compress(names));
put names=;
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;
Thanks,
Jag
ybz12003
Rhodochrosite | Level 12

What is the '\_\d+/' refering to?

 

SAS_inquisitive
Lapis Lazuli | Level 10

Regular expression is kind of seperate  language in SAS having its own syntax. This paper has nice introduction of  it.

http://www2.sas.com/proceedings/sugi29/265-29.pdf

Jagadishkatam
Amethyst | Level 16

\_ and \d are metacharacters

 

\_ : recognizes the underscore( _ )

\d : recognizes the digit (0-9)

 + : recognize the charcater/digit one or more times

 

\_\d+ : recognize the underscore followed by digit (one of more time)

 

// : replace the metacharacters identified and make them blank.

Thanks,
Jag
Patrick
Opal | Level 21

@ybz12003

Use the syntax which returns the result you want and which is easiest for you to understand and maintain.

Below 3 coding version (word1 to word3) for you to pick and choose.

The sample code and output hopefully also illustrates that it's really important to provide "nasty" sample data which covers as many cases as you can think of.

data test1;
  infile datalines truncover;
  input Names:$100.;
  length word1 word2 word3 last $100;

  last=scan(names, -1, "_");
  word1=substr(names, 1, length(names)-length(last)-1);
  word2=substr(names, 1, findc(names,'_',-length(names))-1);
  word3=prxchange('s/_\d+\s*$//oi',1,names);
  datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
if_mg_hk_
if_mg_hk
if
if_mg_5_hk_4
if_mg_5_hk_
if_mg_5_hk
 
_5_tt
_5tt
 
9_5_tt
_9_5tt
;
run;

 

 Capture.JPG

ybz12003
Rhodochrosite | Level 12

Thanks for all of your kind help, and great explaination.

Patrick
Opal | Level 21

@ybz12003

Even though the solution you've accepted returns the desired result for you sample data, I believe the RegEx needs a small amendment to not cause issues for your real data.

I suggest you add a $ to the RegEx to ensure that it only removes _<digits> at the end of a string.

The $ stands for "end of string"

 

You can also replace the -1 with a 1  

-1 means that the pattern gets searched and replaces as many time as it's found. You should have only a single matching pattern at maximum so using a 1 without the minus appears appropriate.

 

names=prxchange('s/\_\d+$//',1,trim(names));

 

Without the $ ALL _<digits> patterns would get removed and a source string like ABC_5_XX_10 would become ABC_XX

 

And instead of compress() use trim()

Compress removes all blanks from a string - also the ones in the middle of the string - which potentially changes your source string more than what you have been asking for. Trim() only removes trailing blanks which ensures that after the last _<digits> pattern there won't be any blanks before the <end of string> $ 

 

Alternatively the RegEx could also look like:

names=prxchange('s/\_\d+\s*$//',1,names);

 

\s* represents 0 to many blanks and though you don't have to remove the trailing blanks anymore as now the RegEx caters for such a case. 

ybz12003
Rhodochrosite | Level 12

Thanks for your detail explaination, Patrick. 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 3084 views
  • 5 likes
  • 5 in conversation