Solved: Re: How to remove the last '_' and number?

ybz12003 · Posted 07-04-2017 09:43 PM

Hello:

I would like to remove the last '_' and number from the strings below. Please advice how. Thanks.

data test1;

input Names:$100.;

datalines;

dx_1

dx_10

state_id_5

state_id_9

if_mg_hk_4

;

run;

I am looking for the results is like below.

list

dx

state_id

if_mg_hk

Jagadishkatam · Posted 07-04-2017 10:16 PM

Alternatively with perl regular expression

data test1;
input Names:$100.;
names=prxchange('s/\_\d+//',-1,compress(names));
put names=;
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;

Thanks,
Jag

View solution in original post

Reeza · Posted 07-04-2017 09:46 PM

Bit of a long way around, but find out the length of the last component which will be 2 or 3 it seems and then use SUBSTR to extract from 1 to the length of the string - 2/3.

data test1;
    input Names:$100.;

    last=scan(names, -1, "_");
    word=substr(names, 1, length(names)-length(last)-1);

    datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;

proc print ;
run;

Jagadishkatam · Posted 07-04-2017 10:16 PM

Alternatively with perl regular expression

data test1;
input Names:$100.;
names=prxchange('s/\_\d+//',-1,compress(names));
put names=;
datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
;
run;

Thanks,
Jag

ybz12003 · Posted 07-04-2017 10:25 PM

What is the '\_\d+/' refering to?

SAS_inquisitive · Posted 07-04-2017 11:04 PM

Regular expression is kind of seperate language in SAS having its own syntax. This paper has nice introduction of it.

http://www2.sas.com/proceedings/sugi29/265-29.pdf

Jagadishkatam · Posted 07-05-2017 12:22 AM

\_ and \d are metacharacters

\_ : recognizes the underscore( _ )

\d : recognizes the digit (0-9)

+ : recognize the charcater/digit one or more times

\_\d+ : recognize the underscore followed by digit (one of more time)

// : replace the metacharacters identified and make them blank.

Thanks,
Jag

Patrick · Posted 07-05-2017 05:33 AM

@ybz12003

Use the syntax which returns the result you want and which is easiest for you to understand and maintain.

Below 3 coding version (word1 to word3) for you to pick and choose.

The sample code and output hopefully also illustrates that it's really important to provide "nasty" sample data which covers as many cases as you can think of.

data test1;
  infile datalines truncover;
  input Names:$100.;
  length word1 word2 word3 last $100;

  last=scan(names, -1, "_");
  word1=substr(names, 1, length(names)-length(last)-1);
  word2=substr(names, 1, findc(names,'_',-length(names))-1);
  word3=prxchange('s/_\d+\s*$//oi',1,names);
  datalines;
dx_1
dx_10
state_id_5
state_id_9
if_mg_hk_4
if_mg_hk_
if_mg_hk
if
if_mg_5_hk_4
if_mg_5_hk_
if_mg_5_hk
 
_5_tt
_5tt
 
9_5_tt
_9_5tt
;
run;

ybz12003 · Posted 07-05-2017 07:51 AM

Thanks for all of your kind help, and great explaination.

Patrick · Posted 07-05-2017 09:38 AM

@ybz12003

Even though the solution you've accepted returns the desired result for you sample data, I believe the RegEx needs a small amendment to not cause issues for your real data.

I suggest you add a $ to the RegEx to ensure that it only removes _<digits> at the end of a string.

The $ stands for "end of string"

You can also replace the -1 with a 1

-1 means that the pattern gets searched and replaces as many time as it's found. You should have only a single matching pattern at maximum so using a 1 without the minus appears appropriate.

names=prxchange('s/\_\d+$//',1,trim(names));

Without the $ ALL _<digits> patterns would get removed and a source string like ABC_5_XX_10 would become ABC_XX

And instead of compress() use trim()

Compress removes all blanks from a string - also the ones in the middle of the string - which potentially changes your source string more than what you have been asking for. Trim() only removes trailing blanks which ensures that after the last _<digits> pattern there won't be any blanks before the <end of string> $

Alternatively the RegEx could also look like:

names=prxchange('s/\_\d+\s*$//',1,names);

\s* represents 0 to many blanks and though you don't have to remove the trailing blanks anymore as now the RegEx caters for such a case.

ybz12003 · Posted 07-05-2017 09:46 AM

Thanks for your detail explaination, Patrick.

Registration is open

SAS Training: Just a Click Away