Solved: remove the 'THE' from the 'variable' which end in 'THE' and start with...

Alexxxxxxx · Posted 03-05-2019 02:15 PM

Dear all,

How can I remove the 'THE' from the 'HRM_L2_Step3'variable which end in 'THE' and start with 'THE'.

for example

for table a

THKDNAWTHE

COIOMMOO THE

THEDONNWWW

THE DOCMESSC

THE DOCMESS THE

THEIMLEMTHE

I expect to get table b

THKDNAWTHE

COIOMMOO

THEDONNWWW

DOCMESSC

DOCMESS

THEIMLEMTHE

Could you please give me some suggestions? thanks in advance

data a;
  infile datalines dlm=',' truncover;
  informat HRM_L2_Step3 $50.;
input HRM_L2_Step3 ;
datalines;
THKDNAWTHE
COIOMMOO THE
THEDONNWWW
THE DOCMESSC
THE DOCMESS THE
THEIMLEMTHE
run;

PeterClemmensen · Posted 03-05-2019 03:05 PM

Here is one way

data a;
  infile datalines dlm=',' truncover;
  informat HRM_L2_Step3 $50.;
input HRM_L2_Step3 ;
datalines;
THKDNAWTHE
COIOMMOO THE
THEDONNWWW
THE DOCMESSC
THE DOCMESS THE
THEIMLEMTHE
run;

data b;
   set a;
   HRM_L2_Step3=prxchange('s/\sTHE|THE\s//',-1,HRM_L2_Step3);
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 03-05-2019 03:05 PM

Here is one way

data a;
  infile datalines dlm=',' truncover;
  informat HRM_L2_Step3 $50.;
input HRM_L2_Step3 ;
datalines;
THKDNAWTHE
COIOMMOO THE
THEDONNWWW
THE DOCMESSC
THE DOCMESS THE
THEIMLEMTHE
run;

data b;
   set a;
   HRM_L2_Step3=prxchange('s/\sTHE|THE\s//',-1,HRM_L2_Step3);
run;

The DATA to DATA Step Macro
Blog: SASnrd

Alexxxxxxx · Posted 03-05-2019 06:38 PM

Dear draycut,

thanks for your advice.

Could you please explain the meaning of '

prxchange('s/\sTHE|THE\s//',-1,

'

I really expect to know how it works.

thanks for your attention to this matter.

PeterClemmensen · Posted 03-05-2019 06:53 PM

Sure, I will give it a try. However, if you are not familiar with regular expressions, some of it may not make sense 🙂

The PRXCHANGE Functions first argument is's/\sTHE|THE\s//'.

The first s means that I want to substitute the pattern between the first and second slash with the text between the second and third slash (in this case nothing). The \sTHE means that I search for a pattern matching a whitespace followed by the string "THE". | is an 'or symbol', meaning that I search for a pattern matching either the string "THE" with a blank before OR after it.

The -1 in the next argument means that I keep searching for and replacing the matched patterns until the end of the input string.

Hope this makes sense

The DATA to DATA Step Macro
Blog: SASnrd

Alexxxxxxx · Posted 03-05-2019 07:13 PM

I really appreciate your advice. I have understood your meaning

PeterClemmensen · Posted 03-05-2019 07:18 PM

I'm glad you found your answer 🙂

The DATA to DATA Step Macro
Blog: SASnrd

Alexxxxxxx · Posted 03-05-2019 08:25 PM

Dear draycut,

I face a new problem during the process.

I do not want to remove the 'THE' form below observations during the process.

AGI THERAPEUTICS

RAPEUTICSTHE AGI

Could you please give me some suggestions?

noling · Posted 03-05-2019 04:07 PM

Tranwrd is nice for replacing text. The transtrN can replace text with "" (nothing).

data want;
set a;
HRM_L2_Step3 = transtrN(HRM_L2_Step3,"THE",trimn(''));
run;

Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

Astounding · Posted 03-05-2019 04:27 PM

Just in case the previous suggestions aren't 100% working, here is a possibility:

data want;

set have;

if HRM_L2_Step3 =: 'THE ' then HRM_L2_Step3 = substr(HRM_L2_Step3, 4);

if scan(HRM_L2_Step3, -1) = 'THE' then

substr(HRM_L2_Step3, length(HRM_L2_Step3) - 2, 3) = ' ';

run;

Alexxxxxxx · Posted 03-05-2019 08:32 PM

Dear Astounding,

thanks for your advice.

However, for 'THE DOCMESSC', I get ' DOCMESSC'. How can I remove the blank which before 'DOCMESSC'?

Could you please give me some suggestions?

Astounding · Posted 03-05-2019 08:48 PM

Sorry, my fault on this line:

substr(HRM_L2_Step3, 4);

Change it to:

substr(HRM_L2_Step3, 5);

Alexxxxxxx · Posted 03-05-2019 09:37 PM

Thanks, Astounding, it works.

However, I have a new problem during the process.

For 'THE * ALUMINIUM POWDER COMPANY', how can I remove 'THE * ' (i.e., only keep 'ALUMINIUM POWDER COMPANY')?

besides, what will happen if I set

data want ;

set a;

if HRM_L2_Step3 =: 'THE ' then HRM_L2_Step3 = substr(HRM_L2_Step3, 5);

if scan(HRM_L2_Step3, -1) = 'THE' then

substr(HRM_L2_Step3, length(HRM_L2_Step3) - 2, 3) = '';

run;

rather than

data want ;

set a;

if HRM_L2_Step3 =: 'THE ' then HRM_L2_Step3 = substr(HRM_L2_Step3, 5);

if scan(HRM_L2_Step3, -1) = 'THE' then

substr(HRM_L2_Step3, length(HRM_L2_Step3) - 2, 3) = '   ';

run;

I did not find any difference when I delete the blanks between two '

thanks for your help.

thanks in advance.

Astounding · Posted 03-05-2019 10:25 PM

I was sure that using 3 blanks would work. I wasn't sure what would happen if I switched to 1 (or none), so used what I knew would work.

There are probably many characters that you might want to remove, that your original specifications didn't cover. What would happen, for example, if you had THE plus 2 blanks at the beginning of a line? What if any of these characters were the 5th character in the line, following "THE "?

* (as you mentioned)

-

.

$

%

#

Perhaps you should run this program after cleaning up the "THE" occurrences:

proc freq data=have;

tables HRM_L2_Step3;

format HRM_L2_Step3 $1.;

run;

That will show you how many times each possible first character actually appears in the entries that remain.

remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

Re: remove the 'THE' from the 'variable' which end in 'THE' and start with 'THE'.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away