BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Alexxxxxxx
Pyrite | Level 9

Dear all,

 

How can I remove the 'THE' from the 'HRM_L2_Step3'variable which end in 'THE' and start with 'THE'.

for example 

for table a

THKDNAWTHE

COIOMMOO THE

THEDONNWWW

THE DOCMESSC

THE DOCMESS THE

THEIMLEMTHE

I expect to get table b

THKDNAWTHE

COIOMMOO

THEDONNWWW

DOCMESSC

DOCMESS

THEIMLEMTHE

Could you please give me some suggestions? thanks in advance

 

 

data a;
  infile datalines dlm=',' truncover;
  informat HRM_L2_Step3 $50.;
input HRM_L2_Step3 ;
datalines;
THKDNAWTHE
COIOMMOO THE
THEDONNWWW
THE DOCMESSC
THE DOCMESS THE
THEIMLEMTHE
run;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Here is one way

 

data a;
  infile datalines dlm=',' truncover;
  informat HRM_L2_Step3 $50.;
input HRM_L2_Step3 ;
datalines;
THKDNAWTHE
COIOMMOO THE
THEDONNWWW
THE DOCMESSC
THE DOCMESS THE
THEIMLEMTHE
run;

data b;
   set a;
   HRM_L2_Step3=prxchange('s/\sTHE|THE\s//',-1,HRM_L2_Step3);
run;

View solution in original post

12 REPLIES 12
PeterClemmensen
Tourmaline | Level 20

Here is one way

 

data a;
  infile datalines dlm=',' truncover;
  informat HRM_L2_Step3 $50.;
input HRM_L2_Step3 ;
datalines;
THKDNAWTHE
COIOMMOO THE
THEDONNWWW
THE DOCMESSC
THE DOCMESS THE
THEIMLEMTHE
run;

data b;
   set a;
   HRM_L2_Step3=prxchange('s/\sTHE|THE\s//',-1,HRM_L2_Step3);
run;
Alexxxxxxx
Pyrite | Level 9

Dear draycut,

 

thanks for your advice.

 

Could you please explain the meaning of '

prxchange('s/\sTHE|THE\s//',-1,

'

 

I really expect to know how it works. 

thanks for your attention to this matter.

PeterClemmensen
Tourmaline | Level 20

Sure, I will give it a try. However, if you are not familiar with regular expressions, some of it may not make sense 🙂

 

The PRXCHANGE Functions first argument is's/\sTHE|THE\s//'. 

 

The first s means that I want to substitute the pattern between the first and second slash with the text between the second and third slash (in this case nothing). The \sTHE means that I search for a pattern matching a whitespace followed by the string "THE". | is an 'or symbol', meaning that I search for a pattern matching either the string "THE" with a blank before OR after it.

 

The -1 in the next argument means that I keep searching for and replacing the matched patterns until the end of the input string.

 

Hope this makes sense

 

 

 

Alexxxxxxx
Pyrite | Level 9
I really appreciate your advice. I have understood your meaning
Alexxxxxxx
Pyrite | Level 9

Dear draycut,

 

I face a new problem during the process.

 

I do not want to remove the 'THE' form below observations during the process.

 

AGI THERAPEUTICS

RAPEUTICSTHE AGI

 

Could you please give me some suggestions?

noling
SAS Employee

Tranwrd is nice for replacing text. The transtrN can replace text with "" (nothing).

 

data want;
  set a;
  HRM_L2_Step3 = transtrN(HRM_L2_Step3,"THE",trimn(''));
run;


Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

Astounding
PROC Star

Just in case the previous suggestions aren't 100% working, here is a possibility:

 

data want;

set have;

if HRM_L2_Step3 =: 'THE ' then HRM_L2_Step3 = substr(HRM_L2_Step3, 4);

if scan(HRM_L2_Step3, -1) = 'THE' then 

substr(HRM_L2_Step3, length(HRM_L2_Step3) - 2, 3) = '   ';

run;

Alexxxxxxx
Pyrite | Level 9

Dear Astounding,

 

thanks for your advice.

 

However, for 'THE DOCMESSC', I get ' DOCMESSC'. How can I remove the blank which before 'DOCMESSC'?

 

Could you please give me some suggestions?

 

Astounding
PROC Star
Sorry, my fault on this line:

substr(HRM_L2_Step3, 4);

Change it to:

substr(HRM_L2_Step3, 5);
Alexxxxxxx
Pyrite | Level 9

Thanks, Astounding, it works. 

However, I have a new problem during the process.

For 'THE * ALUMINIUM POWDER COMPANY', how can I remove 'THE * ' (i.e., only keep 'ALUMINIUM POWDER COMPANY')?

 

besides, what will happen if I set 

 

data want ;

set a;

if HRM_L2_Step3 =: 'THE ' then HRM_L2_Step3 = substr(HRM_L2_Step3, 5);

if scan(HRM_L2_Step3, -1) = 'THE' then

substr(HRM_L2_Step3, length(HRM_L2_Step3) - 2, 3) = '';

run;

 

rather than 

data want ;

set a;

if HRM_L2_Step3 =: 'THE ' then HRM_L2_Step3 = substr(HRM_L2_Step3, 5);

if scan(HRM_L2_Step3, -1) = 'THE' then

substr(HRM_L2_Step3, length(HRM_L2_Step3) - 2, 3) = '   ';

run;

 

I did not find any difference when I delete the blanks between two ' 

 

thanks for your help.

 

 

thanks in advance.

Astounding
PROC Star

I was sure that using 3 blanks would work.  I wasn't sure what would happen if I switched to 1 (or none), so used what I knew would work.

 

There are probably many characters that you might want to remove, that your original specifications didn't cover.  What would happen, for example, if you had THE plus 2 blanks at the beginning of a line?  What if any of these characters were the 5th character in the line, following "THE "?

 

* (as you mentioned)

-

.

$

%

#

 

Perhaps you should run this program after cleaning up the "THE" occurrences:

 

proc freq data=have;

tables HRM_L2_Step3;

format HRM_L2_Step3 $1.;

run;

 

That will show you how many times each possible first character actually appears in the entries that remain.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 949 views
  • 1 like
  • 4 in conversation