mosabbirfardin
Fluorite | Level 6

Hey guys,

I need to remove words (or terms) that contain digits from a string. I have tried COMPRESS, TRANSLATE, TRANWRD and PRXCHANGE, but no luck.

Code I have used:

data cleaned;
        set &SRC_DTST;
        NO_NUM_SCHAR=COMPBL(TRANSLATE(upcase(&COLUMN), " " , ".,;:?!-/\+[]%1234567890$@#){}'|^&~*<>("));
        NO_NUM_SCHAR=COMPRESS(NO_NUM_SCHAR,,'KAW');
        NO_NUM_SCHAR = prxchange('s/\s+/ /oi',-1,trim(NO_NUM_SCHAR));
        NO_NUM_SCHAR = TRANWRD(NO_NUM_SCHAR, '09'x, '');
        NO_STP_WRD=prxchange('s/\b(JR|SR|III|IV|DECD|THE|A|AN|I|HE|SHE|WE|IT|THEM|TO|AND|AS|OF|FROM|TO|ABOARD|IF|II|IV|OR|NON|ABOUT|HAVE|HAD|HOW|ONE|
                NOT|BEEN|ABOVE|ACROSS|AFTER|AGAINST|ALONG|AMID|AMONG|ANTI|AROUND|AS|AT|BEFORE|BEHIND|BELOW|BENEATH|BESIDE|BESIDES|BETWEEN|
				BEYOND|BUT|BY|CONCERNING|CONSIDERING|DESPITE|DOWN|DURING|EXCEPT|EXCEPTING|EXCLUDING|FOLLOWING|FOR|FROM|IN|INSIDE|INTO|LIKE|
				MINUS|NEAR|OF|OFF|ON|ONTO|OPPOSITE|OUTSIDE|OVER|PAST|PER|PLUS|REGARDING|ROUND|SAVE|SINCE|THAN|THROUGH|TO|TOWARD|TOWARDS|
				UNDER|UNDERNEATH|UNLIKE|UNTIL|UP|UPON|VERSUS|VIA|WITH|WITHIN|WITHOUT|FULL|TYPE|NONE|OTHER|MUST|NON|B|C|D|E|F|G|H|I|J|K|L|
				M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z)\b/ /o',-1,NO_NUM_SCHAR);
        cleaned_desc = COMPBL(STRIP(NO_STP_WRD));
        DROP NO_NUM_SCHAR NO_STP_WRD;
run;

/* next step removes duplicate words */
data cleaned(keep=Concatenated_Categories LOV_LONG_DSC cleaned_desc);
    set cleaned;
    newstring = scan(cleaned_desc, 1, ' ');
    do i = 2 to countw(cleaned_desc, ' ');
        word = scan(cleaned_desc, i, ' ');
        found = find(newstring, word, 'it');
        if found = 0 then newstring = catx(' ', newstring, word);
    end;
    cleaned_desc = newstring;
    drop newstring;
run;

Input:

ATN1 (atrophin 1) (eg, dentatorubral-pallidoluysian atrophy) gene analysis, evaluation to detect abnormal (eg, expanded) alleles

 

My output:

ATN ATROPHIN EG DENTATORUBRAL PALLIDOLUYSIAN ATROPHY GENE ANALYSIS EVALUATION DETECT ABNORMAL EXPANDED ALLELES

 

Expected output:

ATROPHIN EG DENTATORUBRAL PALLIDOLUYSIAN ATROPHY GENE ANALYSIS EVALUATION DETECT ABNORMAL EXPANDED ALLELES

 

Also good to have:

I also want to remove any 2-letter words from the string, such as 'EG' in this case.

 

Any guidance will be greatly appreciated.

1 ACCEPTED SOLUTION

KachiM
Rhodochrosite | Level 12

Hi @mosabbirfardin ,

 

Your requirements need 3 steps.

In the first step, each character in '()-,.' is replaced by a space.

The second step drops every 2-character word.

The third step drops every word that contains a digit.

Words are tested one at a time and the keepers are appended to a new string, so a short word such as 'to' is never removed from inside a longer word such as 'dentatorubral'.

data _null_;
   length txt new_txt $ 200 word $ 50;
   txt = 'ATN1 (atrophin 1) (eg, dentatorubral-pallidoluysian atrophy) gene analysis, evaluation to detect abnormal (eg, expanded) alleles';
   /* Step 1: turn each of ( ) - , . into a blank */
   txt = translate(txt, ' ', '()-,.');
   do i = 1 to countw(txt, ' ');
      word = scan(txt, i, ' ');
      /* Steps 2 and 3: keep the word unless it has 2 characters or contains a digit */
      if length(word) ne 2 and not anydigit(word) then new_txt = catx(' ', new_txt, word);
   end;
   txt = upcase(new_txt);
   put txt=;
run;
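
The same three steps can also be sketched with PRXCHANGE alone. This is only an illustrative variant, not part of the original answer. It assumes that a "word" is any run of \w characters (letters, digits, underscore), so punctuation such as the hyphen acts as a word boundary. The 200-byte length is an arbitrary choice.

```sas
data _null_;
   length txt $ 200;   /* assumed length, adjust to your data */
   txt = 'ATN1 (atrophin 1) (eg, dentatorubral-pallidoluysian atrophy) gene analysis, evaluation to detect abnormal (eg, expanded) alleles';
   /* Blank out every token that contains at least one digit (ATN1, 1) */
   txt = prxchange('s/\b\w*\d\w*\b/ /', -1, txt);
   /* Blank out every two-character token (eg, to) */
   txt = prxchange('s/\b\w\w\b/ /', -1, txt);
   /* Turn the remaining punctuation into blanks, collapse blanks, uppercase */
   txt = strip(upcase(compbl(translate(txt, ' ', '()-,.'))));
   put txt=;
run;
```

Because \b anchors both ends of each pattern, only whole tokens are blanked. For the sample string the log should show ATROPHIN DENTATORUBRAL PALLIDOLUYSIAN ATROPHY GENE ANALYSIS EVALUATION DETECT ABNORMAL EXPANDED ALLELES.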

2 REPLIES
andreas_lds
Jade | Level 19

Please post the data you have in usable form (data step with datalines) so that we know exactly what you have.

The following step removes all two-letter words:

data narf;
   length Text $ 200;
   input Text &;

   output;
   Text = prxchange('s/\b(\w\w)\b/ /', -1, Text);
   output;

   datalines;
If you use regular-expression-id, the PRXCHANGE function searches the variable source with the regular-expression-id that is returned by PRXPARSE. 
It returns the value in source with the changes that were specified by the regular expression. 
If there is no match, PRXCHANGE returns the unchanged value in source. 
;
run;
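
For a whole column rather than a single string, one possible sketch compiles each pattern once with PRXPARSE and reuses the returned regular-expression-id on every row. The dataset names have and want, the variable Cleaned, and the two pattern variables are hypothetical. The digit pattern is an assumption about what counts as a "word with a number" (any \w token containing a digit).

```sas
data have;
   length Text $ 200;
   input Text &;
   datalines;
ATN1 (atrophin 1) (eg, dentatorubral-pallidoluysian atrophy) gene analysis, evaluation to detect abnormal (eg, expanded) alleles
;
run;

data want;
   set have;
   length Cleaned $ 200;
   retain rx_digit rx_two;
   /* Compile both patterns once, on the first observation only */
   if _n_ = 1 then do;
      rx_digit = prxparse('s/\b\w*\d\w*\b/ /');   /* tokens containing a digit */
      rx_two   = prxparse('s/\b\w\w\b/ /');       /* two-character tokens */
   end;
   Cleaned = prxchange(rx_digit, -1, Text);
   Cleaned = prxchange(rx_two, -1, Cleaned);
   /* Remaining punctuation to blanks, collapse blanks, uppercase */
   Cleaned = strip(upcase(compbl(translate(Cleaned, ' ', '()-,.'))));
   drop rx_digit rx_two;
run;
```

The digit pattern has to run on the original text, before any step that strips digits. Once the digits in ATN1 have been translated to blanks, nothing marks ATN as having been part of a numbered term.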