Hey guys,
I need to remove the words (or terms) with numbers in them from a string. I have tried compress, translate, tranwrd and prxchange but no luck.
Code I have used:
data cleaned;
set &SRC_DTST;
NO_NUM_SCHAR=COMPBL(TRANSLATE(upcase(&COLUMN), " " , ".,;:?!-/\+[]%1234567890$@#){}'|^&~*<>("));
NO_NUM_SCHAR=COMPRESS(NO_NUM_SCHAR,,'KAW');
NO_NUM_SCHAR = prxchange('s/\s+/ /oi',-1,trim(NO_NUM_SCHAR));
NO_NUM_SCHAR = TRANWRD(NO_NUM_SCHAR, '09'x, '');
NO_STP_WRD=prxchange('s/\b(JR|SR|III|IV|DECD|THE|A|AN|I|HE|SHE|WE|IT|THEM|TO|AND|AS|OF|FROM|TO|ABOARD|IF|II|IV|OR|NON|ABOUT|HAVE|HAD|HOW|ONE|
NOT|BEEN|ABOVE|ACROSS|AFTER|AGAINST|ALONG|AMID|AMONG|ANTI|AROUND|AS|AT|BEFORE|BEHIND|BELOW|BENEATH|BESIDE|BESIDES|BETWEEN|
BEYOND|BUT|BY|CONCERNING|CONSIDERING|DESPITE|DOWN|DURING|EXCEPT|EXCEPTING|EXCLUDING|FOLLOWING|FOR|FROM|IN|INSIDE|INTO|LIKE|
MINUS|NEAR|OF|OFF|ON|ONTO|OPPOSITE|OUTSIDE|OVER|PAST|PER|PLUS|REGARDING|ROUND|SAVE|SINCE|THAN|THROUGH|TO|TOWARD|TOWARDS|
UNDER|UNDERNEATH|UNLIKE|UNTIL|UP|UPON|VERSUS|VIA|WITH|WITHIN|WITHOUT|FULL|TYPE|NONE|OTHER|MUST|NON|B|C|D|E|F|G|H|I|J|K|L|
M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z)\b/ /o',-1,NO_NUM_SCHAR);
cleaned_desc = COMPBL(STRIP(NO_STP_WRD));
DROP NO_NUM_SCHAR NO_STP_WRD;
run;
/* next step is for removing duplicate words*/
data cleaned(keep=Concatenated_Categories LOV_LONG_DSC cleaned_desc);
set cleaned;
newstring=scan(cleaned_desc, 1, ' ');
do i=2 to countw(cleaned_desc,' ');
word=scan(cleaned_desc, i, ' ');
found=find(newstring, word, 'it');
if found=0 then newstring=catx(' ', newstring, word);
end;
cleaned_desc= newstring;
DROP newstring;
run;
Input:
ATN1 (atrophin 1) (eg, dentatorubral-pallidoluysian atrophy) gene analysis, evaluation to detect abnormal (eg, expanded) alleles
My output:
ATN ATROPHIN EG DENTATORUBRAL PALLIDOLUYSIAN ATROPHY GENE ANALYSIS EVALUATION DETECT ABNORMAL EXPANDED ALLELES
Expected output:
ATROPHIN EG DENTATORUBRAL PALLIDOLUYSIAN ATROPHY GENE ANALYSIS EVALUATION DETECT ABNORMAL EXPANDED ALLELES
Also good to have:
I also want to remove any two-letter words from the string, such as 'EG' in this case.
Any guidance will be greatly appreciated.
Hi @mosabbirfardin ,
Your requirements need 3 steps.
In the first step, every character in '()-,.' is replaced by a space.
The second step looks for 2-character words and removes them.
The third step looks for words containing a digit and removes them.
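data _null_;
length txt new_txt $ 200 word $ 50;
txt = 'ATN1 (atrophin 1) (eg, dentatorubral-pallidoluysian atrophy) gene analysis, evaluation to detect abnormal (eg, expanded) alleles';
/* Step 1: turn punctuation into spaces */
txt = translate(txt,' ','()-,.');
/* Steps 2 and 3: rebuild the string word by word, so that whole words */
/* are tested and the scan position never shifts while looping */
do i = 1 to countw(txt, ' ');
word = scan(txt, i, ' ');
/* keep a word only if it is not 2 characters long and has no digit */
if length(word) ne 2 and not anydigit(word) then new_txt = catx(' ', new_txt, word);
end;
txt = upcase(new_txt);
put txt =;
run;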
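As a side note on why the original code leaves ATN behind: its TRANSLATE call lists the digits 0-9 among the characters to blank out, so ATN1 has already become ATN before any word-level check can run. The step below is only a minimal sketch of that effect on a shortened sample string; the variable names raw, stripped and has_digit are made up for the illustration.

```sas
data _null_;
  length raw stripped $ 40;
  raw = 'ATN1 (atrophin 1)';

  /* Digits in the from-list: the 1 in ATN1 is blanked, so ATN survives as a word */
  stripped = compbl(translate(upcase(raw), ' ', '()1234567890'));
  put stripped=;

  /* Digits left out of the from-list: ATN1 stays intact and can still be detected */
  stripped = compbl(translate(upcase(raw), ' ', '()'));
  has_digit = anydigit(scan(stripped, 1, ' ')) > 0;
  put stripped= has_digit=;
run;
```

So if you keep your own pipeline, drop the words containing digits first and only then strip the remaining digits and special characters.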
Please post the data you have in usable form (data step with datalines) so that we know exactly what you have.
The following step removes all two-letter words:
data narf;
length Text $ 200;
input Text &;
output;
Text = prxchange('s/\b(\w\w)\b/ /', -1, Text);
output;
datalines;
If you use regular-expression-id, the PRXCHANGE function searches the variable source with the regular-expression-id that is returned by PRXPARSE.
It returns the value in source with the changes that were specified by the regular expression.
If there is no match, PRXCHANGE returns the unchanged value in source.
run;
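The same regular-expression approach can probably be extended to words that contain a digit. The step below is a sketch, not tested against your real data: it assumes that a "word" is a run of \w characters (letters, digits, underscore) and uses a shortened version of the sample string from the question.

```sas
data _null_;
  length txt $ 200;
  txt = 'ATN1 (atrophin 1) (eg, dentatorubral-pallidoluysian atrophy) gene analysis';

  /* Turn punctuation into spaces so every token is delimited by blanks */
  txt = translate(txt, ' ', '()-,.');

  /* Remove every token that contains at least one digit */
  txt = prxchange('s/\b\w*\d\w*\b/ /', -1, txt);

  /* Remove every remaining two-character token */
  txt = prxchange('s/\b\w\w\b/ /', -1, txt);

  /* Collapse the leftover blanks and upcase */
  txt = upcase(compbl(strip(txt)));
  put txt=;
run;
```

Running the digit pattern before the two-character pattern keeps the two rules independent; a token such as A1 is then removed by the first rule rather than the second.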