Currently, I am facing special characters issue, how shall I replace these special char with 1 space.
data cm0_;
cmtrt = "Metformin Hydrochloride Tablets"; output;
cmtrt = "Newdatarequired";output;
cmtrt = "";output;
run;
Requirement: Metformin Hydrochloride Tablets
Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.
What do you consider "special"?
Perhaps anything that is not between a space and a tilde?
data test;
cmtrt = "Metformin Hydrochloride Tablets";
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;
Is that the only pattern in your dataset? If so, this works.
data want;
set cm0_;
cmtrt_2 = tranwrd(cmtrt, " ", " ");
run;
cmtrt cmtrt_2 Metformin Hydrochloride Tablets Metformin Hydrochloride Tablets Newdatarequired Newdatarequired
Hi, No this is hardcoding, I have other special chars also.
Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.
What do you consider "special"?
Perhaps anything that is not between a space and a tilde?
data test;
cmtrt = "Metformin Hydrochloride Tablets";
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;
This is working:
data test; cmtrt = "Metformin Hydrochloride Tablets"; want = compbl(translate(cmtrt,' ',collate(127,255))); put (_all_) (=/); run;
I want to know more about this: what is happening when 0,31 and 127,255 considered?
collate(0,31)||collate(127,255)
In a single byte encoding there are only 256 possible codes from 0 to 255. The COLLATE() function just makes it easy to generate a series of characters by there code.
The ASCII code for a space is 32. Any byte less than that is a control character, like a tab or a linefeed.
The ASCII code for a tilde is 126. 127 is a DELETE character. Anything from 128 to 255 has the 8th bit and so is not part of the normal 7-bit ASCII codes. That is where all of the accented characters and other strange glyphs live.
Hello @pdhokriya,
This works for your example in a SAS session using WLATIN1 encoding ...
data want;
set cm0_;
cmtrt=kpropdata(cmtrt,' ','utf-8');
run;
... and suggests that the ' ' string originally was a UTF-8 character of some sort. Therefore, the same solution might work for other "unprintable" UTF-8 characters as well. You can specify 'wlatin1' explicitly as the fourth argument if needed (see documentation of the KPROPDATA function).
@pdhokriya wrote:
Hi , Thank you for reply, this does not work.
What happened when you tried it?
See how it works on my workstation:
1 proc options option=encoding; 2 run; SAS (r) Proprietary Software Release 9.4 TS1M5 ENCODING=WLATIN1 Specifies the default character-set encoding for the SAS session. NOTE: PROCEDURE OPTIONS used (Total process time): real time 0.00 seconds cpu time 0.01 seconds 3 4 data cm0_; 5 cmtrt = "Metformin Hydrochloride Tablets"; output; 6 cmtrt = "Newdatarequired";output; 7 cmtrt = "";output; 8 run; NOTE: The data set WORK.CM0_ has 3 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.03 seconds 9 10 data want; 11 set cm0_; 12 cmtrt=kpropdata(cmtrt,' ','utf-8'); 13 run; NOTE: There were 3 observations read from the data set WORK.CM0_. NOTE: The data set WORK.WANT has 3 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 14 15 proc print data=want; 16 run; NOTE: There were 3 observations read from the data set WORK.WANT. NOTE: PROCEDURE PRINT used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 17 18 proc print data=want; 19 format cmtrt $hex70.; 20 run;
Results:
Obs cmtrt 1 Metformin Hydrochloride Tablets 2 Newdatarequired 3
Obs cmtrt 1 4D6574666F726D696E20487964726F63686C6F72696465205461626C65747320202020 2 4E65776461746172657175697265642020202020202020202020202020202020202020 3 2020202020202020202020202020202020202020202020202020202020202020202020
The two '20'x characters highlighted above in obs. 1 are the blanks that replaced the original 3-byte UTF-8 characters.
Meanwhile I found out that ' ' = 'E38080'x is indeed a space character in UTF-8, called "ideographic space" (code point U+3000) and used in the context of Chinese, Japanese and Korean languages.
References: https://unicode.org/charts/nameslist/n_3000.html and (for the conversion between code points like U+3000 and hexadecimal UTF-8 codes like E38080) https://en.wikipedia.org/wiki/UTF-8
Another way:
data T;
OLD = "Metformin Hydrochloride Tablets";
NEW = compbl(prxchange('s/[^a-zA-Z]/ /',-1,OLD));
put (_ALL_) (=/);
run;
OLD=Metformin Hydrochloride Tablets
NEW=Metformin Hydrochloride Tablets
Note that   is UTF-8 for IDEOGRAPHIC SPACE
When reading an UTF-8 string, you should use UTF-8 encoding.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.