Currently, I am facing special characters issue, how shall I replace these special char with 1 space.
data cm0_;
cmtrt = "Metformin Hydrochloride Tablets"; output;
cmtrt = "Newdatarequired";output;
cmtrt = "";output;
run;
Requirement: Metformin Hydrochloride Tablets
Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.
What do you consider "special"?
Perhaps anything that is not between a space and a tilde?
data test;
cmtrt = "Metformin Hydrochloride Tablets";
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;
Is that the only pattern in your dataset? If so, this works.
data want;
set cm0_;
cmtrt_2 = tranwrd(cmtrt, " ", " ");
run;
cmtrt cmtrt_2 Metformin Hydrochloride Tablets Metformin Hydrochloride Tablets Newdatarequired Newdatarequired
Hi, No this is hardcoding, I have other special chars also.
Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.
What do you consider "special"?
Perhaps anything that is not between a space and a tilde?
data test;
cmtrt = "Metformin Hydrochloride Tablets";
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;
This is working:
data test; cmtrt = "Metformin Hydrochloride Tablets"; want = compbl(translate(cmtrt,' ',collate(127,255))); put (_all_) (=/); run;
I want to know more about this: what is happening when 0,31 and 127,255 considered?
collate(0,31)||collate(127,255)
In a single byte encoding there are only 256 possible codes from 0 to 255. The COLLATE() function just makes it easy to generate a series of characters by there code.
The ASCII code for a space is 32. Any byte less than that is a control character, like a tab or a linefeed.
The ASCII code for a tilde is 126. 127 is a DELETE character. Anything from 128 to 255 has the 8th bit and so is not part of the normal 7-bit ASCII codes. That is where all of the accented characters and other strange glyphs live.
Hello @pdhokriya,
This works for your example in a SAS session using WLATIN1 encoding ...
data want;
set cm0_;
cmtrt=kpropdata(cmtrt,' ','utf-8');
run;
... and suggests that the ' ' string originally was a UTF-8 character of some sort. Therefore, the same solution might work for other "unprintable" UTF-8 characters as well. You can specify 'wlatin1' explicitly as the fourth argument if needed (see documentation of the KPROPDATA function).
@pdhokriya wrote:
Hi , Thank you for reply, this does not work.
What happened when you tried it?
See how it works on my workstation:
1 proc options option=encoding;
2 run;
SAS (r) Proprietary Software Release 9.4 TS1M5
ENCODING=WLATIN1 Specifies the default character-set encoding for the SAS session.
NOTE: PROCEDURE OPTIONS used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
3
4 data cm0_;
5 cmtrt = "Metformin Hydrochloride Tablets"; output;
6 cmtrt = "Newdatarequired";output;
7 cmtrt = "";output;
8 run;
NOTE: The data set WORK.CM0_ has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.03 seconds
9
10 data want;
11 set cm0_;
12 cmtrt=kpropdata(cmtrt,' ','utf-8');
13 run;
NOTE: There were 3 observations read from the data set WORK.CM0_.
NOTE: The data set WORK.WANT has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
14
15 proc print data=want;
16 run;
NOTE: There were 3 observations read from the data set WORK.WANT.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
17
18 proc print data=want;
19 format cmtrt $hex70.;
20 run;
Results:
Obs cmtrt 1 Metformin Hydrochloride Tablets 2 Newdatarequired 3
Obs cmtrt 1 4D6574666F726D696E20487964726F63686C6F72696465205461626C65747320202020 2 4E65776461746172657175697265642020202020202020202020202020202020202020 3 2020202020202020202020202020202020202020202020202020202020202020202020
The two '20'x characters highlighted above in obs. 1 are the blanks that replaced the original 3-byte UTF-8 characters.
Meanwhile I found out that ' ' = 'E38080'x is indeed a space character in UTF-8, called "ideographic space" (code point U+3000) and used in the context of Chinese, Japanese and Korean languages.
References: https://unicode.org/charts/nameslist/n_3000.html and (for the conversion between code points like U+3000 and hexadecimal UTF-8 codes like E38080) https://en.wikipedia.org/wiki/UTF-8
Another way:
data T;
OLD = "Metformin Hydrochloride Tablets";
NEW = compbl(prxchange('s/[^a-zA-Z]/ /',-1,OLD));
put (_ALL_) (=/);
run;
OLD=Metformin Hydrochloride Tablets
NEW=Metformin Hydrochloride Tablets
Note that   is UTF-8 for IDEOGRAPHIC SPACE
When reading an UTF-8 string, you should use UTF-8 encoding.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.