Currently, I am facing special characters issue, how shall I replace these special char with 1 space.
data cm0_;
cmtrt = "Metformin Hydrochloride Tablets"; output;
cmtrt = "Newdatarequired";output;
cmtrt = "";output;
run;
Requirement: Metformin Hydrochloride Tablets
Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.
What do you consider "special"?
Perhaps anything that is not between a space and a tilde?
data test;
cmtrt = "Metformin Hydrochloride Tablets";
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;
Is that the only pattern in your dataset? If so, this works.
data want;
set cm0_;
cmtrt_2 = tranwrd(cmtrt, " ", " ");
run;
cmtrt cmtrt_2 Metformin Hydrochloride Tablets Metformin Hydrochloride Tablets Newdatarequired Newdatarequired
Hi, No this is hardcoding, I have other special chars also.
Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.
What do you consider "special"?
Perhaps anything that is not between a space and a tilde?
data test;
cmtrt = "Metformin Hydrochloride Tablets";
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;
This is working:
data test; cmtrt = "Metformin Hydrochloride Tablets"; want = compbl(translate(cmtrt,' ',collate(127,255))); put (_all_) (=/); run;
I want to know more about this: what is happening when 0,31 and 127,255 considered?
collate(0,31)||collate(127,255)
In a single byte encoding there are only 256 possible codes from 0 to 255. The COLLATE() function just makes it easy to generate a series of characters by there code.
The ASCII code for a space is 32. Any byte less than that is a control character, like a tab or a linefeed.
The ASCII code for a tilde is 126. 127 is a DELETE character. Anything from 128 to 255 has the 8th bit and so is not part of the normal 7-bit ASCII codes. That is where all of the accented characters and other strange glyphs live.
Hello @pdhokriya,
This works for your example in a SAS session using WLATIN1 encoding ...
data want;
set cm0_;
cmtrt=kpropdata(cmtrt,' ','utf-8');
run;
... and suggests that the ' ' string originally was a UTF-8 character of some sort. Therefore, the same solution might work for other "unprintable" UTF-8 characters as well. You can specify 'wlatin1' explicitly as the fourth argument if needed (see documentation of the KPROPDATA function).
@pdhokriya wrote:
Hi , Thank you for reply, this does not work.
What happened when you tried it?
See how it works on my workstation:
1 proc options option=encoding; 2 run; SAS (r) Proprietary Software Release 9.4 TS1M5 ENCODING=WLATIN1 Specifies the default character-set encoding for the SAS session. NOTE: PROCEDURE OPTIONS used (Total process time): real time 0.00 seconds cpu time 0.01 seconds 3 4 data cm0_; 5 cmtrt = "Metformin Hydrochloride Tablets"; output; 6 cmtrt = "Newdatarequired";output; 7 cmtrt = "";output; 8 run; NOTE: The data set WORK.CM0_ has 3 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.03 seconds 9 10 data want; 11 set cm0_; 12 cmtrt=kpropdata(cmtrt,' ','utf-8'); 13 run; NOTE: There were 3 observations read from the data set WORK.CM0_. NOTE: The data set WORK.WANT has 3 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 14 15 proc print data=want; 16 run; NOTE: There were 3 observations read from the data set WORK.WANT. NOTE: PROCEDURE PRINT used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 17 18 proc print data=want; 19 format cmtrt $hex70.; 20 run;
Results:
Obs cmtrt 1 Metformin Hydrochloride Tablets 2 Newdatarequired 3
Obs cmtrt 1 4D6574666F726D696E20487964726F63686C6F72696465205461626C65747320202020 2 4E65776461746172657175697265642020202020202020202020202020202020202020 3 2020202020202020202020202020202020202020202020202020202020202020202020
The two '20'x characters highlighted above in obs. 1 are the blanks that replaced the original 3-byte UTF-8 characters.
Meanwhile I found out that ' ' = 'E38080'x is indeed a space character in UTF-8, called "ideographic space" (code point U+3000) and used in the context of Chinese, Japanese and Korean languages.
References: https://unicode.org/charts/nameslist/n_3000.html and (for the conversion between code points like U+3000 and hexadecimal UTF-8 codes like E38080) https://en.wikipedia.org/wiki/UTF-8
Another way:
data T;
OLD = "Metformin Hydrochloride Tablets";
NEW = compbl(prxchange('s/[^a-zA-Z]/ /',-1,OLD));
put (_ALL_) (=/);
run;
OLD=Metformin Hydrochloride Tablets
NEW=Metformin Hydrochloride Tablets
Note that   is UTF-8 for IDEOGRAPHIC SPACE
When reading an UTF-8 string, you should use UTF-8 encoding.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.