BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
pdhokriya
Pyrite | Level 9

Currently, I am facing special characters issue, how shall I replace these special char with 1 space.

 

data cm0_;
cmtrt = "Metformin Hydrochloride Tablets"; output;
cmtrt = "Newdatarequired";output;
cmtrt = "";output;
run;

 

Requirement: Metformin Hydrochloride Tablets

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.

What do you consider "special"?

Perhaps anything that is not between a space and a tilde?

data test;
cmtrt = "Metformin Hydrochloride Tablets"; 
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;

 

View solution in original post

10 REPLIES 10
maguiremq
SAS Super FREQ

Is that the only pattern in your dataset? If so, this works.

data want;
	set cm0_;
		cmtrt_2 = tranwrd(cmtrt, " ", " ");
run;
cmtrt 	                                                cmtrt_2
Metformin Hydrochloride Tablets 	Metformin Hydrochloride Tablets
Newdatarequired 	                                Newdatarequired
  	 
pdhokriya
Pyrite | Level 9

Hi, No this is hardcoding, I have other special chars also.

Tom
Super User Tom
Super User

Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.

What do you consider "special"?

Perhaps anything that is not between a space and a tilde?

data test;
cmtrt = "Metformin Hydrochloride Tablets"; 
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;

 

pdhokriya
Pyrite | Level 9

This is working:


data test; cmtrt = "Metformin Hydrochloride Tablets"; want = compbl(translate(cmtrt,' ',collate(127,255))); put (_all_) (=/); run;

 

I want to know more about this: what is happening when 0,31 and 127,255 considered?

 

collate(0,31)||collate(127,255)

Tom
Super User Tom
Super User

In a single byte encoding there are only 256 possible codes from 0 to 255. The COLLATE() function just makes it easy to generate a series of characters by there code.

 

The ASCII code for a space is 32.  Any byte less than that is a control character, like a tab or a linefeed.

The ASCII code for a tilde is 126.  127 is a DELETE character.  Anything from 128 to 255 has the 8th bit and so is not part of the normal 7-bit ASCII codes.  That is where all of the accented characters and other strange glyphs live.

 

FreelanceReinh
Jade | Level 19

Hello @pdhokriya,

 

This works for your example in a SAS session using WLATIN1 encoding  ...

data want;
set cm0_;
cmtrt=kpropdata(cmtrt,' ','utf-8');
run;

... and suggests that the ' ' string originally was a UTF-8 character of some sort. Therefore, the same solution might work for other "unprintable" UTF-8 characters as well. You can specify 'wlatin1' explicitly as the fourth argument if needed (see documentation of the KPROPDATA function).

pdhokriya
Pyrite | Level 9
Hi , Thank you for reply, this does not work.
FreelanceReinh
Jade | Level 19

@pdhokriya wrote:
Hi , Thank you for reply, this does not work.

What happened when you tried it?

 

See how it works on my workstation:

1    proc options option=encoding;
2    run;

    SAS (r) Proprietary Software Release 9.4  TS1M5

 ENCODING=WLATIN1  Specifies the default character-set encoding for the SAS session.
NOTE: PROCEDURE OPTIONS used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


3
4    data cm0_;
5    cmtrt = "Metformin Hydrochloride Tablets"; output;
6    cmtrt = "Newdatarequired";output;
7    cmtrt = "";output;
8    run;

NOTE: The data set WORK.CM0_ has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds


9
10   data want;
11   set cm0_;
12   cmtrt=kpropdata(cmtrt,' ','utf-8');
13   run;

NOTE: There were 3 observations read from the data set WORK.CM0_.
NOTE: The data set WORK.WANT has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


14
15   proc print data=want;
16   run;

NOTE: There were 3 observations read from the data set WORK.WANT.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


17
18   proc print data=want;
19   format cmtrt $hex70.;
20   run;

Results:

Obs    cmtrt

 1     Metformin Hydrochloride Tablets
 2     Newdatarequired
 3
Obs    cmtrt

 1     4D6574666F726D696E20487964726F63686C6F72696465205461626C65747320202020
 2     4E65776461746172657175697265642020202020202020202020202020202020202020
 3     2020202020202020202020202020202020202020202020202020202020202020202020

The two '20'x characters highlighted above in obs. 1 are the blanks that replaced the original 3-byte UTF-8 characters.

 

Meanwhile I found out that ' ' = 'E38080'x is indeed a space character in UTF-8, called "ideographic space" (code point U+3000) and used in the context of Chinese, Japanese and Korean languages.

References: https://unicode.org/charts/nameslist/n_3000.html and (for the conversion between code points like U+3000 and hexadecimal UTF-8 codes like E38080) https://en.wikipedia.org/wiki/UTF-8

ChrisNZ
Tourmaline | Level 20

Another way:

data T;
  OLD = "Metformin Hydrochloride Tablets";
  NEW = compbl(prxchange('s/[^a-zA-Z]/ /',-1,OLD));
  put (_ALL_) (=/);
run;

OLD=Metformin Hydrochloride Tablets
NEW=Metformin Hydrochloride Tablets


Note that ã€€ is UTF-8 for IDEOGRAPHIC SPACE

When reading an UTF-8 string, you should use UTF-8 encoding.

pdhokriya
Pyrite | Level 9
THank you for ur input

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 3252 views
  • 8 likes
  • 5 in conversation