Solved: Re: Replace special char by 1 space

pdhokriya · Posted 08-18-2021 11:35 AM

I am currently facing an issue with special characters. How can I replace these special characters with a single space?

data cm0_;
cmtrt = "Metforminã€€Hydrochlorideã€€Tablets"; output;
cmtrt = "Newdatarequired";output;
cmtrt = "";output;
run;

Requirement: Metformin Hydrochloride Tablets

Tom · Posted 08-18-2021 12:10 PM

Typically I just replace the strange things with spaces and then use COMPBL() to collapse the multiple spaces.

What do you consider "special"?

Perhaps anything that is not between a space and a tilde?

data test;
cmtrt = "Metforminã€€Hydrochlorideã€€Tablets"; 
want = compbl(translate(cmtrt,' ',collate(0,31)||collate(127,255)));
put (_all_) (=/);
run;

maguiremq · Posted 08-18-2021 11:52 AM

Is that the only pattern in your dataset? If so, this works.

data want;
  set cm0_;
  cmtrt_2 = tranwrd(cmtrt, "ã€€", " ");
run;

cmtrt                                  cmtrt_2
Metforminã€€Hydrochlorideã€€Tablets    Metformin Hydrochloride Tablets
Newdatarequired                        Newdatarequired

pdhokriya · Posted 08-18-2021 11:19 PM

Hi, no, this is hardcoding. I have other special characters as well.

pdhokriya · Posted 08-18-2021 11:24 PM

This is working:

data test;
cmtrt = "Metforminã€€Hydrochlorideã€€Tablets";
want = compbl(translate(cmtrt,' ',collate(127,255)));
put (_all_) (=/);
run;

I would like to understand this better: what happens when the ranges 0,31 and 127,255 are used?

collate(0,31)||collate(127,255)

Tom · Posted 08-18-2021 11:59 PM

In a single-byte encoding there are only 256 possible codes, from 0 to 255. The COLLATE() function just makes it easy to generate a series of characters by their codes.

The ASCII code for a space is 32. Any byte less than that is a control character, like a tab or a linefeed.

The ASCII code for a tilde is 126. 127 is the DELETE character. Anything from 128 to 255 has the 8th bit set and so is not part of the normal 7-bit ASCII codes. That is where all of the accented characters and other strange glyphs live.
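
To make the two ranges concrete, here is a minimal sketch, assuming a single-byte, ASCII-based session encoding such as WLATIN1 (the variable names are made up for illustration). It prints the codes of the two boundary characters and the first few bytes of each COLLATE() range in hex:

```sas
data _null_;
  length ctrl $32 high $129;
  ctrl = collate(0, 31);      /* 32 control characters, '00'x through '1F'x */
  high = collate(127, 255);   /* DEL ('7F'x) plus the 128 codes '80'x through 'FF'x */
  code_space = rank(' ');     /* 32 in an ASCII-based encoding */
  code_tilde = rank('~');     /* 126 in an ASCII-based encoding */
  put code_space= code_tilde=;
  put ctrl= $hex16.;          /* hex of the first 8 bytes of the control range */
  put high= $hex16.;          /* hex of the first 8 bytes of the high range */
run;
```

Codes 32 through 126 are left out of the "from" list passed to TRANSLATE(), so printable ASCII passes through unchanged. Because the "to" argument is a single blank and shorter than "from", the remaining "from" characters are also mapped to blanks.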

FreelanceReinh · Posted 08-18-2021 12:36 PM

Hello @pdhokriya,

This works for your example in a SAS session using WLATIN1 encoding ...

data want;
set cm0_;
cmtrt=kpropdata(cmtrt,' ','utf-8');
run;

... and suggests that the 'ã€€' string originally was a UTF-8 character of some sort. Therefore, the same solution might work for other "unprintable" UTF-8 characters as well. You can specify 'wlatin1' explicitly as the fourth argument if needed (see documentation of the KPROPDATA function).

pdhokriya · Posted 08-18-2021 11:27 PM

Hi, thank you for the reply, but this does not work.

FreelanceReinh · Posted 08-19-2021 04:16 AM

@pdhokriya wrote:
Hi, thank you for the reply, but this does not work.

What happened when you tried it?

See how it works on my workstation:

1    proc options option=encoding;
2    run;

    SAS (r) Proprietary Software Release 9.4  TS1M5

 ENCODING=WLATIN1  Specifies the default character-set encoding for the SAS session.
NOTE: PROCEDURE OPTIONS used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


3
4    data cm0_;
5    cmtrt = "Metforminã€€Hydrochlorideã€€Tablets"; output;
6    cmtrt = "Newdatarequired";output;
7    cmtrt = "";output;
8    run;

NOTE: The data set WORK.CM0_ has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds


9
10   data want;
11   set cm0_;
12   cmtrt=kpropdata(cmtrt,' ','utf-8');
13   run;

NOTE: There were 3 observations read from the data set WORK.CM0_.
NOTE: The data set WORK.WANT has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


14
15   proc print data=want;
16   run;

NOTE: There were 3 observations read from the data set WORK.WANT.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


17
18   proc print data=want;
19   format cmtrt $hex70.;
20   run;

Results:

Obs    cmtrt

 1     Metformin Hydrochloride Tablets
 2     Newdatarequired
 3

Obs    cmtrt

 1     4D6574666F726D696E20487964726F63686C6F72696465205461626C65747320202020
 2     4E65776461746172657175697265642020202020202020202020202020202020202020
 3     2020202020202020202020202020202020202020202020202020202020202020202020

The two '20'x characters in obs. 1 that follow the hex codes for "Metformin" and "Hydrochloride" are the blanks that replaced the original 3-byte UTF-8 characters.

Meanwhile I found out that 'ã€€' = 'E38080'x is indeed a space character in UTF-8, called "ideographic space" (code point U+3000) and used in the context of Chinese, Japanese and Korean languages.

References: https://unicode.org/charts/nameslist/n_3000.html and (for the conversion between code points like U+3000 and hexadecimal UTF-8 codes like E38080) https://en.wikipedia.org/wiki/UTF-8
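
The conversion from U+3000 to 'E38080'x can be worked through with SAS bitwise functions. This is only an illustrative sketch (the variable names cp, b1, b2 and b3 are made up here), and it assumes the standard three-byte UTF-8 layout, which applies to code points from U+0800 to U+FFFF:

```sas
data _null_;
  cp = 03000x;                                 /* code point U+3000 as a numeric hex constant */
  /* Three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx */
  b1 = bor(0E0x, brshift(cp, 12));             /* lead byte: marker 1110 plus the top 4 bits */
  b2 = bor(080x, band(brshift(cp, 6), 03Fx));  /* continuation byte: middle 6 bits */
  b3 = bor(080x, band(cp, 03Fx));              /* continuation byte: low 6 bits */
  put (b1 b2 b3) (hex2.);                      /* writes E38080 to the log */
run;
```

Read as three separate WLATIN1 characters, these bytes display as 'ã€€', which is why the string looks garbled in a WLATIN1 session.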

ChrisNZ · Posted 08-19-2021 06:21 AM

Another way:

data T;
  OLD = "Metforminã€€Hydrochlorideã€€Tablets";
  NEW = compbl(prxchange('s/[^a-zA-Z]/ /',-1,OLD));
  put (_ALL_) (=/);
run;

OLD=Metforminã€€Hydrochlorideã€€Tablets
NEW=Metformin Hydrochloride Tablets
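
Note that [^a-zA-Z] also blanks out digits and punctuation. If those should survive, one possible variant (a sketch, not part of the original post) keeps every printable ASCII character, that is, everything from space ('20'x) to tilde ('7E'x), and replaces each run of other bytes with one space. It assumes a session with a single-byte encoding such as WLATIN1, where the ideographic space is stored as the three bytes 'E38080'x. The data set name T2 and the sample string are illustrative only:

```sas
data T2;
  /* Illustrative value: two ideographic spaces written as raw UTF-8 bytes */
  OLD = "Metformin" || 'E38080'x || "500mg" || 'E38080'x || "Tablets";
  /* The + makes a whole run of non-printable-ASCII bytes collapse into one space */
  NEW = compbl(prxchange('s/[^\x20-\x7E]+/ /', -1, OLD));
  put (_ALL_) (=/);
run;
```

With this pattern the digits in "500mg" are kept, and NEW should come out as "Metformin 500mg Tablets".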

Note that ã€€ is UTF-8 for IDEOGRAPHIC SPACE.

When reading a UTF-8 string, you should use UTF-8 encoding.

pdhokriya · Posted 08-25-2021 07:07 AM

Thank you for your input.
