Solved: Removing non Unicode characters from a variable

Shayan2012 · Posted 03-22-2017 10:48 AM

Hello Everyone,

The title might not be accurate since I am not familiar with encoding, but here is my problem in simple words: I have a variable that is a list of people's names. Apparently, some of these names are Spanish or French, so they contain characters which I believe are called "hexadecimal characters", such as an E with an accent above it, or a lowercase i with an umlaut above it. (I don't know how to type them; some examples are attached in the picture.)

I want to convert all of them into regular characters, for example, an E with dots above it into a plain E.

I thought the COMPRESS function would be the right tool, so first I tried to keep only the alphabetic characters, like this:

data test2;
   set test;
   names_translate = compress(name2,'','ka');
run;

Unfortunately, it does not work, and those characters remain. I played with other modifiers, such as 'c' or 'w', but those do not seem to give me what I want either. Is there a neat method with the COMPRESS function, or any other function, that gives me the desired result? In the picture below I have shown what I have and what I want to get as output.

ballardw · Posted 03-22-2017 11:16 AM

The function you want is TRANSLATE. The characters are more likely "high-order ASCII" or similar, that is, single-byte characters whose code values are greater than 126.

This data set, which lists those characters with their code values, may help:

data work.highorderascii;
   do i= 127 to 255;
      char = byte(i);
      output;
   end;
run;
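
To see which of those high-order characters actually occur in the names, something like the following sketch could be used. It assumes the data set TEST and the variable NAME2 from the question, and a single-byte session encoding (for example latin1 or wlatin1); the output data set name work.found_chars is made up for illustration.

```sas
/* Sketch: list every character in NAME2 whose code value is above 126. */
/* Assumes a single-byte session encoding, so one character = one byte. */
data work.found_chars;              /* hypothetical output data set name */
   set test;
   length char $ 1;
   do pos = 1 to length(name2);
      char = substr(name2, pos, 1); /* one character at a time           */
      code = rank(char);            /* position in the collating sequence */
      if code > 126 then output;    /* keep only high-order characters   */
   end;
   keep name2 pos char code;
run;

/* Count how often each high-order character appears. */
proc freq data=work.found_chars;
   tables char*code / list;
run;
```

The frequency table shows which characters need a replacement rule, which helps when deciding how to map them.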

Here is an example using the TRANSLATE function that may work for you.

data example;
   x='Andrè';
   y=translate(x,'AAAAAAACEEEEIIIIDNOOOOO OUUUUY Saaaaaaaceeeeiiiidnooooo ouuuuy y',
                 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ');
run;

Each character in the first long string replaces the character at the same position in the second string, which is why I show them one over the other above. The comparison is case sensitive, and I have used what I believe to be the common replacement for most of these characters when converting to English. If you need a different rule, the strings should be easy to adjust.
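
Applied to the data from the question, the same mapping might look like the sketch below. It reuses the data set TEST, the variable NAME2, and the output names from the original post, and it assumes a single-byte session encoding such as latin1 or wlatin1.

```sas
/* Sketch: apply the TRANSLATE mapping above to the poster's own data.   */
/* Assumes data set TEST with character variable NAME2 (from the         */
/* question) and a single-byte session encoding.                         */
data test2;
   set test;
   /* 2nd argument = replacement characters, 3rd = characters to replace; */
   /* both strings are 64 characters long and are matched by position.    */
   names_translate = translate(name2,
      'AAAAAAACEEEEIIIIDNOOOOO OUUUUY Saaaaaaaceeeeiiiidnooooo ouuuuy y',
      'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ');
run;
```

With this mapping the characters ×, Þ, ÷ and þ become blanks, and Æ and ß become a single A and S. In a UTF-8 session each accented character occupies more than one byte, so the byte-oriented TRANSLATE may not behave as shown here.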

Shayan2012 · Posted 03-22-2017 03:22 PM

Thanks a lot, ballardw. That is exactly what I was looking for!
