Removing non Unicode characters from a variable

Hello everyone,

The title might not be accurate, since I am not familiar with encoding, but here is my problem in simple words: I have a variable that holds a list of people's names. Some of these names are Spanish or French, so they contain characters which I believe are called "hexadecimal characters", such as an E with an accent above it or a lowercase i with an umlaut above it. (I don't know how to type them; some examples are in the attached picture.)

I want to convert all of them into regular characters, for example an E with dots into a plain E.

I thought the COMPRESS function would be the right tool, so first I tried to keep only the alphabetic characters, like this:

data test2;
   set test;
   names_translate = compress(name2,'','ka');
run;

Unfortunately it does not work, and those characters remain. I tried other modifiers, such as 'c' and 'w', but they do not give me what I want either. Is there a neat way to do this with the COMPRESS function, or with any other function? The picture below shows what I have and what I want to get as output.

[Image: Example]

Accepted Solutions
Solution
‎03-22-2017 03:22 PM
Super User
Posts: 11,343

Re: Removing non Unicode characters from a variable

Posted in reply to Shayan2012

The function you want is TRANSLATE. These characters are most likely "high-order ASCII" characters, that is, characters with codes from 128 to 255 in a single-byte extended character set such as Latin1; plain ASCII itself stops at 127.

The following data set may help you see them:

data work.highorderascii;
   do i= 127 to 255;
      char = byte(i);
      output;
   end;
run;
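
To make that list easier to read, the numeric code and its hex value can be shown next to each character. The step below is only a sketch: the data set name work.highorder_view and the variables code and hex are illustrative names, and it assumes the SAS session uses a single-byte encoding such as Latin1 or WLATIN1, where codes 192 to 255 hold the accented letters.

```sas
/* Illustrative variant of the step above: list codes 192-255 with hex values. */
/* Assumes a single-byte session encoding (for example Latin1 or WLATIN1).     */
data work.highorder_view;
   do code = 192 to 255;
      char = byte(code);          /* character stored at this code */
      hex  = put(char, $hex2.);   /* the same byte shown as two hex digits */
      output;
   end;
run;

proc print data=work.highorder_view noobs;
run;
```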

Here is an example using the TRANSLATE function that may work for you.

data example;
   x='Andrè';
   y=translate(x,'AAAAAAACEEEEIIIIDNOOOOO OUUUUY Saaaaaaaceeeeiiiidnooooo ouuuuy y',
                 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ');
run;

Each character in the first long string replaces the character at the same position in the second string, which is why I show them one above the other. The comparison is case sensitive, and I have used what I believe are the common replacements when converting to English; characters with no obvious equivalent, such as × and ÷, become blanks. If you need a different rule, the strings should be easy to adjust.
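
Applied to the data set from the question, the same call might look like the sketch below. It assumes the input data set is test and the variable is name2, as in the original post, and that the session encoding is single-byte. TRANSLATE is designed for single-byte data, so in a UTF-8 session, where each accented letter takes more than one byte, the two strings would likely no longer line up position by position. The variable leftover is a hypothetical name added only to check the result.

```sas
/* Write the session encoding to the log (for example wlatin1 or utf-8). */
%put Session encoding: &sysencoding;

data test2;
   set test;

   /* Replace each accented character with the plain character at the same position. */
   names_translate = translate(name2,
      'AAAAAAACEEEEIIIIDNOOOOO OUUUUY Saaaaaaaceeeeiiiidnooooo ouuuuy y',
      'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ');

   /* Hypothetical check: remove plain letters, blanks, hyphens, and apostrophes. */
   /* Whatever is left in leftover was not covered by the mapping above.           */
   leftover = compress(names_translate,
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz -'");
run;
```

Rows where leftover is not blank point to characters that still need a rule of their own.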

 


All Replies
Frequent Contributor
Posts: 75

Re: Removing non Unicode characters from a variable

Thanks a lot, ballardw. That is exactly what I was looking for!