Solved: Re: is there any code to replace the accented characters with non-acce...

Alexxxxxxx · Posted 03-01-2019 10:07 AM

Dear all,

Is there any code to replace accented characters with non-accented characters in a variable? For example, 'ü' becomes 'ue', or 'é' becomes 'e'.

Thanks in advance.

ballardw · Posted 03-01-2019 01:05 PM

If specific letters are the concern TRANSLATE will work:

data example;
   x='abcdé';
   y=translate(x,'e','é');
run;

If you don't know all the likely culprits, BASECHAR may work, but I'm not sure it can do the umlaut-to-2-character replacement you want.

FreelanceReinh · Posted 03-01-2019 01:57 PM

Hi @Alexxxxxxx,

For the 2-character replacements you can use TRANWRD. Unlike TRANSLATE it doesn't allow for multiple "from-to" pairs in the same function call, so you may want to use a loop.

Example:

data have;
length c $20;
input c;
cards;
Ägypten
Österreich
äußerst
Übung
müßig
Römer
;

data want;
array f[7] $1 _temporary_ ('ä'  'ö'  'ü'  'ß'  'Ä'  'Ö'  'Ü' );
array t[7] $2 _temporary_ ('ae' 'oe' 'ue' 'ss' 'Ae' 'Oe' 'Ue');
set have;
do _n_=1 to dim(f);
  c=tranwrd(c, f[_n_], t[_n_]);
end;
run;

Make sure that the length of the target variable is sufficient to accommodate the strings after the replacement(s).
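
The length advice above can be sketched as follows. This is only an illustration: the variable name d and the length $40 are assumptions (twice the $20 of c, the worst case if every character were replaced by two), and the $1 array elements assume a single-byte session encoding such as Latin1.

```sas
data want;
  array f[7] $1 _temporary_ ('ä'  'ö'  'ü'  'ß'  'Ä'  'Ö'  'Ü' );
  array t[7] $2 _temporary_ ('ae' 'oe' 'ue' 'ss' 'Ae' 'Oe' 'Ue');
  set have;
  length d $40;   /* hypothetical result variable, wide enough for doubled characters */
  d = c;          /* keep the original c untouched */
  do _n_ = 1 to dim(f);
    d = tranwrd(d, f[_n_], t[_n_]);
  end;
run;
```

Writing to a separate, wider variable also keeps the original values available for comparison.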

Alexxxxxxx · Posted 03-01-2019 02:26 PM

Dear FreelanceReinhard,

Thanks for your helpful advice.

Can I use this code for 1-character replacements as well? For example, 'à' becomes 'a'?

FreelanceReinh · Posted 03-01-2019 02:49 PM

@Alexxxxxxx wrote:

Dear FreelanceReinhard,

Thanks for your helpful advice.

Can I use this code for 1-character replacements as well? For example, 'à' becomes 'a'?

Of course, the target values in TRANWRD can be single characters as well, but if you add them to the existing $2 array, you'll insert an unwanted trailing blank into the target string. Therefore, I think the code using TRANSLATE (with multiple "from-to" pairs) will be shorter, e.g. (incomplete example)

c=translate(c,'aceee','àçéêè');
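
The trailing-blank effect mentioned above can be seen in a small sketch. The sample string and the variable names padded and trimmed are made up for illustration, and a single-byte session encoding (e.g. Latin1) is assumed. Wrapping the replacement in TRIM is one way to avoid the extra blank.

```sas
data _null_;
  array f[2] $1 _temporary_ ('ü'  'à');
  array t[2] $2 _temporary_ ('ue' 'a');   /* 'a' is stored padded as 'a ' */
  length padded trimmed $20;
  padded  = 'voilà!';
  trimmed = 'voilà!';
  do _n_ = 1 to dim(f);
    /* replacement used as stored, including its trailing blank */
    padded  = tranwrd(padded,  f[_n_], t[_n_]);
    /* TRIM removes the trailing blank before the replacement */
    trimmed = tranwrd(trimmed, f[_n_], trim(t[_n_]));
  end;
  put padded= trimmed=;   /* compare the two results in the log */
run;
```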

Alexxxxxxx · Posted 03-01-2019 05:09 PM

So, can I use the following code?

data want;
array f[9] $1 _temporary_ ('Ä'  'Æ'  'Ö'  'Ü'  'ß'  'ä'  'æ'  'ö'  'ü');
array t[9] $2 _temporary_ ('AE' 'AE' 'OE' 'UE' 'SS' 'ae' 'ae' 'oe' 'ue');
set have;
do _n_=1 to dim(f);
  c=tranwrd(c, f[_n_], t[_n_]);
  c=translate(c,'AAAAACEEEEIIIIDNOOOOOUUUYaaaaaceeeeiiiionooooouuuyy','ÀÁÂÃÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕØÙÚÛÝàáâãåçèéêëìíîïðñòóôõøùúûýÿ');
end;
run;

PGStats · Posted 03-01-2019 06:03 PM

This would be a bit more efficient:

data want;
set have;
c = tranwrd(c, 'ß', 'SS');
c = prxChange("s/([ÄÆÖÜ])/\1E/o", -1, c);
c = prxChange("s/([äæöü])/\1e/o", -1, c);
*c = basechar(c);
c = translate(c,
    'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaceeeeiiiionoooooouuuuyy',
    'ÀÁÂÃÅÄÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕØÖÙÚÛÜÝàáâãåäæçèéêëìíîïðñòóôõøöùúûüýÿ');
run;

PG

Alexxxxxxx · Posted 03-02-2019 09:02 AM

Dear PG,

Thanks for your advice.

However, when I run the following code, I get the output shown below it:

data have;
length c $20;
input c;
cards;
Ägypten
Österreich
äußerst
Übung
müßig
Römer
Pierre
Étirer
Jean-Pierre
Ägypten
Österreich
äußerst
Übung
müßig
Römer
;
run;

data want;
set have;
c = tranwrd(c, 'ß', 'SS');
c = prxChange("s/([ÄÆÖÜ])/\1E/o", -1, c);
c = prxChange("s/([äæöü])/\1e/o", -1, c);
*c = basechar(c);
c = translate(c,'AAAAACEEEEIIIIDNOOOOOUUUYaaaaaceeeeiiiidnooooouuuyy','ÀÁÂÃÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕØÙÚÛÝàáâãåçèéêëìíîïðñòóôõøùúûýÿ');

run;

The SAS System

c
ÄEgypten
ÖEsterreich
äeuSSerst
ÜEbung
müeSSig
Röemer
Pierre
Etirer
Jean-Pierre
ÄEgypten
ÖEsterreich
äeuSSerst
ÜEbung
müeSSig
Röemer
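
A likely reason for this output: the two PRXCHANGE calls keep the matched letter (\1) and only append E or e, relying on the later TRANSLATE call to strip the accent. The TRANSLATE lists in the code above no longer contain Ä, Æ, Ö, Ü, ä, æ, ö, ü, so those letters survive. A minimal sketch of the two steps, assuming a single-byte session encoding such as Latin1 (in a UTF-8 session these byte-oriented calls may behave differently):

```sas
data _null_;
  length c $20;
  c = 'Ägypten';
  /* Step 1: keep the matched letter (\1) and append E */
  c = prxchange('s/([ÄÆÖÜ])/\1E/o', -1, c);
  put 'after PRXCHANGE: ' c;
  /* Step 2: map the kept letters to base letters;
     this only works if they appear in the from-list */
  c = translate(c, 'AAOU', 'ÄÆÖÜ');
  put 'after TRANSLATE: ' c;
run;
```

Restoring the umlaut letters in both TRANSLATE arguments, as in the originally posted code, should remove the leftover accented characters.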

FreelanceReinh · Posted 03-01-2019 06:04 PM

@Alexxxxxxx wrote:

So, can I use the following code?

data want;
array f[9] $1 _temporary_ ('Ä'  'Æ'  'Ö'  'Ü'  'ß'  'ä'  'æ'  'ö'  'ü');
array t[9] $2 _temporary_ ('AE' 'AE' 'OE' 'UE' 'SS' 'ae' 'ae' 'oe' 'ue');
set have;
do _n_=1 to dim(f);
  c=tranwrd(c, f[_n_], t[_n_]);
  c=translate(c,'AAAAACEEEEIIIIDNOOOOOUUUYaaaaaceeeeiiiionooooouuuyy','ÀÁÂÃÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕØÙÚÛÝàáâãåçèéêëìíîïðñòóôõøùúûýÿ');
end;
run;

The TRANSLATE call doesn't need to be repeated (nine times). It should occur outside of the DO loop.
The capital double S is not an ideal replacement for 'ß' except in words written in capitals (which rarely contain an 'ß').
Similarly, "Aegypten", "Oesterreich" etc. would be preferable to "AEgypten", "OEsterreich" etc. -- if the results are used for text output (like a report).
Most of the 1-character replacements could be accomplished with the elegant BASECHAR function, which ballardw and PGStats have suggested. The few exceptions might be questionable in your code anyway. For example, why should 'ð' (a kind of 'd' I think) be replaced by 'o'?
Of course, character variable C must be in dataset HAVE, with sufficient length.
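
Putting the first three points together, the posted step might be restructured as sketched below. The mixed-case replacements and the shortened TRANSLATE list are illustrative choices, not a complete mapping, and the $1 array elements assume a single-byte session encoding such as Latin1.

```sas
data want;
  array f[9] $1 _temporary_ ('Ä'  'Æ'  'Ö'  'Ü'  'ß'  'ä'  'æ'  'ö'  'ü');
  array t[9] $2 _temporary_ ('Ae' 'Ae' 'Oe' 'Ue' 'ss' 'ae' 'ae' 'oe' 'ue');
  set have;
  /* 2-character replacements: one TRANWRD call per pair */
  do _n_ = 1 to dim(f);
    c = tranwrd(c, f[_n_], t[_n_]);
  end;
  /* 1-character replacements: a single TRANSLATE call, outside the loop
     (incomplete example list) */
  c = translate(c, 'AAAAACEEEEaaaaaceeee', 'ÀÁÂÃÅÇÈÉÊËàáâãåçèéêë');
run;
```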

PGStats · Posted 03-01-2019 10:41 PM

So, to summarize the comments above, reasonable code could be simplified to:

data want;
set have;
c = tranwrd(c, 'ß', 'ss');
c = prxChange("s/([äæöüÄÆÖÜ])/\1e/o", -1, c);
c = basechar(c);
run;

Or

data want;
set have;
c = basechar(prxChange("s/([äæöüÄÆÖÜ])/\1e/o", -1, tranwrd(c, 'ß', 'ss')));
run;

PG

Tom · Posted 03-02-2019 12:54 PM

This is probably the easiest solution to get working, although it requires that you create the list of mappings.

Note: make sure the temporary array elements are long enough to hold the Unicode representations of the strings. Many characters could take up to 4 bytes.

data have;
  input c $40.;
cards;
Ägypten
Österreich
äußerst
Übung
müßig
Römer
Pierre
Étirer
Jean-Pierre
Ägypten
Österreich
äußerst
Übung
müßig
Römer
;

data want;
  array f[8] $4 _temporary_ ('ä'  'ö'  'ü'  'ß'  'Ä'  'Ö'  'Ü'  'É');
  array t[8] $4 _temporary_ ('ae' 'oe' 'ue' 'ss' 'Ae' 'Oe' 'Ue' 'E');
  set have;
  d=c;
  do _n_=1 to dim(f);
    d=tranwrd(d, trim(f[_n_]), trim(t[_n_]));
  end;
run;

proc print;
run;

You could also put your translation pairs into a dataset instead of placing it in the code.

data translate;
  array f[8] $4 _temporary_ ('ä'  'ö'  'ü'  'ß'  'Ä'  'Ö'  'Ü'  'É');
  array t[8] $4 _temporary_ ('ae' 'oe' 'ue' 'ss' 'Ae' 'Oe' 'Ue' 'E');
  do _n_=1 to dim(f);
     from=f[_n_];
     to=t[_n_];
     output;
  end;
run;

data want;
  set have;
  d=c;
  do p=1 to nobs;
    set translate point=p nobs=nobs;
    d=tranwrd(d, trim(from), trim(to));
  end;
  drop from to;
run;
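
The byte-length note above can be checked in your own session with a small sketch like the following. The variable names are made up for illustration. LENGTH counts bytes and KLENGTH counts characters, so the two differ when a character occupies more than one byte (as is typically the case for 'ä' in a UTF-8 session).

```sas
data _null_;
  length s $8;
  s = 'ä';
  bytes = length(s);    /* number of bytes, ignoring trailing blanks */
  chars = klength(s);   /* number of characters */
  put bytes= chars=;
run;
```

If bytes is larger than chars, a $1 array element would be too short to hold the character.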

PGStats · Posted 03-01-2019 03:09 PM

If you have access to NLS (I don't know whether it still requires a separate licence), you can use the BASECHAR function. Try:

data _null_;
input mot :$12.;
key = basechar(mot);
put key;
datalines;
Pierre
Étirer
Jean-Pierre
Ägypten
Österreich
äußerst
Übung
müßig
Römer
;

PG

Patrick · Posted 03-01-2019 07:36 PM

@PGStats

That's a nice function. I believe NLS comes as part of Foundation SAS.

Looking at the result: it does the 1:1 translation, but it doesn't really work for a German umlaut. Correctly, an 'Ä' should get converted into 'Ae'.
