cbo
Fluorite | Level 6

Hello,

We are running into ANSI-to-UTF-8 encoding problems in our migration from SAS 9.2 to 9.4.
With the new encoding, the data shows the special character � in place of accented French characters (é, è, ê, ë, ...).

Other questions on this site mention using the `prxchange` or `tranwrd` function to fix this problem. While this works with regular encoding, it does not appear to work with the � character when the data is sourced from the production environment (into the work library). Can someone please advise me on how to fix this?

Here is a reproducible example that does work:

```
data df_mishap2 ;
    input var $40. ;
    datalines;
abcdef abcd�a ab abcde
XXXXXXXX
OOOOOO abcdefsdkgtre
;
run ;

data df_clean2 ;
    set df_mishap2 ;
    var2 = prxchange("s/�/e/i", -1, var) ; /* ok */
    var3 = tranwrd(var, "�", "e") ; /* ok */
run;
```

But oddly, when the same code is applied to a sample dataset sourced from production into the work library, it fails:
```
data work.test ;
    set libprod.table (keep = var obs = 100);
    var2 = var;
    var3 = var;
    var2 = prxchange("s/�/e/i", -1, var2) ;
    var3 = TRANWRD(var3, "�", "e") ;
run;
```

Tom
Super User

What encoding is your current SAS session using? Check the value of the ENCODING system option.

Are you reading from SAS datasets?  If so what encoding is the dataset using?  Check the PROC CONTENTS output.

Or text files? If so what encoding is the text file using? Does it have a BOM? What encoding did you tell SAS to use when reading it?

Or using in-line data in your programs, like in your example?
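
The first two checks above can be run with a few statements. This is a minimal sketch, not a definitive diagnostic: `libprod.table` is the dataset name used in the question, and the exact layout of the PROC CONTENTS output may differ between SAS releases.

```sas
/* Session encoding: write the value of the ENCODING system option to the log */
%put NOTE: Session encoding is %sysfunc(getoption(encoding));

/* Dataset encoding: look for the "Encoding" attribute in the output */
proc contents data=libprod.table;
run;
```

If the two encodings differ, SAS transcodes the data when it is read, which is one place a replacement character can come from.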

cbo
Fluorite | Level 6

Hi Tom,

Thank you for your answer.

1) My current SAS session is UTF-8 encoded.

2) All datasets are stored in production libraries which use "utf-8 Unicode (UTF-8)" encoding.

From what I understand, the issue occurs at the encoding level:

S ENTR�E DE GAMME /* what is displayed in SAS */
0012280: 5320 454e 5452 4545 2044 4520 4741 4d4d S ENTREE DE GAMM /* obtained with MobaXterm $ xxd table.sas7bdat | more */


Tom
Super User

So the data is wrong in the dataset then.  You should probably open a support ticket with SAS.

That diamond question mark is the character SAS substitutes for characters that cannot be transcoded.  The question is whether the transcoding error occurred when the dataset was created or when you are trying to read it.

Try reading the existing dataset with ENCODING=ANY and use the $HEX format to see what codes are actually stored in that location.
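
A minimal sketch of that check, assuming the dataset and variable names from the question (`libprod.table`, `var`); the `obs=10` limit and the `$hex64.` width are arbitrary choices for illustration:

```sas
/* Read without transcoding and dump the stored bytes to the log */
data _null_;
    set libprod.table (encoding=any keep=var obs=10);
    put var= $hex64.;   /* two hex digits per stored byte, first 32 bytes */
run;
```

Comparing the hex output with the displayed text shows which byte values sit behind each � on screen.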

cbo
Fluorite | Level 6

"You should probably open a support ticket with SAS."

That has been done; my supervisors are not too happy with their answer.

You are spot on with the use of the $HEX format:


```
/* For the sake of the argument, the hexadecimal values are entered directly to reproduce the error */
data df_mishap4 ;
    input var $hex64. ;
    datalines ;
5320454E5452E9452044452047414D4D452020202020202020202020
;
run;

/* The data displays a �, so check which byte causes it */
data test;
    set df_mishap4;
    if _n_ = 1 then do;
        put var $32.;
        put var $hex64.;
    end;
run;
```

which gives in the log:

S ENTR�E DE GAMME
 5320 454E 5452 E945 2044452047414D4D452020202020202020202020

That way I found that the problem byte here has the UTF-8 Unicode code "E9" (which is displayed as the symbol � instead of é).
Then I could apply a correction that replaces this code (matched with \xe9) with E:

```
data clean;
    set df_mishap4;
    var2 = prxchange("s/\xe9/E/i", -1, var) ;
run;
```

Now I just wish there was an automatic way to apply a set of such transformations (for all the other special characters) to every variable of a table.

Thank you for your help!


Tom
Super User (accepted solution)

That byte is not a valid UTF-8 code.  Looks like a LATIN1 code.

Try using the KCVT function to convert the strings to UTF-8 codes.  Make sure they are long enough, as some characters require up to 4 bytes to be represented in UTF-8.

```
data test;
  length have want $4 ;
  have = 'E9'x ;
  want = kcvt(have,'latin1','utf8');
  put (2*have 2*want) (+1 = $4. +1 $hex.) ;
run;
```

Results in a UTF-8 session:

[image: log output, not preserved]

Results in a LATIN1 session:

[image: log output, not preserved]
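
For the earlier wish to fix every variable of a table at once, the same KCVT call could be looped over all character variables with an array. This is only a sketch under stated assumptions: every character variable in `libprod.table` (the dataset name from the question) is assumed to hold LATIN1 bytes, the variables are assumed to be wide enough already for the converted text, and `work.fixed`, `_cv` and `_i` are hypothetical names. Values that already contain valid multi-byte UTF-8 characters would be converted a second time and corrupted, so it is worth checking a sample with $HEX first.

```sas
data work.fixed;
    set libprod.table;
    array _cv {*} _character_;   /* all character variables read from the input */
    do _i = 1 to dim(_cv);
        /* assumption: the stored bytes are LATIN1, as in the 'E9'x example */
        _cv{_i} = kcvt(_cv{_i}, 'latin1', 'utf8');
    end;
    drop _i;
run;
```

Because each converted value is assigned back to its original variable, anything longer than the declared length is truncated. Widening the variables with a LENGTH statement before the SET statement would avoid that.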

cbo
Fluorite | Level 6
Thank you for this handy advice! It seems that is the closest to a solution that built-in functions will get us.
