Solved: Table created in UTF-8 session still Latin1

rtbuttram · Posted 04-23-2021 04:26 PM

This is more of a curiosity question than a problem.

I have a table that gets created every night in our SAS Grid environment using a data step like the following:

data lib.tabl;
     length type $ 15;
     length descript $ 9;
     /* INFILE is placed before INPUT so the delimiter is set before any record is read */
     infile datalines delimiter=',';
     input type $ descript $;
     datalines;
<<imagine datalines here>>
;

This code has been running for over a year with no issues.

Recently our SAS environment switched from Latin1 to UTF-8 session encoding. I’ve noticed the table created by the above code still shows “latin1 Western (ISO)” as the Encoding scheme in PROC CONTENTS. I would have expected the encoding to change to UTF-8 once our environment session encoding was changed.
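
For reference, the session encoding and the encoding stored with the table can be compared using a few standard statements. This is a minimal sketch; it assumes the libref lib from the data step above is already assigned in the session where it runs.

```sas
/* Show the encoding of the current SAS session */
proc options option=encoding;
run;

/* Write the same value to the log through a macro call */
%put NOTE: Session encoding is %sysfunc(getoption(encoding));

/* The Encoding row in the attributes output shows the stored encoding of the table */
proc contents data=lib.tabl;
run;
```

Running the first two statements inside the nightly batch job itself, rather than in an interactive session, would show which encoding that job actually starts with.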

I’ve tried to reproduce this behavior by intentionally creating a table with Latin1 encoding and then replacing its contents with a data step such as the above, but the result is always a table with UTF-8 encoding.

Does anyone have any idea why this older table remains in Latin1 after the change to our session encoding? Again, more a curiosity question than a problem, as I have no need to store multi-byte characters in this table.

Thanks.

ballardw · Posted 04-23-2021 05:14 PM

If you haven't rerun the code to rebuild the table it would maintain the encoding it had when created. It isn't clear whether you have actually rerun that data step to create the table again.

rtbuttram · Posted 04-23-2021 05:41 PM

Thanks for the reply.

The code runs every night as part of a scheduled batch job. PROC CONTENTS shows me that the table was created within the past 24 hours, and yet it remains Latin1. Could the system user that’s running the code be accessing SAS with a Latin1 session encoding?

I should probably just delete the table, in which case it would probably get created as UTF-8 on the next run, but it drives me nuts when the system is behaving in an unexpected manner and I can’t figure out why.

ChrisNZ · Posted 04-23-2021 10:38 PM

> Could the system user that’s running the code be accessing SAS with a Latin1 session encoding?

That's the first thing I'd look at: What's the configuration used by this batch job?

Patrick · Posted 04-23-2021 09:14 PM

From what you describe, it looks like "something" is overriding the session encoding. It could be the ENCODING= data set option or the OUTENCODING= option on the LIBNAME statement.
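
The two overrides mentioned here could look roughly like the following. This is a sketch only: the libref demo, the path /path/to/data, and the variable x are hypothetical placeholders, and the batch job's real LIBNAME statement may look different.

```sas
/* Library-level override: data sets created through this libref are
   written as latin1 instead of the session encoding.
   Libref and path are placeholders. */
libname demo '/path/to/data' outencoding='latin1';

/* Data set-level override on a single output table */
data demo.tabl(encoding='latin1');
     x = 'abc';
run;
```

Searching the job's code and autoexec for either option is a quick way to rule this explanation in or out.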

Ksharp · Posted 04-24-2021 06:17 AM

Yeah, I've run into the same problem too. Try this option:

data lib.tabl(encoding='utf8') ;

rtbuttram · Posted 04-24-2021 11:05 AM

Ok. I *think* I've got a handle on this.

Earlier I said I had tried to reproduce the behavior by intentionally creating a table in latin1 and then trying to recreate it to see if it remained latin1. That wasn't exactly what I did. I actually uploaded a latin1 dataset that had been created on a Windows host, and then attempted to recreate it in a Linux environment. That reliably results in a new table being created with utf-8 encoding.

If I create a table in my Linux session with latin1 encoding by using the ENCODING= data set option, and then replace that table in a separate data step without specifying an encoding option, SAS recognizes the encoding of the pre-existing dataset and uses CEDA to transcode the data from utf-8 to latin1, recreating the table in its original latin1 encoding. Interestingly, the structure of the new dataset can be entirely different from that of the pre-existing dataset. The key seems to be the encoding attribute of the pre-existing dataset, and the fact that the pre-existing dataset was created using a data representation that is CEDA compatible.
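
The test described above can be sketched as follows, run in a UTF-8 session. The data set name work.enc_demo and its variables are made up for illustration, and the latin1 result in the last step is what was observed in this thread, not something guaranteed on every release or host.

```sas
/* Step 1: create a table that is explicitly stored as latin1 */
data work.enc_demo(encoding='latin1');
     length type $ 15;
     type = 'first version';
run;

/* Step 2: replace it with a differently structured table,
   without any ENCODING= option */
data work.enc_demo;
     length descript $ 9;
     descript = 'second';
     n = 1;
run;

/* Step 3: inspect the Encoding attribute of the replaced table */
proc contents data=work.enc_demo;
run;
```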

I found the following CEDA documentation that seems to address this phenomenon:

SAS Help Center: SAS File Processing with CEDA

Thanks to everyone for their input.

Bob

Kurt_Bremser · Posted 04-24-2021 12:12 PM

One of the side effects of the macro I present at SASGF21 is that it prevents such behavior. In our batch jobs, result tables are always removed physically (if they exist) before being written out.
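
The macro itself is not shown in this thread, so the following is only a minimal sketch of the same idea: remove the existing table before the nightly step rebuilds it, so that the new file should be created with the current session encoding. It assumes the libref lib is already assigned.

```sas
/* Delete the old table if it exists; NOWARN avoids an error when it does not */
proc datasets library=lib nolist nowarn;
     delete tabl;
quit;
```

Placed directly before the data step that creates lib.tabl, this makes each run start without a pre-existing file.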

rtbuttram · Posted 04-24-2021 07:41 PM

Thanks Kurt. I’ll be sure to check out your session.

Regards.
Bob

Kurt_Bremser · Posted 04-25-2021 04:10 AM

It's just a 15-minute "quick tip" type session.

Since I do not have a grid available, it would be nice to know if it can be implemented there in a reasonable fashion (either by using FDELETE() or the external rm -f).
