This is more of a curiosity question than a problem.
I have a table that gets created every night in our SAS Grid environment using a data step like the following:
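data lib.tabl;
     /* declare lengths before the variables are first read */
     length type $ 15;
     length descript $ 9;
     /* INFILE must precede INPUT so the comma delimiter applies to every record */
     infile datalines delimiter=',';
     input type $ descript $;
     datalines;
<<imagine datalines here>>
;
This code has been running for over a year with no issues.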
Recently our SAS environment switched from Latin1 to UTF-8 session encoding. I’ve noticed the table created by the above code still shows “latin1 Western (ISO)” as the Encoding scheme in PROC CONTENTS. I would have expected the encoding to change to UTF-8 once our environment session encoding was changed.
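For anyone who wants to make the same comparison, here is a minimal sketch. It assumes the libref lib is already assigned in the session. The first step writes the session encoding to the log, and the second shows the encoding stored in the table's header.

```sas
/* Session encoding: PROC OPTIONS writes the current value to the log */
proc options option=encoding;
run;

/* The same value is available in the automatic macro variable SYSENCODING */
%put NOTE: Session encoding is &sysencoding;

/* Data set encoding: look for the "Encoding" row in the attributes table */
proc contents data=lib.tabl;
run;
```

If the two values differ, the table was written with an encoding other than the session default.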
I’ve tried to reproduce this behavior by intentionally creating a table with Latin1 encoding and then replacing its contents with a data step such as the above, but the result is always a table with UTF-8 encoding.
Does anyone have any idea why this older table remains in Latin1 after the change to our session encoding? Again, more a curiosity question than a problem, as I have no need to store multi-byte characters in this table.
Thanks.
Ok. I *think* I've got a handle on this.
Earlier I said I had tried to reproduce the behavior by intentionally creating a table in latin1 and then trying to recreate it to see if it remained latin1. That wasn't exactly what I did. I actually uploaded a latin1 dataset that had been created on a Windows host, and then attempted to recreate it in a Linux environment. That reliably results in a new table being created with utf-8 encoding.
If I create a table in my Linux session with latin1 encoding by using the encoding= data set option, and then replace that table in a separate data step without specifying an encoding option, SAS recognizes the encoding of the pre-existing dataset and uses CEDA to transcode the data from utf-8 to latin1, recreating the table in its original latin1 encoding. Interestingly, the structure of the new dataset can be entirely different from that of the pre-existing dataset. The key seems to be the encoding attribute of the pre-existing dataset, and the fact that the pre-existing dataset was created using a data representation that is CEDA compatible.
I found the following CEDA documentation that seems to address this phenomenon:
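A rough sketch of the reproduction described above, assuming a UTF-8 session on Linux. The table name work.enc_demo and its variables are made up for illustration. The expected outcome (latin1 surviving the replacement) is what this post reports, not something the sketch itself guarantees.

```sas
/* Step 1: create the table with an explicit latin1 encoding */
data work.enc_demo (encoding=latin1);
    length type $ 15;
    type = 'first version';
run;

/* Step 2: replace it in a separate step, with no encoding option   */
/* and a deliberately different structure                            */
data work.enc_demo;
    length descript $ 9;
    descript = 'replaced';
    n = 1;
run;

/* Step 3: check the "Encoding" row to see which encoding the       */
/* replacement table ended up with                                   */
proc contents data=work.enc_demo;
run;
```

Deleting work.enc_demo between step 1 and step 2 should give a table in the session encoding instead, which makes for a useful comparison.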
SAS Help Center: SAS File Processing with CEDA
Thanks to everyone for their input.
Bob
If you haven't rerun the code to rebuild the table it would maintain the encoding it had when created. It isn't clear whether you have actually rerun that data step to create the table again.
Thanks for the reply.
The code runs every night as part of a scheduled batch job. PROC CONTENTS shows me that the table was created within the past 24 hours, and yet it remains Latin1. Could the system user that’s running the code be accessing SAS with a Latin1 session encoding?
I should probably just delete the table, in which case it would probably get created as UTF-8 on the next run, but it drives me nuts when the system is behaving in an unexpected manner and I can’t figure out why.
> Could the system user that’s running the code be accessing SAS with a Latin1 session encoding?
That's the first thing I'd look at: What's the configuration used by this batch job?
From what you describe it looks like "something" is overriding the session encoding. It could be the ENCODING= data set option or the OUTENCODING= LIBNAME option.
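To make that concrete, here is a sketch of what to look for in the batch job and of how the two options can set the output encoding explicitly. The library path and the sample variable values are placeholders, not taken from the actual job.

```sas
/* Show the encoding and the config file(s) in effect for this session; */
/* run this inside the batch job to see what it really uses             */
proc options option=(encoding config);
run;

/* Option A: OUTENCODING= on the LIBNAME statement applies to every     */
/* table written through the libref (the path is a placeholder)         */
libname lib '/path/to/library' outencoding='utf-8';

/* Option B: ENCODING= as a data set option applies to one output table */
data lib.tabl (encoding='utf-8');
    length type $ 15 descript $ 9;
    type = 'example';
    descript = 'row';
run;
```

If either option appears in the job with a latin1 value, that alone would explain the result in PROC CONTENTS.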
One of the side effects of the macro I present at SASGF21 is that it prevents such behavior. In our batch jobs, result tables are always removed physically (if they exist) before being written out.
It's just a 15-minute "quick tip" type session.
Since I do not have a grid available, it would be nice to know if it can be implemented there in a reasonable fashion (either by using FDELETE() or the external rm -f).
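As an illustration of the delete-before-write idea only (this is not the macro from the SASGF21 session, and it uses PROC DELETE rather than FDELETE() or rm -f), a minimal sketch might look like this. The macro name drop_if_exists is made up.

```sas
/* Hypothetical helper: remove a table if it exists, so the next step  */
/* creates a brand-new file instead of replacing an existing one       */
%macro drop_if_exists(ds);
    %if %sysfunc(exist(&ds)) %then %do;
        proc delete data=&ds;
        run;
    %end;
%mend drop_if_exists;

/* Usage: call it right before the step that writes the table */
%drop_if_exists(work.enc_demo)

data work.enc_demo;
    length type $ 15;
    type = 'fresh table';
run;
```

Because no pre-existing file is left to replace, the new table should be created with the session encoding unless an encoding option says otherwise.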
