Hi,
We are migrating SAS 9.4 from Windows to Linux.
For one of the DI jobs we see the difference below in the job's output. The data is pulled from a SQL Server instance running on Windows; we use SAS/ACCESS to ODBC to read it.
On Windows,
2/13/18
Sarah’s been running around like crazy; discussed the billboard next to Exit 90A, feature in the Sport & Leisure Research Group, and article in Forbes; she’s been busy putting simulators in the Mercedes Benz Stadium due to Arthur Blank’s dem
On Unix,
This is the result of taking a 250-character substring of the record. I think the garbage characters are caused by an encoding mismatch.
Is there a configuration setting that I am missing? The major concern here is data loss due to the encoding mismatch.
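The interaction between the 250-character substring and a multi-byte encoding can be reproduced directly. A minimal sketch in Python (not SAS, purely to illustrate the byte mechanics; the sample text is made up from the record above):

```python
# A string whose last visible character before the cut is a curly
# apostrophe (U+2019), which UTF-8 encodes as the 3 bytes E2 80 99.
text = "Arthur Blank\u2019s demands"
raw = text.encode("utf-8")

# "Arthur Blank" is 12 ASCII bytes; bytes 12..14 hold the apostrophe.
assert raw[12:15] == b"\xe2\x80\x99"

# Truncating the *bytes* mid-character (here, one byte into the 3-byte
# sequence) leaves an incomplete UTF-8 sequence behind.
cut = raw[:13]

# Decoding the truncated bytes as UTF-8 fails outright; decoding them
# as a single-byte encoding "succeeds" but shows garbage at the cut.
try:
    cut.decode("utf-8")
except UnicodeDecodeError:
    print("truncated mid-character")
print(cut.decode("latin-1"))   # the stray byte renders as 'â'
```

So a substring applied to byte lengths after a WLATIN1-to-UTF-8 conversion can both truncate data and leave partial-character garbage at the end, which matches the symptom described.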
Run PROC OPTIONS on both servers and compare encoding options. Are they the same or different?
It's WLATIN1 on Windows and LATIN1 on Red Hat.
On Red Hat we changed the encoding to UTF-8, but that did not help much.
It would be worth asking the SQL Server DBA what the encoding of the database is. Is it running on Windows? It would also be worth opening a SAS Tech Support track.
Yes, it's running on Windows.
Latin1_General_CI_AI is the collation on the table from which we are reading the data.
As @Kurt_Bremser observed, the fact that you end up with multiple garbled bytes per character clearly indicates that a conversion to UTF-8 must be happening somewhere along the line (it could just be for the WORK tables).
UTF-8 encoding for a right single quotation mark uses 3 bytes: 0xE2 0x80 0x99 (e28099)
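Those byte values can be verified directly. A quick Python check (not SAS, just to confirm the byte counts on both sides of the conversion):

```python
# U+2019 RIGHT SINGLE QUOTATION MARK
ch = "\u2019"

# In Windows-1252 (the code page behind WLATIN1) it is one byte, 0x92...
assert ch.encode("cp1252") == b"\x92"

# ...but in UTF-8 it takes three bytes: E2 80 99.
assert ch.encode("utf-8") == b"\xe2\x80\x99"
print(ch.encode("utf-8").hex())   # e28099
```

One character in the source thus becomes three bytes after a WLATIN1-to-UTF-8 conversion, so any fixed-byte length or substring applied afterwards will be off.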
Are the characters already garbled in the very first SAS table directly extracted from the DB or do they only get garbled when inserted to the target table?
Do you fully recreate the target table every single time or have you migrated it from your Windows environment? If migrated - did you use Proc Migrate or did you just copy the physical files? What's the encoding of your target table with the garbled characters?
Does the SAS log tell you anything about encoding changes/problems?
@Kurt_Bremser @Patrick The SQL Server table is converted into a csv file and saved on the Red Hat (compute) server. I think this is where it gets converted to UTF-8. echo $LANG on the Linux box gives en_US.UTF-8.
The target table is in Snowflake, but I think the characters are already garbled when they are written to the csv file on the Unix server.
We are using a FILE statement to write the csv on the Linux server. Below is the FILE statement.
file "/sasdata/Snowflake_temp/&sas_libname./&temp_table..csv" delimiter=',' DSD DROPOVER lrecl=32767 ENCODING='wlatin1';
We have tried both wlatin1 and latin1, but we still see the garbled characters in the csv.
When I run file -bi on the csv file, the output is text/plain; charset=unknown-8bit, which makes me think the ENCODING option on the FILE statement is not being picked up.
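If the WORK table is already UTF-8 by the time the FILE statement runs, one workaround is to transcode the finished CSV content before loading it. A Python sketch of the idea (not SAS; the sample line stands in for the job's real CSV data):

```python
# Transcode UTF-8 CSV content to Windows-1252 (the code page behind
# WLATIN1). errors="replace" turns any character that WLATIN1 cannot
# hold into '?', so out-of-range data is flagged rather than mangled.
utf8_bytes = "2/13/18,Arthur Blank\u2019s demands\n".encode("utf-8")

text = utf8_bytes.decode("utf-8")
cp1252_bytes = text.encode("cp1252", errors="replace")

# The 3-byte UTF-8 apostrophe collapses back to the single byte 0x92.
assert b"\xe2\x80\x99" in utf8_bytes
assert b"\x92" in cp1252_bytes
assert b"\xe2\x80\x99" not in cp1252_bytes
```

On Linux the same transcode can be done on the whole file with iconv before the S3 upload; either way, the fix only works if the bytes are valid UTF-8 to begin with, which the file -bi output suggests they currently are not for every record.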
Have you tried to simply create a WORK dataset from the SQL table, and look at its contents?
This is how the data looks.
I tried to use OUTENCODING option in the libname statement but it seems it is not supported in SAS/Access to ODBC.
26 LIBNAME srccrm_1 ODBC READ_LOCK_TYPE=NOLOCK READBUFF=1000 DATAsrc=LEGENDS SCHEMA=dbo USER=sassql
26 ! PASSWORD=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX OUTENCODING ='WLATIN1';
_________
278
NOTE: Libref SRCCRM_1 was successfully assigned as follows:
Engine: ODBC
Physical Name: LEGENDS
WARNING 278-63: The option OUTENCODING is not implemented in the ODBC engine.
Pull the data out of Snowflake with some other tool and check if that character is a normal ASCII single quote or one of those typography slanted quotes.
@Tom the character is Windows-1252 code 146 (hex 92), Unicode U+2019, UTF-8 bytes E2 80 99: ’ (Right Single Quotation Mark).
https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
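The classic symptom that table documents can be reproduced directly: the UTF-8 bytes for U+2019, misread through Windows-1252, produce the familiar three-character mojibake. A Python illustration (not SAS):

```python
# The UTF-8 bytes for U+2019 (right single quotation mark)...
raw = "\u2019".encode("utf-8")                        # b'\xe2\x80\x99'

# ...misinterpreted as Windows-1252 become three visible characters:
# 0xE2 -> 'â', 0x80 -> '€', 0x99 -> '™'.
mojibake = raw.decode("cp1252")
assert mojibake == "\u00e2\u20ac\u2122"               # 'â€™'
print(mojibake)
```

Seeing that exact pattern in the CSV is strong evidence the bytes are UTF-8 but are being labeled or read as a single-byte code page.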
@Kurt_Bremser @Patrick Is there any other way I can set the encoding in ODBC so that the data is written as WLATIN1?
Is this available in SAS/Access to SQL Server?
Did you change the database that you are querying in addition to changing your SAS server? Somehow the 3-byte UTF-8 sequence for the curly apostrophe got stuffed into your database in place of the simple ASCII single quote. You should check how that table was created in the database.
Anyway, it looks like Snowflake only supports Unicode (stored as UTF-8).
You might try transcoding the string after copying it from the database. But since WLATIN1/LATIN1 are single-byte encodings, they will NOT be able to represent every possible Unicode character that might appear in your database.
Try the UNICODE() function.
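That lossiness is easy to demonstrate. A Python sketch (not SAS; the sample value is invented to mix characters inside and outside the WLATIN1 code page):

```python
# é and the curly apostrophe fit in Windows-1252; ā (U+0101) does not.
value = "caf\u00e9 \u2019 \u0101"

# Strict transcoding raises on the first unrepresentable character...
try:
    value.encode("cp1252")
except UnicodeEncodeError as e:
    print("not representable:", repr(e.object[e.start]))

# ...while a lossy transcode silently substitutes '?' for it.
lossy = value.encode("cp1252", errors="replace")
assert lossy == b"caf\xe9 \x92 ?"
```

So any transcode back to WLATIN1 needs a decision up front about what should happen to characters the code page cannot hold.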
@Tom We are reading from SQL Server using SAS/ACCESS to ODBC. We extract the data into a temporary WORK table on the Linux server, and I think that is where the data gets converted from WLATIN1 to UTF-8. We then use this temporary WORK dataset to create a csv, move it to an S3 bucket, and load it into Snowflake.
We are looking for an option to keep the temporary dataset in the same encoding as the SQL Server data.