Re: Encoding Issue

OS2Rules · Posted 11-14-2014 03:50 PM

Hi All:

I have a funny little problem and I'm wondering if anyone has an answer.

We are in the midst of converting to SAS 9.3 (from 9.1), so we are doing a bunch of parallel runs to prove everything is the same, and there is 1 odd little problem.

(Our SAS 9.1 system is also 32 bit and the SAS 9.3 system is 64 bit.)

There is one table that has 17 variables and 350,000+ records in it. When we compare this table after processing on each system there is 1 record difference, and only 1 variable. On the 9.1 version it contains a hex value of 'E5' and on the 9.3 system it has hex '3F'. It is the only byte in the entire file that is different and it is on record 218,819. This was identified using a PROC COMPARE between the 2 tables on the 9.1 system. The file from the 9.3 system was "zipped", copied to the 9.1 system, and "unzipped" there.
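
For readers who want to reproduce this kind of check outside SAS, here is a minimal Python sketch of locating the differing byte in two equal-length values. It is not what PROC COMPARE does internally; the function name `differing_offsets` and the sample values are hypothetical stand-ins for one character field from each system.

```python
def differing_offsets(a: bytes, b: bytes) -> list:
    """Return every offset at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical sample values, invented for illustration only.
old_value = b"Malm\xe5s"  # 9.1 side: contains the byte 0xE5
new_value = b"Malm?s"     # 9.3 side: the same position holds 0x3F

offsets = differing_offsets(old_value, new_value)
for i in offsets:
    print(f"offset {i}: {old_value[i]:02X} -> {new_value[i]:02X}")
```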

Both systems have the same encoding (wlatin1).

Is there any explanation for this?

Patrick · Posted 11-15-2014 12:56 AM

The first thing I would do is transfer the data once more (zip, copy, unzip) and test whether the issue remains. If it remains, then I would use a different compression mechanism and test again. Only then would I start to "doubt" SAS.

As an update:

Strongly agree that using cport/cimport is the way to go. If you just move your SAS files, they will remain 32-bit.

38339 - SAS® file compatibility when upgrading from 32-bit to 64-bit Microsoft Windows

44047 - Format catalogs must be converted when moving from a Microsoft Windows 32-bit operating syst...

Ksharp · Posted 11-15-2014 02:44 AM

I strongly recommend using PROC CPORT and PROC CIMPORT to transfer datasets between two different computers.

jakarman · Posted 11-15-2014 04:25 AM

I took out my old System/370 (1980s) EBCDIC reference summary, and then realized you are in the ASCII world.

Old ASCII is 7-bit, so only characters up to 7F are valid. The 3F is a "?", but what the ... is that E5?
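
The 7-bit limit can be checked quickly in Python. This is only a sketch using Python's own `ascii` codec; the variable names are illustrative.

```python
# 0x3F is inside the 7-bit ASCII range and decodes to a question mark.
question_mark = bytes([0x3F]).decode("ascii")
print(repr(question_mark))

# 0xE5 is above 0x7F, so a strict ASCII decode rejects it.
try:
    bytes([0xE5]).decode("ascii")
    e5_is_ascii = True
except UnicodeDecodeError:
    e5_is_ascii = False
print("0xE5 valid in ASCII:", e5_is_ascii)
```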

Even in the Latin-1 world, within that Single-Byte Character Set (SBCS), there are a lot of code pages, for example 437 (mainframe), 043 (Windows), and 850, which are all latin1 types.

To enjoy all the different latin1 encodings, see: SAS(R) 9.3 National Language Support (NLS): Reference Guide

To make it even more confusing, there are differences between the latin1 encodings on Windows and on Unix. They are not really big differences; maybe just one or two characters are different.
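
To see how much the code page matters for this one byte, the sketch below decodes 0xE5 under a few codecs that ship with Python. The codec names are Python's, not SAS encoding names, and the selection is only an illustration.

```python
import unicodedata

raw = bytes([0xE5])

# Decode the same byte under several single-byte code pages.
decoded = {name: raw.decode(name) for name in ("cp437", "cp850", "latin-1", "cp1252")}

for name, ch in decoded.items():
    # Print code point and Unicode name only, so the output stays plain ASCII.
    print(f"{name:8} 0xE5 -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The same stored byte yields a different character depending on which table is used to read it, which is why the encoding setting on both systems is worth checking.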

Check your encoding setting on the old and new systems; there must be a difference there (see: Code page - Wikipedia, the free encyclopedia). It could be something like US English versus Dutch, Norwegian, or Spanish. Some weird character that the encodings do not both cover was translated into the question mark.
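
The E5 to 3F pattern is what a lossy transcoding step typically produces. A minimal Python sketch, assuming the character is å and that the target encoding lacks it; ASCII is used here only as a stand-in target, and nothing in this thread confirms which conversion SAS actually performed:

```python
text = "\u00e5"  # the character å

# wlatin1 corresponds to Windows-1252, where this character is the single byte E5.
as_wlatin1 = text.encode("cp1252")

# A target without this character, with replacement enabled, substitutes "?".
as_ascii = text.encode("ascii", errors="replace")

print(as_wlatin1.hex(), as_ascii.hex())
```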

---->-- ja karman --<-----

OS2Rules · Posted 11-18-2014 08:16 AM

Jaap:

My old(er) System/370 Reference Summary, Fourth Edition (November 1976), also has no character for x'e5', but it shows as a lower case 'a' with an accent over it (umlaut?) on the screen. Everything was run on server systems, so there would be no EBCDIC to ASCII conversion or DBCS either.

I just think that it is unusual that only 1 byte would be changed in a dataset of over 90 MB.

jakarman · Posted 11-18-2014 02:23 PM

OK. Back in those old times I got traumatized by those tremas and the dead-key dilemma (diaeresis, umlaut).

- First, finding that the dollar-cent, not-sign, and vertical-bar characters existed in EBCDIC but not in ASCII.

- Then, handling all the spellings of country names: that was possible on the PC but not on the mainframe.

There was a hardware limitation: the 3270 terminal tube types were not able to enter those letters. It became possible with PC terminal emulators.

Without any input validation they came through, causing printing and connection-interface issues. That is how you notice them.

Then you find the one single person who uses such characters and sometimes enters them into the system.

Another event was an unplanned system-down, always some time after 4 o'clock. We found one man who was the sole person using a dedicated function.

There was a system error in that functionality: it freed the wrong memory, causing the complete system to go down. Nice: when he went home, everybody could go.

As for your argument that only 1 byte changed in over 90 MB: that has happened to me often before. Even issues with 1 failing bit in several gigabytes have happened; I remember the failing multi-volume bit continuation in SAS V6. Let us accept that it can happen, as it is not unusual.

What changed with the encoding between 9.1.3 and 9.3?

- You did not mention checking for something similar to the 437/850 differences (option one).

- With 9.1.3 the encoding of the OS (Windows/Unix/mainframe) is used; with 9.3 the encoding of the JVM is used.

- latin1 on Unix is different from the other latin1 ones (same name, different content).

What details do you have aside from that one byte?

ISO/IEC 8859-1 - Wikipedia, the free encyclopedia shows that the first 32 positions of both the lower and the upper half (00-1F and 80-9F) are not defined, and that 7F is left free.

The e5 character here is the å

The Windows-1252 latin1 code page has more characters. See: Windows 1252
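
Where the two tables disagree can be listed with a short Python sketch. It relies on Python's `latin-1` and `cp1252` codecs, which may not match every vendor's table byte for byte, so treat the result as indicative only.

```python
differences = []
for value in range(0x100):
    raw = bytes([value])
    latin1_char = raw.decode("latin-1")  # always succeeds: byte N maps to U+00NN
    try:
        cp1252_char = raw.decode("cp1252")
    except UnicodeDecodeError:
        cp1252_char = None  # Python's cp1252 codec leaves a few bytes unassigned
    if cp1252_char != latin1_char:
        differences.append(value)

print(f"{len(differences)} bytes differ, "
      f"from 0x{differences[0]:02X} to 0x{differences[-1]:02X}")
print("0xE5 differs:", 0xE5 in differences)
```

With these codecs the differing bytes all fall in the 80-9F block, so E5 decodes to the same character under both.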

---->-- ja karman --<-----
