08-16-2017 04:53 PM
When loading different sheets from an Excel-file (SAS 9.3.1M2; LIN X64) I stumbled upon a problem joining some of the tables based on a key. After some research it turned out that the key in table A had a hidden character in it after the import, but the key in table B did not. Strangely enough, when running the same program on a Windows installation (SAS 9.3.1M2; WIN X64_SRV12) this problem did not occur.
The problem is obvious after using the attached Excel-file and running the following code.
proc import datafile="&path./Country test.xlsx" dbms=xlsx replace out=work.country_1 ; sheet='Country_1'; run; proc import datafile="&path./Country test.xlsx" dbms=xlsx replace out=work.country_2 ; sheet='Country_2'; run; proc sql; create table country as select c1.COUNTRY as COUNTRY_1, c2.COUNTRY as COUNTRY_2, c1.POPULATION, c2.AREA_KM2 from work.country_1 as c1 full join work.country_2 as c2 on c2.COUNTRY = c1.COUNTRY order by c1.COUNTRY ; quit;
The dataset country will look as expected on the Windows installation:
On the Linux installation the keys do not match:
The problem appears to be the presence of the "country" consisting of three characters ("?US"). Because of this on Linux all other countries are padded with an invisible third character.
data work.country_1_check; set work.country_1; length = length(COUNTRY); character = char(COUNTRY, 3); hex = put(character, $HEX32.); binary = put(character, $BINARY32.); run;
The results of this data step on Linux are strange:
What causes this behaviour on Linux? Is anybody else experiencing the same thing when importing this Excel-file? Could it be the encoding on the two platforms? I noticed that the Windows installation uses wlatin1 and Linux has utf-8.
08-16-2017 05:13 PM
I have no answer to the question "what causes this behaviour".
You can omit the control characters using compress function with 3rd argument = c
like : country1 = compress(country1,,'c');
BTW, I noticed that reading an xslx file downloaded from windows 32 to linux 64,
in order to get the right date I have to subtract 60 years and 1 day, while using code like:
libname myxl xlsx '....xlsx';
date = intnx('year', date, -60) - 1;
08-18-2017 04:10 AM
Thanks for the tip of using compress with the 'c' modifier. I was not aware of this possibility.
Of course I would prefer to avoid this, as in reality I am reading 27 sheets from a single Excel-file. This is an input data model for a program that I am using. Worst case I can write a macro to got through all tables and compress all character variables.
I just think it is strange that the result of the same code is different on different installations of SAS.
08-16-2017 06:16 PM
I suspect that your last line: "I noticed that the Windows installation uses wlatin1 and Linux has utf-8."
has a bit as there is likely to be a layer of text conversion in one instance not in the other and I'm not going to open an unknown source Excel file to check.
Generally unless there is an overwhelming reason to try to read native Excel I save files to CSV so I control how they are read AND can see stupid things in the data easier.
And since Proc Import is always guessing about things then I wouldn't expect everything from two sheets in the same workbook to 'match'. Different lengths of character variables and different types depending on the values of the first rows are common issues.
08-18-2017 04:17 AM - edited 08-18-2017 01:37 PM
I agree that reading in an Excel-file can lead to surprises, but I think that the guessing of proc import is not the problem here. In both cases the country column is correctly recognized as character. It is just treated differently and that is strange (bug?) in my opinion. Why would it leave control characters in one case, but not in the other?