Help using Base SAS procedures

proc import XLSX file: different results on Linux vs Windows

Reply
Occasional Contributor
Posts: 5

proc import XLSX file: different results on Linux vs Windows

When loading different sheets from an Excel-file (SAS 9.3.1M2; LIN X64) I stumbled upon a problem joining some of the tables based on a key. After some research it turned out that the key in table A had a hidden character in it after the import, but the key in table B did not. Strangely enough, when running the same program on a Windows installation (SAS 9.3.1M2; WIN X64_SRV12) this problem did not occur.  

 

The problem is obvious after using the attached Excel-file and running the following code.

proc import 
     datafile="&path./Country test.xlsx" 
     dbms=xlsx replace out=work.country_1
; 
     sheet='Country_1'; 
run; 

proc import 
     datafile="&path./Country test.xlsx" 
     dbms=xlsx replace out=work.country_2
; 
     sheet='Country_2'; 
run; 

proc sql;
	create table country as 
		select 
			c1.COUNTRY as COUNTRY_1,
			c2.COUNTRY as COUNTRY_2, 
			c1.POPULATION, 
			c2.AREA_KM2
		from work.country_1 as c1
			full join work.country_2 as c2
				on c2.COUNTRY = c1.COUNTRY
		order by c1.COUNTRY
	;
quit;

The dataset country will look as expected on the Windows installation:

 

Windows_result.PNG

 

 

 

 

 

On the Linux installation the keys do not match:

Linux_result.PNG

 

 

 

 

 

 

 

 

 

The problem appears to be the presence of the "country" consisting of three characters ("?US"). Because of this on Linux all other countries are padded with an invisible third character.

data work.country_1_check;
     set work.country_1; 
 
     length = length(COUNTRY);
     character = char(COUNTRY, 3);
     hex = put(character, $HEX32.); 
     binary = put(character, $BINARY32.);
run;

The results of this data step on Linux are strange:Linux_check.PNG

 

 

 

What causes this behaviour on Linux? Is anybody else experiencing the same thing when importing this Excel-file? Could it be the encoding on the two platforms? I noticed that the Windows installation uses wlatin1 and Linux has utf-8.

Trusted Advisor
Posts: 1,837

Re: proc import XLSX file: different results on Linux vs Windows

I have no answer to the question "what causes this behaviour".

 

You can omit the control characters using compress function with 3rd argument = c

like :   country1 = compress(country1,,'c');

 

BTWI noticed that reading an xslx file downloaded from windows 32 to linux 64,

in order to get the right date I have to subtract 60 years and 1 day, while using code like:

 

    libname myxl xlsx  '....xlsx';

   data two;

     set  myxl.sheet_name;

           date = intnx('year', date, -60) - 1;

   run;

 

Occasional Contributor
Posts: 5

Re: proc import XLSX file: different results on Linux vs Windows

Thanks for the tip of using compress with the 'c' modifier. I was not aware of this possibility.

 

Of course I would prefer to avoid this, as in reality I am reading 27 sheets from a single Excel-file. This is an input data model for a program that I am using. Worst case I can write a macro to got through all tables and compress all character variables.

 

I just think it is strange that the result of the same code is different on different installations of SAS.  

Super User
Posts: 13,583

Re: proc import XLSX file: different results on Linux vs Windows

I suspect that your last line: "I noticed that the Windows installation uses wlatin1 and Linux has utf-8."

has a bit as there is likely to be a layer of text conversion in one instance not in the other and I'm not going to open an unknown source Excel file to check.

 

Generally unless there is an overwhelming reason to try to read native Excel I save files to CSV so I control how they are read AND can see stupid things in the data easier.

 

And since Proc Import is always guessing about things then I wouldn't expect everything from two sheets in the same workbook to 'match'. Different lengths of character variables and different types depending on the values of the first rows are common issues.

 

 

Occasional Contributor
Posts: 5

Re: proc import XLSX file: different results on Linux vs Windows

[ Edited ]

I agree that reading in an Excel-file can lead to surprises, but I think that the guessing of proc import is not the problem here. In both cases the country column is correctly recognized as character. It is just treated differently and that is strange (bug?) in my opinion. Why would it leave control characters in one case, but not in the other?

Ask a Question
Discussion stats
  • 4 replies
  • 531 views
  • 5 likes
  • 3 in conversation