BookmarkSubscribeRSS Feed
NicoM
Obsidian | Level 7

When loading different sheets from an Excel-file (SAS 9.3.1M2; LIN X64) I stumbled upon a problem joining some of the tables based on a key. After some research it turned out that the key in table A had a hidden character in it after the import, but the key in table B did not. Strangely enough, when running the same program on a Windows installation (SAS 9.3.1M2; WIN X64_SRV12) this problem did not occur.  

 

The problem is obvious after using the attached Excel-file and running the following code.

proc import 
     datafile="&path./Country test.xlsx" 
     dbms=xlsx replace out=work.country_1
; 
     sheet='Country_1'; 
run; 

proc import 
     datafile="&path./Country test.xlsx" 
     dbms=xlsx replace out=work.country_2
; 
     sheet='Country_2'; 
run; 

proc sql;
	create table country as 
		select 
			c1.COUNTRY as COUNTRY_1,
			c2.COUNTRY as COUNTRY_2, 
			c1.POPULATION, 
			c2.AREA_KM2
		from work.country_1 as c1
			full join work.country_2 as c2
				on c2.COUNTRY = c1.COUNTRY
		order by c1.COUNTRY
	;
quit;

The dataset country will look as expected on the Windows installation:

 

Windows_result.PNG

 

 

 

 

 

On the Linux installation the keys do not match:

Linux_result.PNG

 

 

 

 

 

 

 

 

 

The problem appears to be the presence of the "country" consisting of three characters ("?US"). Because of this on Linux all other countries are padded with an invisible third character.

data work.country_1_check;
     set work.country_1; 
 
     length = length(COUNTRY);
     character = char(COUNTRY, 3);
     hex = put(character, $HEX32.); 
     binary = put(character, $BINARY32.);
run;

The results of this data step on Linux are strange:Linux_check.PNG

 

 

 

What causes this behaviour on Linux? Is anybody else experiencing the same thing when importing this Excel-file? Could it be the encoding on the two platforms? I noticed that the Windows installation uses wlatin1 and Linux has utf-8.

4 REPLIES 4
Shmuel
Garnet | Level 18

I have no answer to the question "what causes this behaviour".

 

You can omit the control characters using compress function with 3rd argument = c

like :   country1 = compress(country1,,'c');

 

BTWI noticed that reading an xslx file downloaded from windows 32 to linux 64,

in order to get the right date I have to subtract 60 years and 1 day, while using code like:

 

    libname myxl xlsx  '....xlsx';

   data two;

     set  myxl.sheet_name;

           date = intnx('year', date, -60) - 1;

   run;

 

NicoM
Obsidian | Level 7

Thanks for the tip of using compress with the 'c' modifier. I was not aware of this possibility.

 

Of course I would prefer to avoid this, as in reality I am reading 27 sheets from a single Excel-file. This is an input data model for a program that I am using. Worst case I can write a macro to got through all tables and compress all character variables.

 

I just think it is strange that the result of the same code is different on different installations of SAS.  

ballardw
Super User

I suspect that your last line: "I noticed that the Windows installation uses wlatin1 and Linux has utf-8."

has a bit as there is likely to be a layer of text conversion in one instance not in the other and I'm not going to open an unknown source Excel file to check.

 

Generally unless there is an overwhelming reason to try to read native Excel I save files to CSV so I control how they are read AND can see stupid things in the data easier.

 

And since Proc Import is always guessing about things then I wouldn't expect everything from two sheets in the same workbook to 'match'. Different lengths of character variables and different types depending on the values of the first rows are common issues.

 

 

NicoM
Obsidian | Level 7

I agree that reading in an Excel-file can lead to surprises, but I think that the guessing of proc import is not the problem here. In both cases the country column is correctly recognized as character. It is just treated differently and that is strange (bug?) in my opinion. Why would it leave control characters in one case, but not in the other?

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 2362 views
  • 5 likes
  • 3 in conversation