DATA Step, Macro, Functions and more

Microsoft Excel versions

Reply
Contributor
Posts: 41

Microsoft Excel versions

Hi,

 

I have Excel files, with same column names, same format and same number of rows.

One is a"Microsoft Excel file", the other a "Microsoft Excel 97-2003 file".

 

I am doing a simple proc import like this:

 

proc import datafile = "&inputdir./&input_file."
DBMS = xls OUT = &output_table. REPLACE;

data &output_table.;
	set &output_table.;
	FILENAME = "&input_file.";
run;

When importing the second one 97-2003 file, it's working, but when I do it with the first one, I have an error:

ERROR: Import unsuccessful.  See SAS Log for details.

Do you know what I should do? Can you tell me how I can write my code to avoid SAS being dependant on the Excel version?

 

Thanks.

Super User
Super User
Posts: 9,227

Re: Microsoft Excel versions

What does the log say?

I suspect its the 

DBMS = xls

See this page for compatability:

http://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003102702.htm

 

The way to make your code work fine each and every time is simple, drop the Excel file format totally and use plain portable text files such as CSV or XML, have the file format documented in a data transfer agreement which is signed off and adhered to by both parties.

Otherwise you will always have some issues.

Contributor
Posts: 41

Re: Microsoft Excel versions

Thanks for the answer,

 

Nothing else than what I shared in my previous post in the log.

 

 

I modified what you suggesteed in 

DBMS = xlsx

 

The import is working but then some column (not all) names are modified into "VAR1", "VAR2", ...

 

So my program is no longer working...

 

Thanks for the link, I will look at it later. 

Super User
Super User
Posts: 9,227

Re: Microsoft Excel versions

Nowhere did I mention changing it to XLSX.  Basically you have to understand that over the years MS have had several versions of their file format.  The older binary ones - XLS - are particularly bad and cumbersome being totally obfuscated, and proprietary has meant they are the least well known file formats, and the bane of any technical computer user (in fact the sooner these proprietary binary files are consigned to the past the better).  Now MS has made some steps towards being more open with it latest file format suite called Open Office (XLSX, DOCX etc.).  Basically these are zip files which contain a load of XML files.  You can see this by renaming an XLSX file to ZIP, and then double clicking on it.  These are a lot simpler to access and manipulate for third parties, although even then they do fluctuate a fair bit.  

 

So with the above in mind, avoid using old binary file formats as much as possible, and where absolutely necessary treat it as a migration from an old system to a new one and charge appropriately.  What reads one type of file may not read another, and even within the same Excel file you can get several variations - as Excel is really bad data format and proc import is guessing procedure.

 

A simpler approach to this issue may be to use VBA within Excel.  A simple VBA macro can open each of the given files you have and then save them as XSLX or CSV.  Then your SAS code can be standardised to read in this output.

 

Plenty of examples out there for the VBA:

https://www.mrexcel.com/forum/excel-questions/516366-saving-xlsm-file-xlsx-using-vba.html

Community Manager
Posts: 3,364

Re: Microsoft Excel versions

Just a short primer on the DBMS types:

  • DBMS=XLS - SAS reads the file directly, using a really old scheme of Microsoft Excel files (XLS).  Some later XLS files might have richer information than this scheme can capture
  • DBMS = EXCEL | EXCELCS -- uses Microsoft data APIs to extract info from the XLS file, so the results are more reliable. But, requires use of SAS on Windows or the PC Files Server as an external process to use those APIs
  • DBMS = XLSX - the most modern approach, reads XLSX files that have been in place since Office 2007.  Reads the files directly (not using external APIs).  Definitely the best method if the file is an XLSX file.

Each of these have different available suboptions that make sense for the file scheme.  In summary, not all Excel files are the same, and the import process requires some foreknowledge of the source file that you need to read.

 

Super User
Posts: 13,084

Re: Microsoft Excel versions


FP12 wrote:

Thanks for the answer,

 

Nothing else than what I shared in my previous post in the log.

 

 

I modified what you suggesteed in 

DBMS = xlsx

 

The import is working but then some column (not all) names are modified into "VAR1", "VAR2", ...

 

So my program is no longer working...

 

Thanks for the link, I will look at it later. 


There are a number of possible causes for the "var1", "var2" variable name issues. One is what I call "phantom data". Excel will believe that there is a populated column if the file has ever had data in a cell in a column even if all the values are deleted. Proc import uses an engine that uses information provided by Excel and so you may get a variable for a column with no actual data. If you run proc freq on Var1 and all values are missing then this is a very likely occurrence. These are usually the right-most columns in the data. Note that you have also have entire rows of data with all values missing for the same reason.

 

Another cause is that the cell in the first row is blank. Sometimes you  may not notice this as Excel will flow text over from a cell to the left. Similar is having column headings that occupy two cells vertically for some rows of data. Though this often means that your data types are screwy as the second row of "header" becomes the first row of "data".

 

Yet another possibility is header text that is longer than 32 characters (the maximum that SAS will use for variable names) and exactly matches a column to the left. Proc Import will attempt to create unique variable names from column headers with identical text by appending 1, 2, 3 as needed for shorter text but when there are longer headings the algorithm doesn't really have a good handle and will default to "varX" names.

 

Please note that Proc Import is a guessing procedure and results may be quite odd with spreadsheets (Excel) likely to yield the oddest results. If you rely on Proc Import for multiple Excel files of supposedly the same format probability approaches unity that you will have inconsistencies in the resulting SAS data sets for either variable type (numeric/character/date valued) and/or length of variables and possibly variable names.

With @RW9's suggestion you can have more control.

Super User
Posts: 9,611

Re: Microsoft Excel versions

The only sustainable solution to deal with Microsoft file formats is to not use them. They change much too often to be considered stable.

A csv is a csv is a csv, and has been so since I started working in this business 30+ years ago. Once you have a data step to read from a csv, it will work until the logical(!) structure changes, and if only new columns are added, even that won't be an issue (they're just not read).

 

To work with Excel files, use waldos behind a thick layer of bullet-proof glass.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Ask a Question
Discussion stats
  • 6 replies
  • 165 views
  • 0 likes
  • 5 in conversation