Reading Excel Sheets

Respected Advisor
Posts: 3,887

Hi all

Using SAS 9.3 on a Windows server with SAS/ACCESS to PC Files, I had to deal some time ago with some very messy Excel workbooks.

What I had to do was extract specific rows from specific sheets. The challenge was that I got workbooks from different departments which should all have had the same structure - but of course they were not EXACTLY the same. Some had titles in the first row, some started the data in the second column and not the first... and so on.

On top of that, I had to implement this as a regular monthly process (not a once-off task). There were some "tokens" in the Excel sheets - like a specific word - which were good enough for me to determine the starting row and column for my data extract.

The main issue I was facing: depending on what the Excel sheet looked like, the same column ended up as a character variable when read from one sheet and a numeric variable when read from another.

I also had to deal with different layouts (several layouts, and for each layout several Excel workbooks with several sheets), and I had to take a "trial and error" approach to find a way to import the data from each of them into SAS. So sometimes a libname with the "excel" engine worked, sometimes it was a Proc Import.

Now I'm starting phase 2 of this project and I will get additional Excel workbooks and layouts...

What I would like is a "one size fits all" approach. Best would be if I could read the Excel sheets into SAS with all columns as character 255, treating the first line of the Excel sheet as data already. That way I would just have to deal with a SAS table where all variables are character (with names like A B C ...), which would allow me to write code that looks for the tokens in a consistent way. I couldn't find a way of reading an Excel sheet into SAS like this. Any suggestions? Did I miss something?

I was also thinking about using some VB script to convert the Excel sheets to csv files - BUT: there is no MS Office installed on the server, and I understand that without the Excel libraries I won't be able to get such a script working.

Up to a certain point I could get additional open-source software installed on the server where I'm running SAS. Python could be an option if it comes with its own libraries for converting Excel files (haven't checked yet). On the other hand, I'm not sure this is really a good idea, as I fear such an approach could cause issues when the customer upgrades MS Excel.

And another argument against using Python or the like: it complicates the overall application maintenance, as it requires additional skills. So if possible I would like to implement this using "pure SAS".

Any thoughts, suggestions, ideas highly appreciated.

Attached is a sample Excel workbook (.xls). I know some of you don't trust Excel attachments - but I can't avoid it in this case. The workbook contains 3 sheets. I would like a way to read these 3 sheets into 3 SAS tables so that all 3 tables have exactly the same structure (3 character variables with identical attributes).

Actually: It should be a .xlsx but this site doesn't allow xlsx to be added as attachments.

Thanks

Patrick


Accepted Solutions
Solution
10-02-2012 09:04 AM
Super Contributor
Posts: 644

Re: Reading Excel Sheets

The simple answer is to reject all data that does not comply with the agreed standard.  And yes, I have been in a big organisation where this worked on mission-critical data.  To be blunt, this is the most risk-averse approach.  Even sheets which contained extra blank columns within the range were rejected.

However, if senior management will not back you on this, here are some ideas to work on.  I don't have access to SAS software right now so I can't test code, but this is what I would do:

  1. Use Proc Import without the option to get column names from the sheet.  This will force SAS to treat the first row as data.  Columns will be given generic names (F1, F2, F3 from memory) but that's all right, you can rename them later.  If the cells in the first row contain alpha data (titles or names) this should force SAS to import the data in those columns as character.
  2. Work on the metadata: examine the dictionary data in Proc SQL to determine whether any columns which should have contained numeric data are shown as character data.  (Use the result to set a macro variable as a flag indicating data conversion)
  3. Write a macro to deal with different cases you encounter, based on a flag or set of flags from SQL:
  • If the column(s) you select to test are numeric as they should be (Sheet 1 in your example), use Proc Datasets to rename the table and the columns as required.  Or create a new table from the import and rename the columns as you go.
  • Otherwise assume the first row contains a title or column names and this row can be skipped.  Create a new table, use the data from the existing table with (firstobs=2) and use the input function for each row to manage type conversions.  Either in a data step or in Proc SQL.  Remember to specify formats and labels if required at this stage.
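Putting steps 1-3 together, a sketch of what this could look like (untested; the file path, sheet name, and final variable names region/amount/remark are placeholders, and F2 stands in for whichever column you choose to test):

```sas
/* Step 1: import with GETNAMES=NO so the first row is read as data.
   File, sheet, and variable names below are placeholders.          */
proc import datafile="c:\temp\sample.xls"
            out=work.raw
            dbms=excel replace;
    sheet="Sheet1";
    getnames=no;        /* variables come out as F1, F2, F3, ... */
run;

/* Step 2: set a macro-variable flag from the dictionary metadata:
   1 if the test column imported as character, 0 if numeric.        */
proc sql noprint;
    select (type='char') into :needconv trimmed
    from dictionary.columns
    where libname='WORK' and memname='RAW' and upcase(name)='F2';
quit;

/* Step 3: branch on the flag.                                      */
%macro fixup;
    %if &needconv = 0 %then %do;
        /* Types are fine: just rename table and columns. */
        proc datasets lib=work nolist;
            change raw=final;
            modify final;
                rename f1=region f2=amount f3=remark;
        quit;
    %end;
    %else %do;
        /* First row is a heading: skip it and convert with INPUT().
           The ?? modifier suppresses invalid-data notes.           */
        data work.final;
            set work.raw(firstobs=2);
            amount = input(f2, ?? best32.);
            rename f1=region f3=remark;
            drop f2;
        run;
    %end;
%mend fixup;
%fixup
```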

I'm not sure how SAS would import the null values in the case where data is offset by an empty column A in the spreadsheet (data starting in the second column), but I suspect the resulting column would be imported as numeric.  You might need to identify two adjacent numeric columns in the expected layout and test the second: if inserted headings have forced the data to be read as character and the data is displaced, you will instead end up reading the first column of the pair, but the result should be the same.

Finally, if you have specific values ("tokens") you can identify, these can be used in a pass through the data to validate the process.
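For the token pass, something along these lines could work: a single scan over the all-character import, looking for a token word (the word "TOTAL" and the table name WORK.RAW are made up here) and recording the row where it first appears.

```sas
%let startrow=0;   /* default if the token is never found */

/* Scan every character variable of every row for the token. */
data _null_;
    set work.raw;
    array c{*} $ _character_;
    do i = 1 to dim(c);
        if upcase(strip(c{i})) = 'TOTAL' then do;
            call symputx('startrow', _n_);  /* remember the row number */
            stop;
        end;
    end;
run;

%put Token found in row &startrow;
```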

But it is still better to enforce consistency at source.



All Replies

Respected Advisor
Posts: 3,887

Re: Reading Excel Sheets

Just to give a very late reply:

I ended up implementing your first approach, using Proc Import to read the first row in as data. This worked beautifully: all the columns had a column heading, so all of them ended up as character. I then simply wrote these variables to a text file which I used in the next step as input for my processing (hoping that in the future the Excels get replaced by real applications storing data in databases from which I can get extracts).
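For reference, writing all the (character) variables of such an import out to a delimited text file fits in a single data step - a sketch with a placeholder path and table name:

```sas
/* WORK.RAW stands for the all-character table from Proc Import. */
data _null_;
    set work.raw;
    file "c:\data\extract.txt" dlm='09'x;  /* tab-delimited output */
    put (_all_) (+0);                      /* write every variable, delimited */
run;
```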

Luckily the Excel sheet without column headings got replaced, so I didn't have to deal with that challenge. I still wouldn't know how to force SAS to always read all columns as character no matter what - but as said: I didn't have to solve this problem in the end, so I was very happy with your first suggestion.
