04-08-2014 04:31 PM
I get all the fun! :smileyconfused: I have a task that is conceptually fairly easy but difficult in practice. I have thousands of Excel files that consist of many tabs or worksheets. The files were created as forms for customers to fill out and return, so there are locked fields, irregular data locations, and multiple rows serving as column headers that don't necessarily start at row 1. There are a lot of legacy forms from before someone finally decided that "hey, maybe it would be a good idea to standardize". After that point, the files carry a revision control number (although so far I have identified at least 20 different revisions). There are way too many files to process manually, and I originally started with Python because I don't see an easy way to do this in SAS. The process in Python is pretty straightforward, but.... (and there is always a but!) the Python package uses the Open Office API to open the Excel files, and it throws an error when fields or tabs are protected or locked. So it looks like Python isn't going to work for the vast majority of the files.
Here is a subset of my Python code:
import os
import time
import xlrd

for f in filelist:
    created = time.ctime(os.path.getctime(f))    # creation time
    modified = time.ctime(os.path.getmtime(f))   # modification time
    book = xlrd.open_workbook(f)
    sheets = book.sheet_names()
    if '1-Summary' in sheets:                    # membership test, not ==
        sheet = book.sheet_by_name('1-Summary')
        cellC1 = str(sheet.cell_value(0, 2))[:14]
    else:
        sheet = book.sheet_by_index(0)
        cellC1 = 'Not valid Rev doc'
    newline = f + ',' + sheet.name + ',' + created + ',' + modified + ',' + cellC1 + '\n'
This code loops through all the Excel files found in the subdirectories, opens each one, checks for a sheet labelled '1-Summary', and then grabs a text string from a specific cell that should contain the revision control number. It produces two CSV files: VerList, which lists the readable files along with the version control field, and an exception list of the files that would not open.
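To build that exception list without a protected file killing the whole run, the open call can be wrapped in try/except. A minimal sketch of that loop — the opener is passed in as a parameter (e.g. xlrd.open_workbook), and the helper name scan_workbooks is mine, not from the original code:

```python
def scan_workbooks(filelist, open_workbook):
    """Split files into readable (path, revision) rows and unreadable paths.

    open_workbook is injected (e.g. xlrd.open_workbook) so that failures on
    locked or protected files are caught here instead of stopping the loop.
    """
    readable, errors = [], []
    for f in filelist:
        try:
            book = open_workbook(f)
        except Exception as exc:          # protected/locked files land here
            errors.append((f, str(exc)))
            continue
        if '1-Summary' in book.sheet_names():
            rev = str(book.sheet_by_name('1-Summary').cell_value(0, 2))[:14]
        else:
            rev = 'Not valid Rev doc'
        readable.append((f, rev))
    return readable, errors
```

The two lists can then be written out with the csv module as VerList and the exception file, and the created/modified timestamps from the original loop added back via os.path.getctime and os.path.getmtime.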
Once that list exists, I can select all the files of a specific version. The data locations on the various tabs are consistent for any given version. I have found that I can open the Excel files that Python will not, so it should work. But after numerous Google searches, I have yet to see a clean way to determine what sheets exist and then, if a specific sheet exists, import it based on specific row/column limits.
If anyone knows of a white paper or posting along those lines, it would be greatly appreciated. Oh, and this could be Base SAS or EG, but I do have to loop across many thousands of files.
04-08-2014 05:29 PM
Been There, Done That
I learned a lot in the process...
What I ended up doing, not quite a decade ago, was using DDE to open and read the files, looking for specific keywords.
Here's what I would do now, though perhaps there was some reason I couldn't do that then...
1. Find what sheets exist: use a LIBNAME statement to connect to the Excel file and then use PROC CONTENTS or PROC DATASETS to see the sheet names. You can store this info and then filter through it.
libname sample pcfiles path='C:\Temp\Sample_v2.xlsx'; *your statement will depend on your OS/Excel version;
proc contents data=sample._all_;
run;
2. Write the full info from each sheet to a CSV and have Python sort it out; this is especially useful if you already have the Python code available.
3. Write your SAS code to parse the text as desired.
4. You can select a specific range or entire worksheet as follows (see the last post).
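Combining points 2 and 4: once a sheet has been dumped to CSV, the row/column windowing can be done in plain Python. A minimal sketch, assuming 0-based limits known per form revision (read_block is a hypothetical helper name, not an existing API):

```python
import csv
import io

def read_block(csv_text, first_row, last_row, first_col, last_col):
    """Return the rectangular block of cells bounded by the given 0-based,
    inclusive row/column limits; short rows are padded with '' so every
    returned row has the same width."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    block = []
    for r in rows[first_row:last_row + 1]:
        padded = r + [''] * (last_col + 1 - len(r))
        block.append(padded[first_col:last_col + 1])
    return block
```

The per-revision limits would come from the VerList classification step: once you know a file is, say, revision 12, you know which rows and columns hold its data.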
04-08-2014 07:04 PM
My introduction to SAS Macro programming was a loosely similar project that involved reading the PRN files created by Lotus 123. Even when the fields were in the same order, individual preference settings by a user would cause variables to take different numbers of spaces, which meant the line that a record's variables landed on varied from file to file. The PRN format is a "print" format and would display the first x fields for records 1 through 45 or so. So the first record would be on line 1, continued on lines 46, 91, 136, and so on. The total number of lines would vary depending on the individual field widths. Also, of the roughly 300 fields, typically two thirds were blank, so there was a lot of trial and error on variable widths.
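That interleaving (record i on lines i, i+45, i+90, ...) can be undone with list slicing; a minimal sketch, assuming every "page" block holds the same number of records (the function name and the join-with-a-space choice are mine):

```python
def reassemble(lines, records_per_block):
    """Re-join PRN 'print' output where record i occupies lines
    i, i + n, i + 2n, ... for a block size of n records."""
    n = records_per_block
    return [' '.join(lines[i::n]) for i in range(n)]
```

In the project described above the block size itself varied with field widths, so in practice n would have to be detected per file rather than hard-coded.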
The workflow ended up as: 1) identify the "most similar" files and work on them first, since the changes would be "minimal"; 2) create a read program for one file until the results looked sensible; 3) identify which other files that program would work on without modification; 4) modify it to read one of the files in the group that the original program wouldn't quite handle; 5) repeat 3 and 4 until the group was done; 6) start on another group.
I have disliked spreadsheet programs ever since.